Python and Big data - An Introduction to Spark (PySpark)
Hitesh Dharmdasani
About me
• Security Researcher, Malware Reverse Engineer, Developer
• GIT > GMU > Berkeley > FireEye > On Stage
• Bootstrapping a few ideas
• Hiring!
[Diagram: Information Security, Big Data, Machine Learning, Me]
What will we talk about?
• What is Spark?
• How does Spark do things?
• PySpark and data processing primitives
• Example Demo - Playing with Network Logs
• Streaming and Machine Learning in Spark
• When to use Spark
http://bit.do/PyBelgaumSpark
http://tinyurl.com/PyBelgaumSpark
What will we NOT talk about?
• Writing production-level jobs
• Fine-tuning Spark
• Integrating Spark with Kafka and the like
• Nooks and crannies of Spark
• But glad to talk about it offline
The Common Scenario
Some Data (NTFS, NFS, HDFS, Amazon S3 …)
Python
Process 1 Process 2 Process 3 Process 4 Process 5 …
You write one job, then chunk, cut, slice, and dice the data across processes yourself
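The scenario above can be sketched in plain Python, assuming the standard-library multiprocessing module is the tool at hand. The job (counting lines that contain "ERROR"), the helper names, and the default of four processes are all hypothetical and chosen only for illustration; the point is how much chunking and combining code surrounds the one job.

```python
from multiprocessing import Pool


def chunk(items, n_chunks):
    """Split a list into n_chunks nearly equal contiguous slices."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional element each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def job(lines):
    """The 'one job' (hypothetical): count lines mentioning ERROR in one chunk."""
    return sum(1 for line in lines if "ERROR" in line)


def run(lines, n_procs=4):
    """Fan the chunks out to worker processes, then combine the partial counts."""
    with Pool(n_procs) as pool:
        return sum(pool.map(job, chunk(lines, n_procs)))
```

Calling run(lines) from a script should happen under an `if __name__ == "__main__":` guard, since some platforms start worker processes by re-importing the module.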
Compute where the data is
• Paradigm shift in computing
• Don't load all the data into one place and then operate on it
• State your operations and send the code to the machines that hold the data
• Sending code to the machine >>> getting data over the network
MapReduce
public static class MyFirstMapper {
    public void map { . . . }
}
public static class MyFirstReducer {
    public void reduce { . . . }
}
public static class MySecondMapper {
    public void map { . . . }
}
public static class MySecondReducer {
    public void reduce { . . . }
}
Job job = new Job(conf, "First");
job.setMapperClass(MyFirstMapper.class);
job.setReducerClass(MyFirstReducer.class);
/* Job 1 goes to disk */
// waitForCompletion submits the job and returns true on success
if (job.waitForCompletion(true)) {
    Job job2 = new Job(conf, "Second");
    job2.setMapperClass(MySecondMapper.class);
    job2.setReducerClass(MySecondReducer.class);
}
This also looks ugly if you ask me!
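For contrast, here is a rough sketch of how two chained map/reduce stages might look in PySpark. The slide does not say what the two jobs compute, so both stage functions are hypothetical: stage one counts hits per source IP in log lines, stage two counts how many IPs share each hit count. The function assumes an existing SparkContext is passed in as `sc`.

```python
from operator import add


def first_map(line):
    """Stage 1 map (hypothetical): emit (source_ip, 1) for a log line."""
    return (line.split()[0], 1)


def second_map(pair):
    """Stage 2 map (hypothetical): re-key an (ip, hits) pair by its hit count."""
    _ip, hits = pair
    return (hits, 1)


def two_stage_histogram(sc, lines):
    """Chain both map/reduce stages on an existing SparkContext `sc`."""
    return (
        sc.parallelize(lines)
        .map(first_map)
        .reduceByKey(add)   # stage 1 reduce: hits per IP
        .map(second_map)
        .reduceByKey(add)   # stage 2 reduce: number of IPs per hit count
        .collect()
    )
```

Both stages live in one expression, so there is no second job object to configure and no explicit success check between them.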
What is Spark?
• Open-source, lightning-fast cluster computing
• Focus on speed and scale
• Developed at AMPLab, UC Berkeley, by Matei Zaharia
• Most active Apache project in 2014 (even more than Hadoop)
• Recently beat MapReduce at sorting 100 TB of data: 3x faster while using 10x fewer machines
What is Spark?
Spark
Some Data (NTFS, NFS, HDFS, Amazon S3 …)
Java Python Scala
MLlib Streaming ETL SQL … GraphX
What is Spark?
Spark
Some Data (NTFS, NFS, HDFS, Amazon S3 …)
• Inherently distributed
• Computation happens where the data resides
What is different from MapReduce?
• Uses main memory for caching
• Dataset is partitioned and stored in RAM/disk for iterative queries
• Large speedups for iterative operations when in-memory caching is used
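A minimal sketch of the iterative-query pattern described above, assuming `rdd` is a PySpark RDD of numbers. The function name and the threshold queries are hypothetical; `cache()`, `filter()`, and `count()` are standard RDD methods.

```python
def count_above_each(rdd, thresholds):
    """Run one filter-and-count query per threshold over a cached RDD."""
    # Mark the RDD for in-memory storage; it is materialized by the first action
    # and reused by the later ones instead of being recomputed from the source.
    rdd.cache()
    counts = {}
    for t in thresholds:
        # Bind t as a default argument so each lambda keeps its own threshold.
        counts[t] = rdd.filter(lambda x, t=t: x > t).count()
    return counts
```

Without the `cache()` call, each pass through the loop would rebuild the dataset from its source before filtering it.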
Spark Internals
The Init
• Creating a SparkContext
• It is Spark's gateway to the cluster
• In interactive mode, a SparkContext is created as 'sc'
$ pyspark
...
...
SparkContext available as sc.
>>> sc
<pyspark.context.SparkContext at 0xdeadbeef>
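Outside the interactive shell no `sc` is created automatically, so a standalone script has to build its own. A minimal sketch, assuming PySpark is installed; the application name and the "local[*]" master are illustrative defaults, not values from the talk.

```python
def make_context(app_name="PyBelgaumDemo", master="local[*]"):
    """Create a SparkContext for a standalone script.

    app_name and master are hypothetical defaults chosen for illustration.
    """
    # Imported here so that defining the function does not require PySpark.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name).setMaster(master)
    return SparkContext(conf=conf)
```

A script would typically call `sc = make_context()` once at startup and `sc.stop()` when it is done.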
Spark Internals
The Key Idea



Resilient Distributed Datasets
• Basic unit of abstraction of data
• Immutable
• Persistence
>>> data = [90, 14, 20, 86, 43, 55, 30, 94]
>>> distData = sc.parallelize(data)
>>> distData
ParallelCollectionRDD[13] at parallelize at PythonRDD.scala:364
Spark Internals
Operations on RDDs - Transformations & Actions
Spark Internals
[Diagram: File/Collection, SparkContext, RDD, Transformations]
Spark Internals
Lazy Evaluation
Now what?
Spark Internals
[Diagram: File/Collection, SparkContext, RDD, Transformations, Actions]
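Lazy evaluation can be illustrated with a plain-Python analogy (this is not Spark code): the built-in `map` only describes the work, much as an RDD transformation does, and nothing runs until something consumes the result, which plays the role of an action. The `calls` list and the variable names are hypothetical helpers for observing when the function actually executes.

```python
calls = []  # records every element the function actually touches


def increment(x):
    calls.append(x)
    return x + 1


data = [90, 14, 20, 86]

pipeline = map(increment, data)  # "transformation": only describes the work
calls_before = len(calls)        # nothing has been computed yet

result = list(pipeline)          # "action": forces the computation
calls_after = len(calls)         # one call per element has now happened

print(calls_before, calls_after, result)
```

In PySpark the same split applies: `rdd.map(f)` returns immediately, and an action such as `collect()` triggers the actual computation.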
Spark Internals
Transformation Operations on RDDs
Map
def mapFunc(x):
    return x + 1
rdd_2 = rdd_1.map(mapFunc)
Filter
def filterFunc(x):
    # keep only even numbers
    return x % 2 == 0
rdd_2 = rdd_1.filter(filterFunc)
Spark Internals
Transformation Operations on RDDs
• map
• filter
• flatMap
• mapPartitions
• mapPartitionsWithIndex
• sample
• union
• intersection
• distinct
• groupByKey
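A few of the transformations listed above, sketched on tiny made-up inputs. The function assumes an existing SparkContext is passed in as `sc`; the sample data and the helper names `split_words` and `by_parity` are hypothetical.

```python
def split_words(line):
    """Used with flatMap: one input line yields several output words."""
    return line.split()


def by_parity(x):
    """Used before groupByKey: key each number by x % 2."""
    return (x % 2, x)


def demo_transformations(sc):
    """Apply several transformations, then collect the results for inspection."""
    a = sc.parallelize([1, 2, 2, 3])
    b = sc.parallelize([3, 4])
    lines = sc.parallelize(["a b", "c"])
    grouped = a.map(by_parity).groupByKey().collect()
    return {
        "flatMap": lines.flatMap(split_words).collect(),
        "distinct": sorted(a.distinct().collect()),
        "union": a.union(b).collect(),
        "intersection": a.intersection(b).collect(),
        # groupByKey yields (key, iterable) pairs; turn each iterable into a list
        "groupByKey": sorted((k, sorted(v)) for k, v in grouped),
    }
```

Every call except `collect()` only builds up the description of the work; the `collect()` calls are what make Spark run it.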
Spark Internals
>>> increment_rdd = distData.map(mapFunc)
>>> increment_rdd.collect()
[91, 15, 21, 87, 44, 56, 31, 95]
>>> increment_rdd.filter(filterFunc).collect()
[44, 56]
OR
>>> distData.map(mapFunc).filter(filterFunc).collect()
[44, 56]
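`collect()` is the only action shown in the session above. A small sketch of a few other common actions, assuming `rdd` is a non-empty PySpark RDD of numbers; the function name `summarize` is hypothetical, while `count`, `first`, `take`, and `reduce` are standard RDD methods.

```python
from operator import add


def summarize(rdd):
    """Run a few common actions on an RDD of numbers and return the results."""
    return {
        "count": rdd.count(),    # number of elements
        "first": rdd.first(),    # first element
        "take3": rdd.take(3),    # up to three elements, as a list
        "sum": rdd.reduce(add),  # combine all elements with +
    }
```

Each of these triggers a computation of its own, so caching the RDD first can help when several actions are run over the same data.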
Spark Internals
Fault Tolerance and Lineage
Moving to the Terminal
Spark Streaming
[Diagram labels: Kafka, Flume, HDFS, Twitter, ZeroMQ; HDFS, Cassandra, NFS, TextFile; RDD]
MLlib
• Machine learning primitives in Spark
• Provides training and classification at scale
• Exploits Spark's ability for iterative computation (Linear Regression, Random Forest)
• Currently the most active area of work within Spark
How can I use all this?
[Diagram labels: HDFS; Spark + MLlib; Load Tweets; Bad Tweets; Model; Live Tweets; Good; Bad; Report to Twitter; Spark Streaming]
To Spark or not to Spark
• Iterative computations
• "Don't fix something that is not broken"
• Lower learning barrier
• Large one-time compute
• Single MapReduce operation

