Python and Big data - An Introduction to Spark (PySpark)
Hitesh Dharmdasani
About me
• Security Researcher, Malware Reverse Engineer, Developer
• GIT > GMU > Berkeley > FireEye > On Stage
• Bootstrapping a few ideas
• Hiring!
[Diagram: "Me" at the intersection of Information Security, Big Data, and Machine Learning]
What will we talk about?
• What is Spark?
• How Spark does things
• PySpark and data processing primitives
• Example Demo - Playing with Network Logs
• Streaming and Machine Learning in Spark
• When to use Spark
http://bit.do/PyBelgaumSpark
http://tinyurl.com/PyBelgaumSpark
What will we NOT talk about
• Writing production level jobs
• Fine Tuning Spark
• Integrating Spark with Kafka and the like
• Nooks and crannies of Spark
• But happy to talk about these offline
The Common Scenario
[Diagram: Some Data (NTFS, NFS, HDFS, Amazon S3 …) split across Python Process 1, Process 2, Process 3, Process 4, Process 5 …]
You write 1 job. Then chunk, cut, slice and dice.
• Paradigm shift in computing
• Don't load all the data into one place and do
operations
• State your operations and send code to the
machine
• Sending code to machine >>> Getting data over
network
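A minimal sketch of that idea in PySpark (the HDFS path and the is_error helper are illustrative, not from the talk; it assumes the interactive shell where `sc` already exists): the filter function is shipped to the executors that hold the data, and only a small count travels back over the network.

# Illustrative sketch: the data stays on the cluster, the code travels to it.
log_lines = sc.textFile("hdfs:///data/access.log")   # hypothetical path

def is_error(line):
    return "ERROR" in line

# is_error is serialized and sent to the executors holding the partitions;
# only the resulting count comes back to the driver.
error_count = log_lines.filter(is_error).count()
print(error_count)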
MapReduce
public static MyFirstMapper {
  public void map { . . . }
}
public static MyFirstReducer {
  public void reduce { . . . }
}
public static MySecondMapper {
  public void map { . . . }
}
public static MySecondReducer {
  public void reduce { . . . }
}

Job job = new Job(conf, "First");
job.setMapperClass(MyFirstMapper.class);
job.setReducerClass(MyFirstReducer.class);
/* Job 1 goes to disk */
if (job.isSuccessful()) {
  Job job2 = new Job(conf, "Second");
  job2.setMapperClass(MySecondMapper.class);
  job2.setReducerClass(MySecondReducer.class);
}
This also looks ugly if you ask me!
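For contrast, a hedged sketch of what an equivalent two-stage pipeline might look like in PySpark (first_map, first_reduce, second_map and second_reduce are hypothetical placeholders, as are the paths; it assumes an existing SparkContext `sc`). The intermediate result stays in the cluster instead of being written to disk between jobs:

# Hypothetical two-stage pipeline; the mapper/reducer stand-ins are placeholders.
stage1 = (sc.textFile("hdfs:///input")        # illustrative input path
            .map(first_map)                   # "MyFirstMapper"
            .reduceByKey(first_reduce))       # "MyFirstReducer"

stage2 = (stage1                              # no intermediate write to disk
            .map(second_map)                  # "MySecondMapper"
            .reduceByKey(second_reduce))      # "MySecondReducer"

stage2.saveAsTextFile("hdfs:///output")       # illustrative output path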
What is Spark?
• Open-source, lightning-fast cluster computing
• Focus on speed and scale
• Developed at the AMP Lab, UC Berkeley, by Matei Zaharia
• Most active Apache project in 2014 (even more active than Hadoop)
• Recently beat MapReduce in sorting 100 TB of data, finishing 3X faster on 10X fewer machines
What is Spark?
[Diagram: the Spark engine runs over Some Data (NTFS, NFS, HDFS, Amazon S3 …), exposes Java, Python and Scala APIs, and hosts libraries such as MLlib, Streaming, ETL, SQL and GraphX]
What is Spark?
[Diagram: Spark running over Some Data (NTFS, NFS, HDFS, Amazon S3 …)]
• Inherently distributed
• Computation happens where the data resides
What is different from MapReduce
• Uses main memory for caching
• Dataset is partitioned and stored in RAM/disk for iterative queries
• Large speedups for iterative operations when in-memory caching is used
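A minimal sketch of the caching point (interactive shell with `sc` assumed; the path and parsing are illustrative): persisting an RDD in memory lets repeated queries skip re-reading and re-parsing the input.

# Cache a parsed dataset once, then run several queries against it.
logs = sc.textFile("hdfs:///data/access.log").map(lambda line: line.split())
logs.cache()                      # keep partitions in executor memory after first use

total  = logs.count()                                             # materializes and caches
errors = logs.filter(lambda fields: "ERROR" in fields).count()    # served from memory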
Spark Internals
The Init
• Creating a SparkContext
• It is Spark’s gateway to access the cluster
• In interactive mode, a SparkContext is created for you as ‘sc’
$ pyspark
...
...
SparkContext available as sc.
>>> sc
<pyspark.context.SparkContext at 0xdeadbeef>
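In a standalone script (rather than the pyspark shell) you create the SparkContext yourself. A minimal sketch; the application name and master URL are illustrative choices:

from pyspark import SparkConf, SparkContext

# Illustrative configuration: app name and master are placeholders.
conf = SparkConf().setAppName("PyBelgaumDemo").setMaster("local[*]")
sc = SparkContext(conf=conf)

print(sc.parallelize(range(10)).sum())   # 45
sc.stop()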
Spark Internals
The Key Idea
Resilient Distributed Datasets (RDDs)
• The basic unit of abstraction for data
• Immutable
• Persistence
>>> data = [90, 14, 20, 86, 43, 55, 30, 94]
>>> distData = sc.parallelize(data)
ParallelCollectionRDD[13] at parallelize at PythonRDD.scala:364
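A short sketch of the “immutable” and “persistence” points, continuing from distData above: transformations never modify an RDD in place, they return a new one, and persist()/cache() asks Spark to keep it around once computed.

>>> doubled = distData.map(lambda x: x * 2)   # a new RDD; distData is unchanged
>>> doubled.persist()                         # keep it in memory once materialized
>>> doubled.collect()
[180, 28, 40, 172, 86, 110, 60, 188]
>>> distData.collect()                        # the original data is untouched
[90, 14, 20, 86, 43, 55, 30, 94]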
Spark Internals
Operations on RDDs - Transformations & Actions
Spark Internals
[Diagram: the SparkContext loads a File/Collection into an RDD; Transformations produce new RDDs]
Spark Internals
Lazy Evaluation
Now what?
Spark Internals
[Diagram: the SparkContext loads a File/Collection into an RDD; Transformations produce new RDDs; Actions return results]
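A hedged sketch of what lazy evaluation means in practice (interactive shell with `sc` assumed): the transformations return immediately and nothing is computed until an action such as count() runs.

# Nothing is read or computed yet: map/filter only record the lineage.
nums     = sc.parallelize(range(1, 1000001))
squares  = nums.map(lambda x: x * x)          # transformation (lazy)
big_ones = squares.filter(lambda x: x > 10)   # transformation (lazy)

# The action triggers the whole chain.
print(big_ones.count())   # 999997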
Spark Internals
Transformation Operations on RDDs

Map
def mapFunc(x):
    return x + 1

rdd_2 = rdd_1.map(mapFunc)

Filter
def filterFunc(x):
    if x % 2 == 0:
        return True
    else:
        return False

rdd_2 = rdd_1.filter(filterFunc)
Spark Internals
Transformation Operations on RDDs
• map
• filter
• flatMap
• mapPartitions
• mapPartitionsWithIndex
• sample
• union
• intersection
• distinct
• groupByKey
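A few of these in a quick, illustrative sketch (interactive shell with `sc` assumed; element order in the outputs may vary):

>>> words = sc.parallelize(["to be", "or not", "to be"])
>>> words.flatMap(lambda s: s.split()).collect()            # one list per record, flattened
['to', 'be', 'or', 'not', 'to', 'be']
>>> words.flatMap(lambda s: s.split()).distinct().collect()
['to', 'be', 'or', 'not']
>>> sc.parallelize([("a", 1), ("b", 2), ("a", 3)]).groupByKey().mapValues(list).collect()
[('a', [1, 3]), ('b', [2])]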
Spark Internals
>>> increment_rdd = distData.map(mapFunc)
>>> increment_rdd.collect()
[91, 15, 21, 87, 44, 56, 31, 95]
>>>
>>> increment_rdd.filter(filterFunc).collect()
[44, 56]

OR

>>> distData.map(mapFunc).filter(filterFunc).collect()
[44, 56]
Spark Internals
Fault Tolerance and Lineage
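The original slide has no code here, but as a hedged illustration of lineage: each RDD remembers the chain of transformations that produced it, and that recorded lineage is what Spark replays to rebuild lost partitions. toDebugString() shows it (exact output format varies by Spark version).

>>> chained = distData.map(mapFunc).filter(filterFunc)
>>> chained.toDebugString()   # lineage: parallelize -> map -> filter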
Moving to the Terminal
Spark Streaming
[Diagram: input sources (Kafka, Flume, HDFS, Twitter, ZeroMQ) flow through Spark Streaming to outputs (HDFS, Cassandra, NFS, TextFile, RDD)]
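The deck itself shows no streaming code; a minimal, illustrative sketch of the API (host, port and batch interval are arbitrary choices): a StreamingContext turns a socket source into micro-batches that the familiar RDD-style operations run over.

from pyspark.streaming import StreamingContext

# Illustrative only: 1-second micro-batches over a local socket source.
ssc = StreamingContext(sc, batchDuration=1)
lines = ssc.socketTextStream("localhost", 9999)

counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()            # print each batch's word counts

ssc.start()
ssc.awaitTermination()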
ML Lib
• Machine learning primitives in Spark
• Provides training and classification at scale
• Exploits Spark’s ability to do iterative computation (Linear Regression, Random Forest)
• Currently the most active area of work within Spark
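No MLlib code appears in the deck; a hedged sketch of the RDD-based MLlib API of that era, with made-up toy data (two features per point, binary labels):

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

# Toy training set: label followed by a two-element feature vector.
training = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
    LabeledPoint(1.0, [2.0, 0.5]),
    LabeledPoint(0.0, [0.5, 2.0]),
])

model = LogisticRegressionWithSGD.train(training, iterations=100)
print(model.predict([1.5, 0.0]))   # 0 or 1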
How can I use all this?
[Diagram: historical tweets in HDFS are loaded into Spark + MLlib to train a model on bad tweets; live tweets arrive through Spark Streaming, the model labels each one good or bad, and the bad ones are reported to Twitter]
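A very rough, hypothetical sketch of how those pieces could fit together (every path, port and helper function here is a placeholder, not code from the talk):

from pyspark.streaming import StreamingContext
from pyspark.mllib.classification import LogisticRegressionWithSGD

# Train on historical tweets stored in HDFS; featurize() would turn a tweet into a LabeledPoint.
labeled = sc.textFile("hdfs:///tweets/labeled").map(featurize)
model = LogisticRegressionWithSGD.train(labeled, iterations=100)

# Score live tweets; extract_features() and report_bad() are placeholders.
ssc = StreamingContext(sc, 5)
live = ssc.socketTextStream("localhost", 9999)
flagged = live.filter(lambda tweet: model.predict(extract_features(tweet)) == 1)
flagged.foreachRDD(lambda rdd: rdd.foreach(report_bad))

ssc.start()
ssc.awaitTermination()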
To Spark or not to Spark
• Iterative computations
• “Don't fix something that is not broken”
• Lower learning barrier
• Large one-time computations
• A single MapReduce operation
