PYSPARK IN PRACTICE
PYDATA LONDON 2016
Ronert Obst, Senior Data Scientist
Dat Tran, Data Scientist
 
 
AGENDA
○ Short introduction
○ Data structures
○ Configuration and performance
○ Unit testing with PySpark
○ Data pipeline management and workflows
○ Online learning with PySpark streaming
○ Operationalisation
WHAT IS PYSPARK?
DATA STRUCTURES
DATA STRUCTURES
○ RDDs
○ DataFrames
○ DataSets
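A quick illustration of the first two (the typed Dataset API is only available in Scala and Java, not in PySpark); a minimal sketch:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="data-structures")
sql_context = SQLContext(sc)

# RDD: a distributed collection of arbitrary Python objects
rdd = sc.parallelize([("alice", 34), ("bob", 28)])

# DataFrame: the same data plus a schema (metadata), which enables the query optimiser
df = sql_context.createDataFrame(rdd, ["name", "age"])
df.printSchema()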
DATAFRAMES
○ Built on top of RDD
○ Include metadata
○ Turns PySpark API calls into query plan
○ Less flexible than RDDs
○ Python UDFs impact performance, so use built-in functions whenever possible (see the sketch below)
○ HiveContext ftw
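A small sketch of the last two bullets, assuming a Spark 1.6-style HiveContext and a parquet file with a name column (the path and column name are placeholders):
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

sc = SparkContext(appName="dataframe-example")
sql_context = HiveContext(sc)
df = sql_context.read.parquet("hdfs://...")

# Built-in functions run inside the JVM and stay in the query plan
df_builtin = df.withColumn("name_upper", F.upper(df["name"]))

# A Python UDF ships every row through a Python worker process instead
to_upper = udf(lambda s: s.upper() if s is not None else None, StringType())
df_udf = df.withColumn("name_upper", to_upper(df["name"]))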
PYTHON DATAFRAME PERFORMANCE
Source: Databricks - Spark in Production (Aaron Davidson)
CONFIGURATION AND
PERFORMANCE
CONFIGURATION PRECEDENCE
1. Programmatically (SparkConf()); see the sketch below
2. Command line (spark-submit --master yarn --executor-memory 20G ...)
3. Configuration files (spark-defaults.conf, spark-env.sh)
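A minimal sketch of option 1 (the master, app name and setting are illustrative):
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("pyspark-in-practice")
        .set("spark.executor.memory", "20g"))
sc = SparkContext(conf=conf)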
SPARK 1.6 MEMORY MANAGEMENT
Source: Spark Memory Management (Alexey Grishchenko) (https://0x0fff.com/spark-memory-management/)
WORKED EXAMPLE
CLUSTER HARDWARE:
○ 6 nodes with 16 cores and 64 GB RAM each
YARN CONFIG:
○ yarn.nodemanager.resource.memory-mb: 61 GB
○ yarn.nodemanager.resource.cpu-vcores: 15
SPARK-DEFAULTS.CONF
○ num-executors: 4 executors per node * 6 nodes = 24
○ executor-cores: 6
○ executor-memory: 11600 MB
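The same values expressed as spark-defaults.conf entries (spark.executor.instances is the conf-file counterpart of num-executors):
spark.executor.instances    24
spark.executor.cores        6
spark.executor.memory       11600m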
PYTHON OOM
If the Python workers need more memory, e.g. for scikit-learn:
○ Reduce concurrency (4 executors per node with 4 executor cores)
○ Set spark.python.worker.memory = 1024 MB (default is 512 MB)
○ Leaves spark.executor.memory = 10600 MB
OTHER USEFUL SETTINGS
○ Save substantial memory: spark.rdd.compress = true
○ Relaunch stragglers: spark.speculation = true
○ Fix Python 3 hash seed issue: spark.executorEnv.PYTHONHASHSEED = 0
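In spark-defaults.conf, the Python worker memory from the previous slide plus these settings read:
spark.python.worker.memory          1024m
spark.rdd.compress                  true
spark.speculation                   true
spark.executorEnv.PYTHONHASHSEED    0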
TUNING PERFORMANCE
○ Re-use RDDs with cache() (set storage level to MEMORY_AND_DISK)
○ Avoid groupByKey() on RDDs
○ Distribute by key: data = sql_context.read.parquet("hdfs://...").repartition(N_PARTITIONS, "key")
○ Use treeReduce and treeAggregate instead of reduce and aggregate
○ Broadcast small tables containing frequently accessed data with sc.broadcast(broadcast_var); a short sketch follows below
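A compact sketch of a few of these points, assuming an existing sc and sql_context (the path, partition count and lookup table are placeholders):
from pyspark import StorageLevel

# Re-use a keyed DataFrame: repartition by key and persist to memory and disk
data = sql_context.read.parquet("hdfs://...").repartition(200, "key")
data.persist(StorageLevel.MEMORY_AND_DISK)

# Prefer reduceByKey (or treeAggregate) over groupByKey / aggregate
pair_rdd = data.rdd.map(lambda row: (row["key"], 1))
counts = pair_rdd.reduceByKey(lambda a, b: a + b)

# Broadcast a small, frequently accessed lookup table once per executor
lookup = sc.broadcast({"gb": "United Kingdom", "de": "Germany"})
labelled = pair_rdd.map(lambda kv: (lookup.value.get(kv[0], "unknown"), kv[1]))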
UNIT TESTING WITH PYSPARK
MOST DATA SCIENTISTS WRITE CODE THAT JUST WORKS
BUGS ARE EVERYWHERE...
TDD IS THE SOLUTION!
THE REALITY CAN BE DIFFICULT...
EXPLORATION PHASE
PRODUCTION PHASE
import unittest

from pyspark import SparkConf, SparkContext

# Import script modules here
import clustering


class ClusteringTest(unittest.TestCase):

    def setUp(self):
        """Create a single node Spark application."""
        conf = SparkConf()
        conf.set("spark.executor.memory", "1g")
        conf.set("spark.cores.max", "1")
        conf.set("spark.app.name", "nosetest")
        self.sc = SparkContext(conf=conf)
        self.mock_df = self.mock_data()

    def tearDown(self):
        """Stop the SparkContext."""
        self.sc.stop()
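The mock_data() helper called in setUp() is not shown on the slide; a minimal sketch of such a method on ClusteringTest, assuming five two-dimensional points that convert_df() later turns into a Spark DataFrame (column names and values are made up), could be:
    # requires: import pandas as pd (at module level)
    def mock_data(self):
        """Mock a small feature table with two well separated clusters."""
        return pd.DataFrame(
            [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)],
            columns=["x", "y"])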
TEST:
def test_assign_cluster(self):
    """Check if rows are labeled as expected."""
    input_df = clustering.convert_df(self.sc, self.mock_df)
    scaled_df = clustering.rescale_df(input_df)
    label_df = clustering.assign_cluster(scaled_df)
    self.assertEqual(
        label_df.map(lambda x: x.label).collect(), [0, 0, 0, 1, 1])
IMPLEMENTATION:
from pyspark.ml.clustering import KMeans

def assign_cluster(data):
    """Train kmeans on rescaled data and then label the rescaled data."""
    kmeans = KMeans(k=2, seed=1, featuresCol="features_scaled",
                    predictionCol="label")
    model = kmeans.fit(data)
    label_df = model.transform(data)
    return label_df
Full example: https://github.com/datitran/spark-tdd-example
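With a local Spark installation on the PYTHONPATH, the suite then runs like any other nose test module (the file name here is hypothetical):
nosetests -v test_clustering.py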
DATA PIPELINE MANAGEMENT
AND WORKFLOWS
PIPELINE MANAGEMENT CAN BE DIFFICULT...
IN THE PAST...
CUSTOM LOGGERS
LUIGI
○ 100% Python
○ Web dashboard
○ Parallel workers
○ Central scheduler
○ Many templates for various tasks e.g. Spark, SQL etc.
○ Well configured logger
○ Parameters
EXAMPLE SPARK TASK
import luigi
from luigi.contrib.hdfs import HdfsTarget
from luigi.contrib.spark import SparkSubmitTask


class ClusteringTask(SparkSubmitTask):
    """Assign cluster label to some data."""

    date = luigi.DateParameter()
    num_k = luigi.IntParameter()

    name = "Clustering with PySpark"
    app = "../clustering.py"

    def app_options(self):
        return [self.num_k]

    def requires(self):
        return [FeatureEngineeringTask(date=self.date)]

    def output(self):
        return HdfsTarget("hdfs://...")
EXAMPLE CONFIG FILE
client.cfg
[resources]
spark: 2
[spark]
master: yarn
executor-cores: 3
num-executors: 11
executor-memory: 20G
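Tasks are then triggered from the command line; luigi maps the date and num_k parameters to --date and --num-k options (the module name below is hypothetical):
luigid                      # central scheduler, also serves the web dashboard
luigi --module clustering_tasks ClusteringTask \
    --date 2016-05-06 --num-k 5 --workers 2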
WITH LUIGI
CAVEATS
○ No built-in job trigger
○ Documentation could be better
○ Logging can get very noisy when running multiple workers
○ Duplicated parameters in all downstream tasks
ONLINE LEARNING WITH
PYSPARK STREAMING
ONLINE LEARNING WITH PYSPARK
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import StreamingLinearRegressionWithSGD

sc = SparkContext(appName="online-learning")
ssc = StreamingContext(sc, batchDuration=1)

def parse(lp):
    """Parse lines such as "(label, [x1,x2,x3])" into LabeledPoints."""
    label = float(lp[lp.find('(') + 1: lp.find(',')])
    vec = Vectors.dense(lp[lp.find('[') + 1: lp.find(']')].split(','))
    return LabeledPoint(label, vec)

stream = ssc.socketTextStream("localhost", 9999).map(parse)

numFeatures = 3
model = StreamingLinearRegressionWithSGD()
model.setInitialWeights([0.0, 0.0, 0.0])
model.trainOn(stream)
model.predictOnValues(stream.map(lambda lp: (lp.label, lp.features))).pprint()

ssc.start()
ssc.awaitTermination()
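To try this locally, feed labelled points into the socket with netcat; the line format is what parse() expects:
nc -lk 9999
# then type lines such as:
(1.0, [2.0, 0.5, 1.3])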
OPERATIONALISATION
RESTFUL API EXAMPLE WITH FLASK
import pandas as pd
from flask import Flask, request

app = Flask(__name__)

# Load model stored on redis or hdfs
app.model = load_model()


def predict(model, features):
    """Use the trained cluster centers to assign cluster to new features."""
    differences = pd.DataFrame(
        model.values - features.values, columns=model.columns)
    differences_square = differences.apply(lambda x: x**2)
    col_sums = differences_square.sum(axis=1)
    label = col_sums.idxmin()
    return int(label)
@app.route("/clustering", methods=["POST"])
def clustering():
    """
    curl -i -H "Content-Type: application/json" -X POST -d @example.json
    "x.x.x.x:5000/clustering"
    """
    features_json = request.json
    features_df = pd.DataFrame(features_json)
    return str(predict(load_model(), features_df))
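load_model() itself is not shown on the slides (the comment only says the model lives in Redis or HDFS). A stand-in for local experiments, reading the k-means cluster centres from a CSV file (the file name and storage are purely illustrative, and it must be defined before the app.model = load_model() line runs), might be:
def load_model(path="cluster_centers.csv"):
    """Load the cluster centres as a pandas DataFrame (hypothetical local storage)."""
    return pd.read_csv(path)


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)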
DEPLOY YOUR APPS WITH CF PUSH
○ Platform as a Service (PaaS)
○ "Here is my source code. Run it on the cloud for me. I do not care
how." (Onsi Fakhouri)
○ Trial version:
○ Python Cloud Foundry examples:
https://run.pivotal.io/ (https://run.pivotal.io/)
https://github.com/ihuston/python-
cf-examples (https://github.com/ihuston/python-cf-examples)
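For illustration only (the names and values below are not from the talk), deploying the Flask API above comes down to a small manifest.yml plus one command:
# manifest.yml (hypothetical values)
applications:
- name: clustering-api
  memory: 512M
  buildpack: python_buildpack
  command: python app.py
With that in place, running cf push from the project directory uploads the code, installs the dependencies listed in requirements.txt and starts the app.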
  • 37. DEPLOY YOUR APPS WITH CF PUSH ○ Platform as a Service (PaaS) ○ "Here is my source code. Run it on the cloud for me. I do not care how." (Onsi Fakhouri) ○ Trial version: ○ Python Cloud Foundry examples: https://run.pivotal.io/ (https://run.pivotal.io/) https://github.com/ihuston/python- cf-examples (https://github.com/ihuston/python-cf-examples)