The Nitty Gritty of Advanced
Analytics
Using Apache Spark in Python
Miklos Christine
Solutions Architect
mwc@databricks.com, @Miklos_C
About Me
Miklos Christine
Solutions Architect @ Databricks
- mwc@databricks.com
- @Miklos_C on Twitter
Systems Engineer @ Cloudera
Supported a few of the largest clusters in the world
Software Engineer @ Cisco
UC Berkeley Graduate
We are Databricks, the company behind Spark
Founded by the creators of Apache Spark in 2013.
75% share of Spark code contributed by Databricks in 2014.
Created Databricks on top of Spark to make big data simple.
Apache Spark Engine
Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX
• Unified engine across diverse workloads & environments
• Scale out, fault tolerant
• Python, Java, Scala, and R APIs
• Standard libraries
2010: research paper
2012: started @ Berkeley
2013: Databricks started & donated to ASF
2014: Spark 1.0 & libraries (SQL, ML, GraphX)
2015: DataFrames, Tungsten, ML Pipelines
2016: Spark 2.0
Spark Community Growth
• Spark Survey 2015 Highlights
• End of Year Spark Highlights
2015: A Great Year for Spark
• Most active open source project in (big) data
• 1000+ code contributors
• New language: R
• Widespread industry support & adoption
HOW RESPONDENTS ARE RUNNING SPARK
• 51% on a public cloud
TOP ROLES USING SPARK
• 41% of respondents identify themselves as Data Engineers
• 22% of respondents identify themselves as Data Scientists
Spark User Highlights
NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO
Source: Slide 5 of Spark Community Update
Large-Scale Usage
• Largest cluster: 8,000 nodes (Tencent)
• Largest single job: 1 PB (Alibaba, Databricks)
• Top streaming intake: 1 TB/hour (HHMI Janelia Farm)
• 2014 on-disk sort record: fastest open source engine for sorting a PB
Spark API Performance
History of Spark APIs
RDD (2011)
• Distributed collection of JVM objects
• Functional operators (map, filter, etc.)
DataFrame (2013)
• Distributed collection of Row objects
• Expression-based operations and UDFs
• Logical plans and optimizer
• Fast/efficient internal representations
DataSet (2015)
• Internally rows, externally JVM objects
• Almost the “best of both worlds”: type safe + fast
• But slower than DataFrames, and not as good for interactive analysis, especially in Python
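To make the contrast concrete, here is a minimal sketch (assuming a running SparkContext sc and SQLContext sqlContext, which are not defined on these slides) of the same filter expressed in both APIs:

from pyspark.sql import Row

# RDD API: functional operators over plain Python objects
people_rdd = sc.parallelize([("Ann", 35), ("Bob", 17)])
adults_rdd = people_rdd.filter(lambda p: p[1] >= 18)

# DataFrame API: expression-based operations the Catalyst optimizer can plan
people_df = sqlContext.createDataFrame(people_rdd.map(lambda p: Row(name=p[0], age=p[1])))
adults_df = people_df.filter(people_df.age >= 18)

print(adults_rdd.collect())
adults_df.show()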
Benefit of Logical Plan: Performance Parity Across Languages
(chart: DataFrame vs. RDD query runtimes across languages)
ETL with Spark
ETL: Extract, Transform, Load
● Key factor for big data platforms
● Provides Speed Improvements in All Workloads
● Typically Executed by Data Engineers
File Formats
● Text File Formats
○ CSV
○ JSON
● Avro Row Format
● Parquet Columnar Format
File Formats + Compression
● File Formats
○ JSON
○ CSV
○ Avro
○ Parquet
● Compression Codecs
○ No compression
○ Snappy
○ Gzip
○ LZO
Spark Parquet Properties
● Industry Standard File Format: Parquet
○ Write to Parquet:
df.write.format("parquet").save("namesAndAges.parquet")
df.write.format("parquet").saveAsTable("myTestTable")
○ For compression:
spark.sql.parquet.compression.codec = (gzip, snappy)
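A minimal sketch of setting the codec programmatically (assuming an existing sqlContext; the property can also be set in spark-defaults.conf):

# Use snappy (or gzip) compression for subsequent Parquet writes
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
df.write.format("parquet").save("namesAndAges.parquet")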
Small Files Problem
● Small files problem still exists
● Metadata loading
● APIs:
df.coalesce(N)
df.repartition(N)
Ref: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
All About Partitions
● RDD / DataFrame Partitions
df.rdd.getNumPartitions()
● SparkSQL Shuffle Partitions
spark.sql.shuffle.partitions
● Table Level Partitions
df.write.partitionBy("year").save("data.parquet")
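A short sketch pulling the three partition levels together (assuming an existing DataFrame df and sqlContext; the values are illustrative):

# How many partitions back the DataFrame?
print(df.rdd.getNumPartitions())

# Parallelism used for shuffles in joins/aggregations (default is 200)
sqlContext.setConf("spark.sql.shuffle.partitions", "400")

# Hive-style table partitioning: one output directory per distinct year
df.write.partitionBy("year").save("data.parquet")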
PySpark ETL APIs - Text Formats
# CSV (spark-csv package)
df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .options(header='true', inferSchema='true') \
    .load('/path/to/data')

# JSON
df = sqlContext.read.json("/tmp/test.json")
df.write.json("/tmp/test_output.json")
PySpark ETL APIs - Container Formats
# Binary Container Formats
# Avro
df = sqlContext.read \
    .format("com.databricks.spark.avro") \
    .load("/path/to/files/")
# Parquet
df = sqlContext.read.parquet("/path/to/files/")
df.write.parquet("/path/to/files/")
PySpark ETL APIs
● Manage Number of Files
○ APIs manage the number of files per directory
df.repartition(80) \
    .write \
    .parquet("/path/to/parquet/")

df.repartition(80) \
    .write \
    .partitionBy("year") \
    .parquet("/path/to/parquet/")
Common ETL Problems
● Malformed JSON Records (see the sketch below)
sqlContext.sql("SELECT _corrupt_record FROM jsonTable WHERE _corrupt_record IS NOT NULL")
● Mismatched DataFrame Schema
○ Null Representation vs Schema DataType
● Many Small Files / No Partition Strategy
○ Parquet Files: ~128MB - 256MB Compressed
Ref: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dealing_with_bad_data.html
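A minimal sketch of surfacing malformed records (assuming a hypothetical input file /tmp/test.json):

# Records that fail to parse land in the _corrupt_record column
df = sqlContext.read.json("/tmp/test.json")
df.registerTempTable("jsonTable")
bad_rows = sqlContext.sql(
    "SELECT _corrupt_record FROM jsonTable WHERE _corrupt_record IS NOT NULL")
bad_rows.show(truncate=False)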
Debugging Spark
Spark Driver Error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 362.0 failed
4 times, most recent failure: Lost task 1.3 in stage 362.0 (TID 275202, ip-10-111-225-98.ec2.internal):
java.nio.channels.ClosedChannelException
Spark Executor Error:
16/04/13 20:02:16 ERROR DefaultWriterContainer: Aborting task.
java.text.ParseException: Unparseable number: "N"
at java.text.NumberFormat.parse(NumberFormat.java:385)
at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply$mcD$sp(TypeCast.scala:58)
at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply(TypeCast.scala:58)
at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply(TypeCast.scala:58)
at scala.util.Try.getOrElse(Try.scala:77)
at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:58)
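The executor error above is the CSV parser choking on the literal string "N" in a numeric column. One hedged way to handle it with the spark-csv package, assuming "N" really is the file's null marker, is the nullValue option:

df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .options(header='true', inferSchema='true', nullValue='N') \
    .load('/path/to/data')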
Debugging Spark
SQL with Spark
SparkSQL Best Practices
● DataFrames and SparkSQL compile to the same execution plans; use them interchangeably
● Use builtin functions instead of custom UDFs (see the sketch below)
○ import pyspark.sql.functions
● Examples:
○ to_date()
○ get_json_object()
○ regexp_extract()
○ hour() / minute()
Ref: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
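A minimal sketch of leaning on built-ins (assuming a DataFrame df with hypothetical string columns event_time, payload, and request_url):

from pyspark.sql import functions as F

parsed = df.select(
    F.to_date(F.col('event_time')).alias('event_date'),                  # string -> date
    F.hour(F.col('event_time')).alias('event_hour'),                     # hour of the timestamp
    F.get_json_object(F.col('payload'), '$.user.id').alias('user_id'),   # JSON path extraction
    F.regexp_extract(F.col('request_url'), '^https?://([^/]+)', 1).alias('host'))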
SparkSQL Best Practices
● Large Table Joins
○ Largest Table on LHS
○ Increase Spark Shuffle Partitions (see the setting below)
○ Leverage “cluster by” API included in Spark 1.6
sqlCtx.sql("select * from large_table_1 cluster by num1")
.registerTempTable("sorted_large_table_1");
sqlCtx.sql(“cache table sorted_large_table_1”);
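The shuffle-partitions tip in code form (a sketch; the value is illustrative, not a recommendation):

# Raise shuffle parallelism before joining very large tables (default is 200)
sqlCtx.setConf("spark.sql.shuffle.partitions", "800")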
PySpark API Best Practices
● User Defined Functions (UDFs)
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

add_n = udf(lambda x, y: x + y, IntegerType())

# Add an id_offset column, casting the id column to an integer type
df = df.withColumn('id_offset',
    add_n(F.lit(1000), df.id.cast(IntegerType())))
PySpark API Best Practices
● Built-in Functions
corpus_df = df.select(
    F.lower(F.col('body')).alias('corpus'),
    F.monotonicallyIncreasingId().alias('id'))

corpus_df = df.select(
    F.date_format(
        F.from_utc_timestamp(
            F.from_unixtime(F.col('created_utc')), "PST"),
        'EEEE').alias('dayofweek'))
Ref: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
PySpark API Best Practices
● User Defined Functions (UDFs)
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

def squared(s):
    return s * s

squared_udf = udf(squared, LongType())                              # for the DataFrame API
sqlContext.udf.register("squaredWithPython", squared, LongType())   # for SQL queries

display(df.select("id", squared_udf("id").alias("id_squared")))
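Once registered, the same UDF is callable from SQL (a sketch, assuming df has been registered as a hypothetical temp table named test):

df.registerTempTable("test")
sqlContext.sql("SELECT id, squaredWithPython(id) AS id_squared FROM test").show()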
ML with Spark
Data Science Time
Why Spark ML
Provide general purpose ML algorithms on top of Spark
• Let Spark handle the distribution of data and queries; scalability
• Leverage its improvements (e.g. DataFrames, Datasets, Tungsten)
Advantages of MLlib’s Design:
• Simplicity
• Scalability
• Streamlined end-to-end
• Compatibility
High-level functionality in MLlib
Learning tasks
• Classification
• Regression
• Recommendation
• Clustering
• Frequent itemsets
Workflow utilities
• Model import/export
• Pipelines
• DataFrames
• Cross validation
Data utilities
• Feature extraction & selection
• Statistics
• Linear algebra
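To make the Pipelines bullet concrete, a minimal spark.ml sketch (assuming a DataFrame training with hypothetical columns text and label):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

# Feature extraction and the estimator chained into a single Pipeline
pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
model = pipeline.fit(training)
predictions = model.transform(training)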
Machine Learning: What and Why?
ML uses data to identify patterns and make decisions.
Core value of ML is automated decision making
• Especially important when dealing with TB or PB of data
Many Use Cases including:
• Marketing and advertising optimization
• Security monitoring / fraud detection
• Operational optimizations
Algorithm coverage in MLlib
Classification
• Logistic regression w/ elastic net
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decision trees
• Random forests
• Gradient-boosted trees
• Multilayer perceptron
• One-vs-rest
Regression
• Least squares w/ elastic net
• Isotonic regression
• Decision trees
• Random forests
• Gradient-boosted trees
• Streaming linear methods
Recommendation
• Alternating Least Squares
Frequent itemsets
• FP-growth
• Prefix span
Clustering
• Gaussian mixture models
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocation
• Power Iteration Clustering
Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation
Linear algebra
• Local dense & sparse vectors & matrices
• Distributed matrices
• Block-partitioned matrix
• Row matrix
• Indexed row matrix
• Coordinate matrix
• Matrix decompositions
Model import/export
Pipelines
Feature extraction &
selection
• Binarizer
• Bucketizer
• Chi-Squared selection
• CountVectorizer
• Discrete cosine transform
• ElementwiseProduct
• Hashing term frequency
• Inverse document frequency
• MinMaxScaler
• Ngram
• Normalizer
• One-Hot Encoder
• PCA
• PolynomialExpansion
• RFormula
• SQLTransformer
• Standard scaler
• StopWordsRemover
• StringIndexer
• Tokenizer
• VectorAssembler
• VectorIndexer
• VectorSlicer
• Word2Vec
List based on Spark 1.5
Spark ML Best Practices
● spark.mllib (RDD-based) vs. spark.ml (DataFrame-based)
○ Understand the differences
● Don’t Pipeline Too Many Stages
○ Check results between stages (see the sketch below)
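A hedged sketch of checking results between stages rather than fitting one long pipeline blindly (assuming a DataFrame df with a hypothetical text column):

from pyspark.ml.feature import Tokenizer, StopWordsRemover

tokenizer = Tokenizer(inputCol="text", outputCol="words")
tokenized = tokenizer.transform(df)
tokenized.select("words").show(5, truncate=False)        # eyeball the tokens first

remover = StopWordsRemover(inputCol="words", outputCol="filtered")
remover.transform(tokenized).select("filtered").show(5, truncate=False)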
PySpark ML API Best Practices
● DataFrame to RDD Mapping
# Helper constants assumed from NLTK (not shown on the original slide)
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

PUNCTUATION = set(string.punctuation)
STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def tokenize(text):
    tokens = word_tokenize(text)
    lowercased = [t.lower() for t in tokens]
    no_punctuation = []
    for word in lowercased:
        punct_removed = ''.join([letter for letter in word if letter not in PUNCTUATION])
        no_punctuation.append(punct_removed)
    no_stopwords = [w for w in no_punctuation if w not in STOPWORDS]
    stemmed = [STEMMER.stem(w) for w in no_stopwords]
    return [w for w in stemmed if w]

rdd = wordsDataFrame.map(lambda x: (x['id'], tokenize(x['corpus'])))
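If the tokenized RDD needs to go back into DataFrame form, a short sketch (column names are arbitrary):

tokens_df = rdd.toDF(['id', 'tokens'])
tokens_df.show(5, truncate=False)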
Learning more about MLlib
Guides & examples
• Example workflow using ML Pipelines (Python)
• The above links are part of the Databricks Guide, which contains many more examples and references.
References
• Apache Spark MLlib User Guide
• The MLlib User Guide contains code snippets for almost all algorithms, as well as links to API documentation.
• Meng et al. “MLlib: Machine Learning in Apache Spark.” 2015. https://arxiv.org/abs/1505.06807 (academic paper)
Spark Demo
Thanks!
Sign Up For Databricks Community Edition!
http://go.databricks.com/databricks-community-edition-beta-waitlist
