Gurpreet Singh, Microsoft
DASK and Apache Spark
#UnifiedAnalytics #SparkAISummit
…this talk is also about Scaling Python for Data Analysis & Machine Learning!
We’ll start by briefly reviewing the Scalability Challenge in the PyData stack…
…before Comparing and Contrasting …
Pandas & Scikit-learn
Options to Scale…
§ GIL Challenge
§ Multiprocessing
§ Concurrent Futures
§ JobLib
§ partial_fit()
§ Hashing Trick
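
As a rough illustration of the hashing trick listed above, the sketch below maps tokens into a fixed-length count vector so that memory use does not grow with the vocabulary. The function name `hash_features`, the use of `zlib.crc32`, and the vector width of 8 are assumptions made for this example only; scikit-learn's `FeatureHasher` provides a more complete implementation.

```python
import zlib


def hash_features(tokens, n_features=8):
    """Map string tokens to a fixed-length count vector (toy hashing trick)."""
    vector = [0] * n_features
    for token in tokens:
        # A stable hash picks the bucket; collisions are accepted by design.
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vector[index] += 1
    return vector


doc_a = hash_features(["spark", "dask", "spark"])
doc_b = hash_features(["pandas", "numpy", "scikit-learn", "dask"])
```

Because the vector width is fixed up front, documents can be featurized one batch at a time, which is what makes the technique pair well with incremental training via partial_fit().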
Design Approach: Directed Acyclic Graph (DAG), Lazy Execution

Apache Spark
§ High Level APIs: Spark SQL / DataFrames, Spark Streaming, Spark MLlib, GraphFrames (GraphX doesn’t support Python)
§ Low Level APIs: RDD
§ Task Scheduler: Local, Spark Standalone, Mesos, YARN

DASK
§ High Level APIs: DASK Arrays [Parallel NumPy], DASK DataFrames [Parallel Pandas], DASK-ML [Parallel Scikit-learn], DASK Bag [Parallel Lists]
§ Low Level APIs: DASK Delayed, DASK Futures
§ Custom Algorithms: Custom Graphs (submit the graph as a Python Dictionary object)
§ Task Scheduler: Pluggable Task Scheduling System
  § Local Scheduler: Synchronous, Threaded, Multiprocessing
  § Distributed
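
To make the "graph as a Python dictionary" idea concrete, here is a minimal sketch: keys name results, and a tuple whose first element is a callable describes a task. The `evaluate` helper is a hypothetical, single-threaded toy written for this example and is not Dask's scheduler; Dask's own schedulers (for instance `dask.threaded.get`) accept a dictionary of this shape.

```python
from operator import add, mul


def evaluate(graph, key):
    """Recursively compute one key of a Dask-style task graph (toy evaluator)."""
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # String arguments that name other keys are dependencies; resolve them first.
        resolved = [
            evaluate(graph, arg) if isinstance(arg, str) and arg in graph else arg
            for arg in args
        ]
        return func(*resolved)
    return task  # A plain value, not a task.


graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),       # depends on x and y
    "result": (mul, "sum", 10),   # depends on sum
}
result = evaluate(graph, "result")
```

Nothing runs when the dictionary is built; work happens only when a key is requested, which is the lazy execution model both stacks share.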
DASK DataFrame & PySpark

DASK DataFrames [Parallel Pandas]
Challenges
§ DASK DataFrames API is not identical with Pandas API
§ Performance Concerns with Operations involving Shuffling
§ Inefficiencies of Pandas are carried over
Recommendations
§ Follow the Pandas Performance tips
§ Avoid Shuffle, Use pre-sorting, Persist the Results

PySpark
Challenges
§ Performance Concerns due to the PySpark Design
Recommendations
§ Use DataFrames API
§ Use Vectorized/Pandas UDF (Spark v2.3 onwards)
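
The advice to follow the Pandas performance tips, and to prefer vectorized UDFs on the Spark side, largely comes down to operating on whole columns instead of one row at a time. The pandas-only sketch below contrasts the two styles; the column names `price` and `qty` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Row-at-a-time: one Python function call per row, slow on large frames.
row_wise = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Vectorized: a single operation over whole columns.
vectorized = df["price"] * df["qty"]
```

Both produce the same values; the vectorized form avoids per-row Python overhead, which is the same motivation behind Pandas UDFs in PySpark.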
Code Review
[Code screenshot; callouts: "Wildcard" and "Wildcard Additional Details"]
Code Review
§ While Pandas displays a sample of the data, DASK and PySpark show metadata of the DataFrame.
§ The npartitions value shows how many partitions the DataFrame is split into.
§ DASK created a DAG with 99 nodes to process the data.
Code Review
§ The .compute() or .head() method tells Dask to go ahead and run the computation and display the results.
§ In PySpark, .show() displays the DataFrame in a tabular form.
Scalable Machine Learning

OCT ‘17 - DASK-ML
How does DASK-ML work?
§ Parallelize Scikit-Learn (Distributed JobLib)
    # As shown on the slide; sklearn.externals.joblib was removed in later scikit-learn releases.
    from sklearn.externals.joblib import _dask, parallel_backend
    from sklearn.utils import register_parallel_backend
    register_parallel_backend('distributed', _dask.DaskDistributedBackend)
§ Re-implement Algorithms
§ Partner with existing Libraries
    from dask_ml.xgboost import XGBRegressor
    est = XGBRegressor(...)
    est.fit(train, train_labels)
    prediction = est.predict(test)

Scalable ML Approaches
§ Spark for Feature Engineering + Scikit-learn etc. for Learning
§ Distributed ML Algorithms from Spark MLlib
§ Train/evaluate Scikit-learn models in parallel (spark-sklearn)

Spark MLlib - As of Spark 2.0, the primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package. It provides:
§ Algorithms: Common algorithms e.g. classification, regression, & clustering
§ Featurization: Feature extraction, Transformation, Dimensionality Reduction
§ Pipelines: Constructing, evaluating and tuning ML pipelines
§ Persistence: Save/Load Algorithms, Models, and Pipelines
§ Utilities: Linear Algebra, Statistics, and Data Handling

§ Transformers: Algorithm that transforms one DataFrame into another
§ Estimators: Algorithm that trains and produces a model
§ Pipelines: Chain multiple Transformers and Estimators together, as per the ML workflow
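
The Transformer, Estimator and Pipeline roles can be sketched in plain Python. Everything below (the class names `Scale`, `Center`, `CenterModel`, and the list-of-dicts stand-in for a DataFrame) is a hypothetical toy written for this example; it mirrors the shape of the spark.ml pattern but is not the Spark API.

```python
from statistics import mean


class Scale:
    """Transformer: multiplies one column by a fixed factor."""

    def __init__(self, column, factor):
        self.column, self.factor = column, factor

    def transform(self, rows):
        return [{**row, self.column: row[self.column] * self.factor} for row in rows]


class CenterModel:
    """Fitted model, itself a Transformer: subtracts a learned mean."""

    def __init__(self, column, learned_mean):
        self.column, self.learned_mean = column, learned_mean

    def transform(self, rows):
        return [{**row, self.column: row[self.column] - self.learned_mean} for row in rows]


class Center:
    """Estimator: learns the column mean and returns a CenterModel."""

    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        return CenterModel(self.column, mean(row[self.column] for row in rows))


class PipelineModel:
    """Result of fitting a Pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains Transformers and Estimators; fit() turns Estimators into models."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator: train it on the data so far.
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # Feed transformed data to the next stage.
            fitted.append(stage)
        return PipelineModel(fitted)


training = [{"x": 1.0}, {"x": 3.0}]
model = Pipeline([Scale("x", 2.0), Center("x")]).fit(training)
output = model.transform(training)
```

The point of the pattern is that fitting a pipeline yields a single model object that replays the same sequence of steps on new data.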
Distributed Deep Learning

DASK
§ Peer a DASK Distributed cluster with TensorFlow running in distributed mode.

Spark with Deep Learning Pipelines
§ APIs for scalable deep learning in Python from Databricks
§ Provides a suite of tools covering loading, Training, Tuning and Deploying
§ Simplifies Distributed Deep Learning Training
§ Supports TensorFlow, Keras and PyTorch
§ Integration with PySpark

Project Hydrogen
§ New scheduling option called Gang Scheduler
Other Dev Considerations…
§ Workloads/APIs
  § Custom Algorithms (only in DASK)
  § SQL, Graph (only in Spark)
§ Debugging Challenges
  § DASK Distributed may not align with normal Python Debugging Tools/Practices
  § PySpark errors may have a mix of JVM and Python Stack Trace
§ Visualization Options
  § Down-sample and use Pandas DataFrames
  § Use open source Libraries e.g. D3, Seaborn, Datashader (only for DASK) etc.
  § Use Databricks Visualization Feature
Which one to Use?
There Are No Solutions, There Are Only Tradeoffs! – Thomas Sowell
Questions?
