DASK and Apache Spark

Gurpreet Singh, Microsoft
#UnifiedAnalytics #SparkAISummit

This talk is also about scaling Python for data analysis and machine learning!
We'll start by briefly reviewing the scalability challenge in the PyData stack, before comparing and contrasting DASK and Apache Spark.

Pandas & Scikit-learn: Options to Scale
§ GIL Challenge
§ Multiprocessing
§ Concurrent Futures
§ JobLib
§ partial_fit()
§ Hashing Trick
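
To make the Hashing Trick concrete, here is a minimal sketch in plain Python. The function name hash_features and the bucket count are illustrative assumptions, not something shown in the talk; scikit-learn ships its own implementations (for example HashingVectorizer), which differ in detail.

```python
import hashlib


def hash_features(tokens, n_buckets=8):
    """Map a list of string tokens to a fixed-length count vector.

    Hypothetical helper for illustration: each token is hashed to a
    bucket index, so no vocabulary needs to be built or stored.
    """
    vec = [0] * n_buckets
    for tok in tokens:
        # md5 is used only because it is stable across runs and processes
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "little") % n_buckets
        vec[idx] += 1
    return vec


tokens = "spark dask spark pandas".split()
vec = hash_features(tokens, n_buckets=8)
print(vec)
```

Because the output width is fixed up front, each chunk of data can be vectorized independently, which is what lets the trick pair with incremental learners that expose partial_fit().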

Design Approach

Apache Spark
§ High Level APIs: Spark Streaming, Spark MLlib, GraphFrames, Spark SQL / DataFrames
§ Low Level APIs: RDD
§ Task Scheduler: Local, Spark Standalone, Mesos, YARN
§ GraphX doesn't support Python

DASK
§ High Level APIs: DASK Arrays [Parallel NumPy], DASK DataFrames [Parallel Pandas], DASK-ML [Parallel Scikit-learn], DASK Bag [Parallel Lists]
§ Low Level APIs: DASK Delayed, DASK Futures
§ Custom Algorithms
§ Custom Graphs: submit graph as Python dictionary object
§ Pluggable Task Scheduling System: Local Scheduler (Synchronous, Multiprocessing, Threaded), Distributed

Shared
§ Directed Acyclic Graph (DAG)
§ Lazy Execution
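
The idea of submitting a graph as a Python dictionary can be sketched as follows. The dictionary follows the DASK convention that a task is a tuple whose first element is a callable; the tiny get function below is a toy evaluator written for this example (no caching, no parallelism, string keys only) and is not DASK's scheduler.

```python
from operator import add, mul


def get(graph, key):
    """Toy evaluator for a dict-based task graph (illustration only)."""

    def evaluate(arg):
        # A tuple starting with a callable is a task: evaluate its
        # arguments first, then call the function.
        if isinstance(arg, tuple) and arg and callable(arg[0]):
            func, *args = arg
            return func(*(evaluate(a) for a in args))
        # A string that names a key in the graph is a dependency.
        if isinstance(arg, str) and arg in graph:
            return evaluate(graph[arg])
        # Anything else is a literal value.
        return arg

    return evaluate(key)


graph = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),   # z = x + y
    "w": (mul, "z", 10),    # w = z * 10
}

result = get(graph, "w")
print(result)  # (1 + 2) * 10 = 30
```

With DASK installed, a dictionary of this shape should be accepted by one of the real schedulers, for example dask.threaded.get(graph, "w"), which is where the pluggable scheduling comes in.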

DASK DataFrame & PySpark

DASK DataFrames [Parallel Pandas]
Challenges
§ DASK DataFrames API is not identical with Pandas API
§ Performance Concerns with Operations involving Shuffling
§ Inefficiencies of Pandas are carried over
Recommendations
§ Follow the Pandas Performance tips
§ Avoid Shuffle, Use pre-sorting, Persist the Results

PySpark
Challenges
§ Performance Concerns due to the PySpark Design
Recommendations
§ Use DataFrames API
§ Use Vectorized/Pandas UDF (Spark v2.3 onwards)

Code Review
[Code screenshots; callouts: Wildcard, Additional Details]

§ While Pandas displays a sample of the data, DASK and PySpark show metadata of the DataFrame.
§ The npartitions value shows how many partitions the DataFrame is split into.
§ DASK created a DAG with 99 nodes to process the data.
§ The .compute() or .head() method tells DASK to go ahead and run the computation and display the results.
§ In PySpark, .show() displays the DataFrame in a tabular form.

Scalable Machine Learning

DASK-ML (OCT '17)
How does DASK-ML work?
§ Parallelize Scikit-Learn: Distributed JobLib
    # Register DASK as a joblib backend (API as shown on the slide;
    # newer scikit-learn releases no longer ship sklearn.externals.joblib)
    from sklearn.externals.joblib import _dask, parallel_backend
    from sklearn.utils import register_parallel_backend
    register_parallel_backend('distributed', _dask.DaskDistributedBackend)
§ Re-implement Algorithms
§ Partner with existing Libraries
    from dask_ml.xgboost import XGBRegressor
    est = XGBRegressor(...)
    est.fit(train, train_labels)
    prediction = est.predict(test)

Spark MLlib
As of Spark 2.0, the primary Machine Learning API for Spark is the DataFrame-based API in the spark.ml package. It provides:
§ Algorithms: Common algorithms, e.g. classification, regression, and clustering
§ Featurization: Feature extraction, Transformation, Dimensionality Reduction
§ Pipelines: Constructing, evaluating and tuning ML pipelines
§ Persistence: Save/Load Algorithms, Models, and Pipelines
§ Utilities: Linear Algebra, Statistics, and Data Handling

Scalable ML Approaches
§ Spark for Feature Engineering + Scikit-learn etc. for Learning
§ Distributed ML Algorithms from Spark MLlib
§ Train/evaluate Scikit-learn models in parallel (spark-sklearn)

Key concepts
§ Transformers: Algorithm that transforms one DataFrame into another
§ Estimators: Algorithm that trains and produces a model
§ Pipelines: Chain multiple Transformers and Estimators as per the ML flow
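
The Transformer / Estimator / Pipeline contract can be illustrated without a Spark cluster. The sketch below uses a list of dicts as a stand-in for a DataFrame, and every class name in it is hypothetical; it mirrors the shape of the spark.ml design (fit returns a model that is itself a transformer) but is not the spark.ml API.

```python
class AddLength:
    """Transformer: returns new rows with an extra 'length' column."""

    def transform(self, rows):
        return [{**r, "length": len(r["text"])} for r in rows]


class ThresholdModel:
    """Model produced by an estimator; it is itself a transformer."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [{**r, "prediction": int(r["length"] > self.threshold)}
                for r in rows]


class MeanThreshold:
    """Estimator: fit() learns a threshold and returns a model."""

    def fit(self, rows):
        mean = sum(r["length"] for r in rows) / len(rows)
        return ThresholdModel(mean)


class PipelineModel:
    """Fitted pipeline: applies each fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains transformers and estimators; fit() returns a PipelineModel."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimator -> model
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


data = [{"text": "a"}, {"text": "abc"}, {"text": "abcde"}]
model = Pipeline([AddLength(), MeanThreshold()]).fit(data)
predictions = [r["prediction"] for r in model.transform(data)]
print(predictions)
```

Keeping fitted state in a separate model object is what allows the same pipeline definition to be refit on new data and the fitted result to be saved and reloaded.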

Distributed Deep Learning

DASK
§ Peer a DASK Distributed cluster with TensorFlow running in distributed mode.

Apache Spark: Deep Learning Pipelines, Project Hydrogen
§ APIs for scalable deep learning in Python from Databricks
§ Provides a suite of tools covering Loading, Training, Tuning and Deploying
§ Simplifies Distributed Deep Learning Training
§ Supports TensorFlow, Keras and PyTorch
§ Integration with PySpark
§ New scheduling option called Gang Scheduler

Other Dev Considerations…
§ Workloads/APIs
    § Custom Algorithms (only in DASK)
    § SQL, Graph (only in Spark)
§ Debugging Challenges
    § DASK Distributed may not align with normal Python Debugging Tools/Practices
    § PySpark errors may have a mix of JVM and Python Stack Trace
§ Visualization Options
    § Down-sample and use Pandas DataFrames
    § Use open source Libraries e.g. D3, Seaborn, Datashader (only for DASK) etc.
    § Use Databricks Visualization Feature

Which one to Use?
There Are No Solutions, There Are Only Tradeoffs! – Thomas Sowell

Questions?
