Project Zen: Improving Apache
Spark for Python Users
Hyukjin Kwon
Databricks Software Engineer
Hyukjin Kwon
▪ Apache Spark Committer / PMC
▪ Major Koalas contributor
▪ Databricks Software Engineer
▪ @HyukjinKwon on GitHub
Agenda
What is Project Zen?
Redesigned Documentation
PySpark Type Hints
Distribution Option for PyPI Users
Roadmap
What is Project Zen?
Python Growth
68%
of notebook
commands on
Databricks are in
Python
PySpark Today
▪ Documentation difficult to
navigate
▪ All APIs under each module are listed on a single page
▪ No other information or classification
▪ Lack of information
▪ No Quickstart page
▪ No Installation page
▪ No Introduction
PySpark documentation
PySpark Today
▪ IDE unfriendly
▪ Dynamically defined functions reported as missing
functions
▪ Lack of autocompletion support
▪ Lack of type checking support
▪ Notebook unfriendly
▪ Lack of autocompletion support
Missing import in IDE
Autocompletion in IDE
Autocompletion in Jupyter
PySpark Today
▪ Less Pythonic
▪ Creating a DataFrame from built-in Python instances (e.g., dict) is deprecated
>>> spark.createDataFrame([{'a': 1}])
/.../session.py:378: UserWarning: inferring schema from dict is deprecated, please use
pyspark.sql.Row instead
PySpark Today
▪ Missing distributions with other Hadoop versions in PyPI
▪ Missing Hadoop 3 distribution
▪ Missing Hive 1.2 distribution
Apache Mirror
PyPI Distribution
PySpark Today
▪ Inconsistent exceptions and warnings
▪ Unclassified exceptions and warnings
>>> spark.range(10).explain(1, 2)
Traceback (most recent call last):
...
Exception: extended and mode should not be set together.
The Zen of Python
PEP 20 -- The Zen of Python
Project Zen (SPARK-32082)
▪ Be Pythonic
▪ The Zen of Python
▪ Python friendly
▪ Better and easier use of PySpark
▪ Better documentation
▪ Clear exceptions and warnings
▪ Python type hints: autocompletion, static type checking and error detection
▪ More options for pip installation
▪ Better interoperability with other Python libraries
▪ pandas, pyarrow, NumPy, Koalas, etc.
▪ Visualization
Redesigned Documentation
Problems in PySpark Documentation
(Old) PySpark documentation
▪ Everything in a few pages
▪ Whole module in single page w/o
classification
▪ Difficult to navigate
▪ Very long to scroll down
▪ Virtually no structure
▪ No other useful pages
▪ How to start?
▪ How to ship 3rd party packages together?
▪ How to install?
▪ How to debug / setup an IDE?
New PySpark Documentation
New PySpark documentation
New user guide page
Search
Other pages
Top menu
Contents in
the current
page
Sub-titles in
the current
page
New API Reference
New API reference page
└── module A
├── classification A
…
└── classification ...
Table for each classification
Each page for each API
Quickstart
Quickstart page
Live Notebook
Move to live notebook (Binder integration)
Other New Pages
New useful pages
PySpark Type Hints
What are Python Type Hints?
def greeting(name):
return 'Hello ' + name
Typical Python code
def greeting(name: str) -> str:
return 'Hello ' + name
Python code with type hints
def greeting(name: str) -> str: ...
Stub syntax (.pyi file)
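The annotated form above is what IDEs and type checkers consume. A small runnable sketch showing that the hints are plain, inspectable metadata at runtime (not enforced by the interpreter):

```python
from typing import get_type_hints

def greeting(name: str) -> str:
    return 'Hello ' + name

# Tooling (IDEs, mypy) reads these annotations to offer
# autocompletion and static error detection; at runtime they
# are just data attached to the function:
hints = get_type_hints(greeting)
assert hints == {'name': str, 'return': str}
assert greeting('PySpark') == 'Hello PySpark'
```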
Why are Python Type Hints good?
▪ IDE Support
▪ Notebook Support
▪ Documentation
▪ Static error detection
Before type hints
After type hints
Static error detection
https://github.com/zero323/pyspark-stubs#motivation
Python Type Hints in PySpark
Built-in in the upcoming Apache Spark 3.1!
Community support: zero323/pyspark-stubs
User-facing APIs only
Stub (.pyi) files
Installation Option for PyPI Users
PyPI Distribution
PySpark on PyPI
PyPI Distribution
▪ Multiple distributions available
▪ Hadoop 2.7 and Hive 1.2
▪ Hadoop 2.7 and Hive 2.3
▪ Hadoop 3.2 and Hive 2.3
▪ Hive 2.3 without Hadoop
▪ PySpark distribution in PyPI
▪ Hadoop 2.7 and Hive 1.2
Multiple distributions
in Apache Mirror
One distribution in PyPI
New Installation Options
HADOOP_VERSION=3.2 pip install pyspark
HADOOP_VERSION=2.7 pip install pyspark
HADOOP_VERSION=without pip install pyspark
PYSPARK_RELEASE_MIRROR=http://mirror.apache-kr.org HADOOP_VERSION=2.7 pip install pyspark
Spark with Hadoop 3.2
Spark with Hadoop 2.7
Spark without Hadoop
Spark downloading from the
specified mirror
Why not pip --install-options?
Ongoing issues in pip
Roadmap
Roadmap
▪ Migrate to NumPy documentation style
▪ Better classification
▪ Better readability
▪ Widely used
"""Specifies some hint on the current
:class:`DataFrame`.
:param name: A name of the hint.
:param parameters: Optional parameters.
:return: :class:`DataFrame`
"""Specifies some hint on the
current :class:`DataFrame`.
Parameters
----------
name : str
A name of the hint.
parameters : dict, optional
Optional parameters
Returns
-------
DataFrame
reST style
Numpydoc style
Roadmap
▪ Standardize warnings and exceptions
▪ Classify the exception and warning types
▪ Python-friendly messages instead of a JVM stack trace
>>> spark.range(10).explain(1, 2)
Traceback (most recent call last):
...
Exception: extended and mode should not be set together.
Plain Exception being thrown
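What "classify the exception and warning types" could look like, as a hedged sketch (illustrative only; the class names here are hypothetical, not the actual PySpark hierarchy): a base PySpark error plus subclasses that also inherit the matching Python built-in, so callers can catch either.

```python
# Hypothetical classified-exception sketch, not real PySpark classes.
class PySparkException(Exception):
    """Base class for errors raised by PySpark itself."""

class PySparkValueError(PySparkException, ValueError):
    """Invalid argument combinations, e.g. passing both
    `extended` and `mode` to DataFrame.explain()."""

try:
    raise PySparkValueError("extended and mode should not be set together.")
except ValueError as e:  # catchable as a plain ValueError...
    caught = e

assert isinstance(caught, PySparkException)  # ...or as a PySpark error
```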
Roadmap
▪ Interoperability between NumPy,
Koalas, other libraries
▪ Common features in DataFrames
▪ NumPy universal functions (ufuncs)
▪ Visualization and plotting
▪ Make a chart from Spark DataFrame
Re-cap
Re-cap
▪ Python and PySpark are becoming more and more popular
▪ PySpark documentation is redesigned with many new pages
▪ Auto-completion and type checking in IDE and notebooks
▪ PySpark download options in PyPI
Re-cap: What’s next?
▪ Migrate to NumPy documentation style
▪ Standardize warnings and exceptions
▪ Visualization
▪ Interoperability between NumPy, Koalas, other libraries
Questions?