Apache Spark Usage in the
Open Source Ecosystem
Hossein Falaki
@mhfalaki
About me
• Software Engineer / part-time Data Scientist at Databricks
• I have been using Apache Spark since version 0.6
• Developed the first version of the Apache Spark CSV data source
• Worked on SparkR and R notebooks at Databricks
Stack Overflow 2016 trending tech
Apache Spark Philosophy
1. Unified engine: support end-to-end applications
2. High-level APIs: easy to use, rich optimizations
3. Integrate broadly: storage systems, libraries, etc.
SQL, Streaming, ML, Graph, …
Databricks Community Edition
• In February, Databricks launched a free version of its cloud-based
platform in beta
• Since then, more than 8,000 users have registered
• Users have created over 61,000 notebooks in different languages
• This is an analysis of the third-party libraries that our beta users
imported to complement Apache Spark in Scala, Python, and R
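The slides do not say how imports were extracted from notebooks. As a rough sketch of one way it could be done for Python cells, the standard-library `ast` module can list the top-level packages a piece of source imports. The function name `imported_packages` and the sample cell are hypothetical, and `ast.parse` only accepts source that is valid for the running Python version.

```python
import ast


def imported_packages(source):
    """Return the set of top-level package names imported by Python source."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b.c" counts as package "a".
            for alias in node.names:
                packages.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports such as "from . import x".
            if node.module and node.level == 0:
                packages.add(node.module.split(".")[0])
    return packages


# Hypothetical notebook cell, for illustration only.
cell = (
    "import re\n"
    "import pandas as pd\n"
    "from sklearn.linear_model import LogisticRegression\n"
)
print(sorted(imported_packages(cell)))  # ['pandas', 're', 'sklearn']
```

Counting only the top-level name keeps `sklearn.linear_model` and `sklearn.cluster` from being treated as two different libraries.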
What % of users use other libraries
Language   % users importing external libs   Average # libs   Median # libs
Python     75%                               9                2
Scala      55%                               3                1
R          57%                               6                1
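To make the three columns concrete, here is a minimal sketch of how such figures could be computed from per-user library sets. It assumes the average and median are taken over users who import at least one library; the slides do not state this. The function name and the sample data are hypothetical.

```python
from statistics import mean, median


def library_usage_stats(libs_per_user):
    """Return (% of users importing any library, average # libs, median # libs).

    Assumption: average and median are computed only over users who
    import at least one external library.
    """
    counts = [len(libs) for libs in libs_per_user.values()]
    importing = [c for c in counts if c > 0]
    if not importing:
        return 0.0, 0, 0
    pct = 100.0 * len(importing) / len(counts)
    return pct, mean(importing), median(importing)


# Hypothetical users, for illustration only.
users = {
    "u1": {"pandas", "numpy", "re"},
    "u2": {"json"},
    "u3": set(),
    "u4": {"re", "os"},
}
print(library_usage_stats(users))
```

An average well above the median, as in the Python row, suggests a small number of users importing many libraries.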
Installing libraries is easy
Python Packages
Most popular Python packages
What is test_helper?
What are these?
ETL
• re
• datetime
• pandas
• json
• csv
• string
• math / operator
• urllib / urllib2
Visualization
• matplotlib
• ggplot
• seaborn
Advanced analytics
• numpy
• sklearn
• graphframes
• tensorflow
• scipy
Other
• test_helper
• os
• md5
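The grouping above can be expressed as a simple lookup table. The sketch below copies the four categories from this slide; the helper names `categorize` and `category_counts`, and the "Uncategorized" fallback, are assumptions added for illustration.

```python
from collections import Counter

# Categories and members as listed on the slide.
PYTHON_CATEGORIES = {
    "ETL": {"re", "datetime", "pandas", "json", "csv", "string",
            "math", "operator", "urllib", "urllib2"},
    "Visualization": {"matplotlib", "ggplot", "seaborn"},
    "Advanced analytics": {"numpy", "sklearn", "graphframes",
                           "tensorflow", "scipy"},
    "Other": {"test_helper", "os", "md5"},
}


def categorize(package):
    """Map a package name to its category, or 'Uncategorized' if unknown."""
    for category, members in PYTHON_CATEGORIES.items():
        if package in members:
            return category
    return "Uncategorized"


def category_counts(packages):
    """Count how many of the given packages fall into each category."""
    return Counter(categorize(p) for p in packages)


print(category_counts(["re", "numpy", "seaborn", "json", "requests"]))
```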
Python package categories
What packages go together?
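One simple way to approach this question is to count how often two packages appear in the same notebook. The slide does not describe the method actually used, so the sketch below is only an assumption; `co_occurrence` and the sample notebooks are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(package_sets):
    """Count how many notebooks contain each unordered pair of packages."""
    pairs = Counter()
    for packages in package_sets:
        # Sorting makes (a, b) and (b, a) the same key.
        for a, b in combinations(sorted(packages), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical per-notebook import sets, for illustration only.
notebooks = [
    {"numpy", "pandas", "matplotlib"},
    {"numpy", "pandas"},
    {"re", "json"},
]
print(co_occurrence(notebooks).most_common(1))  # [(('numpy', 'pandas'), 2)]
```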
Scala Packages
Most popular Scala libraries
What are these?
ETL
• java/scala util
• scala.collection
• scala.math
• java.{io, nio}
• java.text
• org.apache.commons
• kafka
• twitter4j
Visualization
• ?
Advanced analytics
• spark.ml
• graphframes
Other
• java.net
• scala.sys
Scala package categories
What libraries go together?
R Packages
Most popular R packages
What are these?
ETL
• dplyr
• plyr
• reshape2
• jsonlite
• tidyr
• lubridate
• httr
• data.table
Visualization
• ggplot2
• beanplot
• plotly
• ...
Advanced analytics
• sparkr
• h2o
• caret
• e1071
Other
• devtools
• magrittr
R package categories
Comparing Python, Scala & R
Languages have unique features
Scala / Python / R    R / Python    Scala / Python / R
• 25% of users use multiple languages
• 3% of notebooks mix different languages
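As a sketch of how a mixed-language notebook could be detected, the code below assumes that a cell switches language with a leading magic command such as `%python`, `%scala`, or `%r`, and that every other cell uses the notebook's default language. The notebook representation (a list of cell strings plus a default language) and the function names are hypothetical.

```python
# Assumed language magic commands at the start of a cell.
LANGUAGE_MAGICS = {"%python": "python", "%scala": "scala", "%r": "r"}


def cell_language(cell, default_language):
    """Return the language of a cell based on its first token."""
    stripped = cell.lstrip()
    first_token = stripped.split(None, 1)[0] if stripped else ""
    return LANGUAGE_MAGICS.get(first_token.lower(), default_language)


def notebook_languages(cells, default_language):
    """Return the set of languages used across all cells of a notebook."""
    return {cell_language(c, default_language) for c in cells}


def mixes_languages(cells, default_language):
    """True if the notebook uses more than one language."""
    return len(notebook_languages(cells, default_language)) > 1


# Hypothetical notebook, for illustration only.
cells = ["import re", "%r\nlibrary(ggplot2)"]
print(mixes_languages(cells, "python"))  # True
```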
Summary
• Spark users extensively mix it with other packages in different languages
– One of the goals of the Spark project is to work well with other projects
• ETL-related libraries are the most popular category
– Opportunities for new data sources
• Notebooks are being used for “small data” as well as “big data.”
• Languages and their ecosystems have diverse capabilities. Users seem to
be mixing languages to their advantage
– Scala is missing visualization libraries
Try your favorite library in Databricks
http://databricks.com/ce
Try the latest version of Apache Spark and a preview of Spark 2.0
Thank you!
What packages are used together?
