Analytics pipelines with
Jupyter and Spark
Who we are
● NETOPIA
● mobilPay
● mobilPay Wallet
● web2sms
● btko.in
● kartela.ro
● mobilender.mx
Challenges
Three-dimensional problem
● Time: Past events or crystal ball?
● Profile: Who is looking at the data?
● Quantity: How much data is there to look at?
Profile
● Data Scientist
● Data Engineer
Quantity
● Hundreds of MB to a few GB
● Up to a million events/records
vs.
● GB to TB to PB
● Hundreds of millions to billions of events/records, and beyond
Also
● Computing vs. Storage
● Vertical vs. Horizontal scalability
● Distributed/ML libraries
● Dependency hell
Time
● Past: Analytics
● NOW
● Future: Forecasting (a.k.a. Prediction)
“Classic” Approach
● Data Engineer
○ Small Data: grep, sed, awk
○ Big Data: Java, Scala, Python, PIG, Hadoop, lately Spark & others
● Data Scientist
○ Small Data: R/RStudio
○ Big Data: No way, José!
New Approach
● Data Engineer and Data Scientist, for both Small Data and Big Data:
○ Notebook technologies: Jupyter (most used), Zeppelin, but also lesser-known ones (Rodeo, Beaker)
Data analysis with
Jupyter, Pandas and Spark
Outline
About the data:
● Set of mobile transactions
● Set (separate) of retail transactions
About the tools: Jupyter, Pandas and Spark
Our experience
Future work
Datasets
● Mobile transactions
○ Elements of analysis: Transactions
○ We know: Transaction value, User identifier, Merchant
○ We don’t know: What product was bought
○ Size: Hundreds of thousands of entries
○ Status: Building prediction models
● Retail data
○ Elements of analysis: Transactions, Products, Stock data
○ We know: Transaction value, Sold products, Merchant
○ We don’t know: Who the user is
○ Size: Hundreds of millions of entries
○ Status: Gathering data
Mobile transactions data
Mobile data: Environment
[Diagram: Jupyter notebooks in a Docker container with Anaconda]
● Input sources: SQL database, CSV files, pickle files, other input sources
● Preprocessing notebooks
● Analysis and model testing notebooks: Pandas, R (with rpy2), scikit-learn, custom code
● Other diagram labels: Diagnostics, Cleaning, Feature building, Raw data, Models, Visualizations
Docker image
… with Anaconda
● Anaconda: package manager for data science
● Using docker-compose for setting up container parameters
● Many available images
● Our base image:
○ pyspark from Jupyter Docker Stacks
○ Extended with required libraries
● Libraries are added or updated with docker build:
○ Self-contained
○ Easy versioning
Jupyter Notebook (1)
Web application for creating documents with live code, explanations and visualizations
● Initially, part of IPython
● Narrative with live code
● Protocol for interactive exploration
○ Run blocks of code
○ Embedded JS
● Executable documents
○ Code
○ HTML and Markdown
○ Metadata
● Kernels for multiple languages
○ Python
○ R
○ Scala
○ Bash
● Internal format: JSON
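To make the "Internal format: JSON" point concrete, here is a minimal sketch of what a notebook document looks like when built by hand with Python's standard `json` module. The cell contents and ids are made up for illustration; the overall layout (cells, metadata, nbformat fields) follows the version 4 notebook format as we understand it.

```python
import json

# A minimal notebook document: one Markdown cell and one code cell.
# Cell ids and contents are hypothetical, chosen only for illustration.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {
        "kernelspec": {"name": "python3", "display_name": "Python 3", "language": "python"}
    },
    "cells": [
        {
            "cell_type": "markdown",
            "id": "intro",
            "metadata": {},
            "source": ["# Retention analysis"],
        },
        {
            "cell_type": "code",
            "id": "hello",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('hello')"],
        },
    ],
}

# Serialize to the text that would be stored in an .ipynb file, then read it back.
text = json.dumps(notebook, indent=1)
loaded = json.loads(text)
cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)
```

Because the file is plain JSON with outputs and execution counts embedded, small edits can produce large diffs, which is one reason versioning notebooks is awkward.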
Jupyter Notebook (2)
Web application for creating documents with live code, explanations and visualizations
● Plugins and widgets
● Easy to share (formats: Notebook, PDF, HTML, …)
● Large ecosystem
○ Jupyter Lab / Jupyter Hub
○ GitHub visualizations
○ Blog integration
○ Education: teaching, evaluation
○ Microsoft, Google, Bloomberg, IBM, O'Reilly
○ Executable books
● Versioning is complicated
Pandas
Python library for data manipulation and analysis
● DataFrame objects
○ Tabular data structures
○ Each column has one data type
● Based on numpy (fast)
● Processing is (mostly) done in memory
● Data manipulation:
○ Hierarchical indexing
○ Reshaping, pivoting, grouping
○ String operations
○ Time series operations
● Reading from and writing to many formats (CSV, JSON, HDF5, …)
● Visualization: matplotlib, Seaborn, Bokeh, …
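A short sketch of the Pandas operations listed on this slide (typed columns, grouping, a time series bucket, pivoting). The transactions and column names are invented for illustration and are not taken from the datasets described in the talk.

```python
import pandas as pd

# Hypothetical transactions; column names are illustrative only.
tx = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "merchant": ["A", "B", "A", "A", "B"],
    "value": [10.0, 5.0, 7.5, 2.5, 20.0],
    "ts": pd.to_datetime(
        ["2016-01-03", "2016-01-20", "2016-01-05", "2016-02-11", "2016-02-15"]
    ),
})

# Each column has a single data type.
print(tx.dtypes)

# Grouping: total value per merchant.
per_merchant = tx.groupby("merchant")["value"].sum()

# Time series operation: bucket timestamps by calendar month.
tx["month"] = tx["ts"].dt.to_period("M").astype(str)

# Pivoting: merchants as rows, months as columns.
pivot = tx.pivot_table(
    index="merchant", columns="month", values="value", aggfunc="sum", fill_value=0.0
)
print(per_merchant)
print(pivot)
```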
rpy2
Interface between Python and R
● Translates DataFrames between Python and R
● In a Python notebook, use the %%R cell magic to run R code
● Direct access to R objects (rpy2.robjects)
Jupyter, Pandas and R
[Screenshot: a notebook combining Python, R with rpy2, and HTML and Markdown]
Mobile data: User retention
Active users:
● Classic: 1+ transactions in a given period
● Rolling: 1+ transactions in a given or subsequent period
Plots:
● X: period (day, week, month)
● Y (cohort): period or another type of segment
● By transaction criteria (merchant, product, etc.)
Results:
● Response to campaigns
● Activity recurrence
[Plot axes: Cohorts (Y), Periods (X)]
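The two definitions of an active user can be sketched with Pandas as follows. This is one possible reading rather than the code behind the plots: the cohort is assumed to be the month of a user's first transaction, "rolling" is read as "has a transaction in this period or any later one", and the data and column names are made up.

```python
import pandas as pd

# Hypothetical transactions; user_id and ts are illustrative column names.
tx = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 3],
    "ts": pd.to_datetime([
        "2016-01-05", "2016-02-10", "2016-03-02",  # user 1: active every month
        "2016-01-20", "2016-03-15",                # user 2: skips February
        "2016-02-01", "2016-02-25",                # user 3: first seen in February
    ]),
})

# Period = calendar month (as a sortable string); cohort = month of first transaction.
tx["period"] = tx["ts"].dt.strftime("%Y-%m")
tx["cohort"] = tx.groupby("user_id")["period"].transform("min")

# Classic: distinct users with 1+ transactions in each (cohort, period) cell.
classic = tx.groupby(["cohort", "period"])["user_id"].nunique().unstack(fill_value=0)

# Rolling (assumed reading): a user counts in every period between their
# first and last active period, even if a period in between has no transaction.
periods = sorted(tx["period"].unique())
span = tx.groupby("user_id")["period"].agg(["min", "max"])
rows = [
    (first, period, user)
    for user, first, last in zip(span.index, span["min"], span["max"])
    for period in periods
    if first <= period <= last
]
rolling = (
    pd.DataFrame(rows, columns=["cohort", "period", "user_id"])
    .groupby(["cohort", "period"])["user_id"].nunique()
    .unstack(fill_value=0)
)
print(classic)
print(rolling)
```

In this toy data, user 2 skips February, so the January cohort shows one active user for February under the classic definition and two under the rolling one.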
Mobile data: Correlations
Features:
● How similar are two features?
Merchants:
● Which merchants have common users?
Products:
● Which products are sold together?
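One simple way to answer "which merchants have common users?" is to compare the sets of users seen at each merchant. The sketch below uses Jaccard similarity, which is our choice for illustration and not necessarily the measure used in the talk; merchant and user names are invented.

```python
from itertools import combinations

# Hypothetical (user, merchant) pairs extracted from transactions.
transactions = [
    ("u1", "coffee_shop"), ("u1", "bookstore"),
    ("u2", "coffee_shop"), ("u2", "bookstore"),
    ("u3", "coffee_shop"), ("u3", "pharmacy"),
    ("u4", "pharmacy"),
]

# Collect the set of distinct users per merchant.
users_by_merchant = {}
for user, merchant in transactions:
    users_by_merchant.setdefault(merchant, set()).add(user)


def jaccard(a, b):
    """Share of users common to both sets, relative to all users in either set."""
    return len(a & b) / len(a | b)


# Similarity for every unordered pair of merchants.
overlap = {
    (m1, m2): jaccard(users_by_merchant[m1], users_by_merchant[m2])
    for m1, m2 in combinations(sorted(users_by_merchant), 2)
}
for pair, score in sorted(overlap.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

The same approach applies to "which products are sold together?" by replacing users with transaction identifiers and merchants with products.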
Mobile data: Clusters
● Group users by behavior
● Identify outliers
● Future: automatic cluster labeling
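A minimal sketch of grouping users by behavior and flagging outliers with scikit-learn. The choice of k-means, the two features, and the "farthest from its cluster center" outlier rule are assumptions made for illustration; the talk does not say which algorithm was used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-user features: [transactions per month, average value].
features = np.array([
    [2, 10], [3, 12], [2, 11], [3, 9],        # occasional, low-value users
    [20, 50], [22, 55], [21, 52], [19, 48],   # frequent, higher-value users
    [8, 60],                                  # unusual mix of the two
], dtype=float)

# Put both features on a comparable scale before computing distances.
scaled = StandardScaler().fit_transform(features)

# Group users into two behavioral clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
labels = km.labels_

# Distance of each user to the center of their own cluster;
# the largest distance is treated as the outlier candidate.
dist = np.linalg.norm(scaled - km.cluster_centers_[labels], axis=1)
outlier_idx = int(np.argmax(dist))
print(labels, outlier_idx)
```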
Retail transactions data
Retail data: Our experience
First try: Out-of-core processing with HDF5
● Data did not fit in memory
● HDF5: format for large data
● Pandas + HDF5, Blaze, Dask, Odo
● Easy to use functions
● Library incompatibilities
● Slow queries, use indexes
● Occasional runtime errors
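The out-of-core idea itself is independent of the storage format: read a bounded chunk, reduce it, and combine the partial results. The sketch below shows it with chunked CSV reading in Pandas rather than HDF5, purely to keep the example free of extra dependencies; the data and column names are invented.

```python
import io

import pandas as pd

# Stand-in for a file too large to load at once (hypothetical retail rows).
csv_data = io.StringIO(
    "store,product,amount\n"
    "S1,milk,2.0\n"
    "S2,bread,1.5\n"
    "S1,eggs,3.0\n"
    "S2,milk,2.5\n"
    "S1,bread,1.0\n"
)

# Process two rows at a time and merge the partial sums per store.
totals = {}
for chunk in pd.read_csv(csv_data, chunksize=2):
    partial = chunk.groupby("store")["amount"].sum()
    for store, amount in partial.items():
        totals[store] = totals.get(store, 0.0) + float(amount)

print(totals)
```

Only one chunk is held in memory at a time, so the peak footprint depends on the chunk size rather than on the size of the file.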
Retail data: Environment
[Diagram: Jupyter notebooks in Docker containers with Spark and Anaconda]
● Input sources: CSV files, Apache Parquet, Cassandra, other input sources
● Preprocessing notebooks
● Analysis and model testing notebooks
○ Large data: Spark ML + scikit-learn
○ Small (selection) data: Pandas, scikit-learn and R
● Other diagram labels: Diagnostics, Cleaning, Feature building, Raw data, Models, Visualizations, In progress
Spark
Engine for big data processing
● DataFrames
○ Built on top of RDDs
○ Similar to Pandas and R
○ SQL queries
○ Automatic query optimization through query plan
○ String, date-time and statistics functions
○ Group by, filters
● Jupyter integration: work in progress
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
Spark Machine Learning
MLlib and ML
● MLlib
○ Uses RDDs
○ Summaries, correlations, sampling
○ SVMs, logistic regression, decision trees, ensembles and Naive Bayes
○ Clustering
○ Feature transformation
● ML
○ Works with DataFrames
○ Many wrappers for MLlib
○ Pipelines:
■ Transformers, Estimators, Parameters
■ labelCol, featuresCol, predictionCol, ...
○ R formulas (y ~ x1 + x2)
Retail data: Our experience
Current: Spark + Docker
● No issues at current size (several GBs)
● Docker Compose for creating master, workers and Jupyter container (driver)
● ML libraries are easy to work with
● Incomplete Python API for ML (e.g., summaries)
● Documentation needs improvement
● Model diagnostics
○ Some metrics are available
○ Supplement with scikit-learn (example: build ROC curves)
● scikit-learn or R on top of Spark
○ Parallelize parameter search (e.g., grid search)
○ Spark sklearn (github.com/databricks/spark-sklearn): Grid Search
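As an example of supplementing Spark's metrics with scikit-learn, the sketch below builds a ROC curve from labels and scores. It assumes the (label, score) pairs have already been collected from a Spark prediction DataFrame to the driver and fit in memory; the four values here are made up.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Hypothetical true labels and predicted scores for the positive class.
labels = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# False positive rate and true positive rate at each score threshold.
fpr, tpr, thresholds = roc_curve(labels, scores)

# Area under the ROC curve summarizes the curve in one number.
roc_auc = auc(fpr, tpr)
print(fpr, tpr, roc_auc)
```

The `fpr` and `tpr` arrays can be passed directly to a plotting library such as matplotlib to draw the curve in the notebook.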
Future work
Mobile wallet transactions:
● Data fits in memory
● Use Spark for distributing workload
ERP transactions:
● Some data fits in memory, after processing
● Build a web app for data exploration
● Forecast
○ Sales
○ Inventory requirements
● Try Spark Streaming
http://xkcd.com/1425/

Data analysis with Pandas and Spark
