Practical Medium Data Analytics with Python
10 Things I Hate About pandas
PyData NYC 2013

Wes McKinney
@wesmckinn
• Former quant and MIT math dude
• Creator of the pandas project for Python
• Author of Python for Data Analysis (O’Reilly)
• Founder and CEO of DataPad

www.datapad.io

• Python for Data Analysis: > 20k copies since Oct 2012
• Bringing many new people to Python and data analysis with code

• http://datapad.io
• Founded in 2013, located in SF
• In private beta, join us!
• Hiring for engineering

Why hate on pandas?
pandas rocks!
So, pandas
• Easy-to-use, fast in-memory data wrangling and analytics library
• Enabled loads of complex data work to be done by mere mortals in Python
• Might have kept R from taking over the world (hehe)

pandas, the project
• 170 distinct contributors
• Over 5400 issues and pull requests on GitHub
• Upcoming 0.13 release

But.
• pandas’s broad applicability is also a liability
• pandas is being used in some unplanned ways
• Only game in town for a lot of things

Some things to love
• No more structured dtype drudgery!
• Easy IO!
• Data alignment!
• Hierarchical indexing!
• Time series analytics!
More things to love
• Table reshaping
• Missing data handling
• pandas.merge, pandas.concat
• Expressive groupby machinery

Some pandas use cases
• General data wrangling
• ETL jobs
• Business analytics (incl. BI uses)
• Time series analysis, statistical modeling

pandas does many things that are tedious, slow, or difficult to do correctly without it

Unfortunately, pandas is not a database

#1 Slightly too far from the metal
• DataFrame’s internal structure intended to make row-oriented ops fast on numerical data
• Python objects can be used as data, indices (a feature, not a bug)

#2 No support (yet) for memory maps
• Many analytics ops require only a small portion of the data
• Many ways to “materialize” the full data set in memory by accident
• Axis indexes wouldn’t necessarily make sense on out-of-core data sets
• N.B. HDF5/PyTables support is a partial solution

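To make the first bullet concrete, here is a minimal sketch using NumPy's `numpy.memmap` (plain NumPy, not a pandas feature). The file name, array size, and slice are made up for the example. It writes one column of floats to disk, reopens the file as a memory map, and sums a small window, so the operating system only has to page in the part of the file that is actually touched.

```python
import os
import tempfile

import numpy as np

# Hypothetical on-disk column: one million float64 values in a raw binary file.
n_rows = 1_000_000
path = os.path.join(tempfile.mkdtemp(), "amounts.f8")
np.arange(n_rows, dtype="float64").tofile(path)

# Re-open the file as a read-only memory map instead of loading all of it.
amounts = np.memmap(path, dtype="float64", mode="r", shape=(n_rows,))

# Touch only a small window of the data.
window_total = float(amounts[1000:1010].sum())
print(window_total)
```

Wrapping such an array in a pandas object is where the accidental "materialize" problem tends to show up, since many operations copy their input.
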
#3 No tight database integration
• Makes it difficult to be a serious tool in an ETL toolchain on top of some SQL-ish system
• Inadequacy of pandas/NumPy data type systems
• Jobs with heavy SQL-reading are slow and use tons of memory
• TODO: integrate pandas with ODBC C API and write out SQL data directly into NumPy arrays

#4 Best-efforts NA representation
• Inconsistent representation of missing data
• NA needs to be a first class citizen in analytics operations
• No Boolean or Integer NA values

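A small sketch of what "best-efforts" looks like in practice, assuming pandas' default NumPy-backed dtypes (the toy values are made up). Introducing a missing value into an integer Series changes it to float64, and doing the same to a boolean Series changes it to object, because NaN is the only missing-value marker those dtypes can fall back on.

```python
import pandas as pd

# Toy data, made up for illustration.
ints = pd.Series([1, 2, 3])
flags = pd.Series([True, False, True])

# Reindexing to a label that does not exist introduces a missing value.
ints_with_na = ints.reindex([0, 1, 2, 3])
flags_with_na = flags.reindex([0, 1, 2, 3])

# With the default dtypes the integer column is upcast to float64
# and the boolean column to object.
print(ints.dtype, "->", ints_with_na.dtype)
print(flags.dtype, "->", flags_with_na.dtype)
```
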
#5 RAM management
• Difficult to understand the memory footprint of a pandas object
• Ample data copying throughout library
• Would benefit from being able to compress data in-memory or shuttle data temporarily to disk

#6 Weak support for categorical data
• Makes pandas not quite a fully fledged R replacement
• GroupBy and Joins slower than they could be

#7 Complex GroupBy operations get messy
• Must write custom functions to pass to .apply(..)
• Easy to run up against DRY problems and general Python syntax limitations

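As an illustration of both bullets, here is a sketch on made-up data (the column names and the `share_above` helper are hypothetical). A per-group metric with no built-in aggregate needs a custom function, and asking for the same metric at two thresholds means repeating the call and stitching the results back together by hand.

```python
import pandas as pd

# Made-up toy data; column names are illustrative only.
df = pd.DataFrame({
    "cand": ["A", "A", "A", "B", "B"],
    "amt": [50.0, 200.0, 250.0, 100.0, 300.0],
})

def share_above(amounts, threshold):
    """Fraction of a group's total that comes from amounts above threshold."""
    return amounts[amounts > threshold].sum() / amounts.sum()

grouped = df.groupby("cand")["amt"]

# The same custom function is applied twice, once per threshold.
share_100 = grouped.apply(share_above, threshold=100)
share_250 = grouped.apply(share_above, threshold=250)

summary = pd.DataFrame({"share_100": share_100, "share_250": share_250})
print(summary)
```
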
#8 Appending data slow and tedious
• DataFrame not intended as a database table
• Makes streaming data use a challenge
• B+ tree tables interesting?

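A common workaround, sketched here with made-up batches (the `incoming_batches` generator is a hypothetical stand-in for a stream). Because a DataFrame is not built to grow in place, adding rows one batch at a time ends up copying the existing data again and again. Buffering the pieces in a plain Python list and calling `pandas.concat` once avoids most of that.

```python
import pandas as pd

def incoming_batches():
    """Stand-in for a stream: yields small DataFrames of made-up data."""
    for start in range(0, 6, 2):
        yield pd.DataFrame({
            "id": [start, start + 1],
            "value": [start * 10.0, (start + 1) * 10.0],
        })

# Collect the pieces first, then build the table in a single concatenation.
pieces = list(incoming_batches())
table = pd.concat(pieces, ignore_index=True)
print(table)
```

This helps for batch loads, but it does not turn a DataFrame into an appendable table, which is the limitation the slide describes.
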
#9 Limited type system, column metadata
• Currencies, units
• Time zones
• Geographic data
• Composite data types

#10 No true query processing layer
• Filter        WHERE, HAVING
• Group         GROUP BY
• Join          JOIN
• Aggregate     SUM, MEAN, ...
• Limit/TopK    LIMIT
• Sorting       ORDER BY

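Without a query layer, each relational step is written by hand as a separate, eagerly evaluated pandas call. The sketch below walks through the operations listed above on made-up tables (all table and column names are hypothetical), with the SQL counterpart noted next to each step.

```python
import pandas as pd

# Made-up toy tables; names are hypothetical.
contribs = pd.DataFrame({
    "cand_id": [1, 1, 2, 2, 3],
    "amt": [100.0, -20.0, 300.0, 50.0, 75.0],
})
cands = pd.DataFrame({"cand_id": [1, 2, 3], "name": ["A", "B", "C"]})

filtered = contribs[contribs["amt"] > 0]             # WHERE
joined = filtered.merge(cands, on="cand_id")         # JOIN
totals = joined.groupby("name")["amt"].sum()         # GROUP BY + SUM
big = totals[totals > 80]                            # HAVING
top2 = big.sort_values(ascending=False).head(2)      # ORDER BY + LIMIT
print(top2)
```

Every intermediate result here is fully computed before the next step runs, so nothing can reorder or fuse the steps the way a query planner would.
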
#11 “Slow”: no multicore / distributed algos
• Hampered by use of Python data structures / GIL interactions
• Object internals not designed for concurrent use

Oh no what do we do

Stop believing in the “one tool to rule them all”

“Real Artists Ship” - Steve Jobs

Focus on results
• I am heavily biased by a focus on business analytics/BI use cases
• Need production-ready software to ship in a relatively short time frame

A new project
• In internal development at DataPad
• Code named “badger”
• pandas-ish syntax: designed for data processing and analytical queries

Badger in a nutshell
• Compressed columnar binary storage
• High perf analytical query processor
• Data preparation/cleaning tools
• Consistent data type system
• Immutable array data, little copying
• Analytics kernels: written in C with no dependencies
• Time series analytics
• Caching of useful intermediates

Some benchmarks
• Data set: 2012 Election data (FEC)
  • 5.3 million records, 7 columns
• Tools
  • pandas
  • badger
  • R: data.table
  • SQL: PostgreSQL, SQLite

Query 1
• Total contributions by candidate

SELECT cand_nm,
       sum(contb_receipt_amt) AS total
FROM fec
GROUP BY cand_nm

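To try this query without the FEC file, here is a self-contained sketch that runs it against an in-memory SQLite database through Python's standard `sqlite3` module. The table and column names come from the slide; the four rows and the candidate labels are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fec (cand_nm TEXT, contb_receipt_amt REAL)")
conn.executemany(
    "INSERT INTO fec VALUES (?, ?)",
    [("A", 100.0), ("A", 250.0), ("B", 50.0), ("B", 25.0)],  # made-up rows
)

query = """
    SELECT cand_nm,
           sum(contb_receipt_amt) AS total
    FROM fec
    GROUP BY cand_nm
"""
# Each result row is a (cand_nm, total) pair, so dict() gives name -> total.
totals = dict(conn.execute(query).fetchall())
conn.close()
print(totals)
```
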
Query 1
• Total contributions by candidate

badger (in-memory)   :   19ms  (1x)
badger (from-disk)   :  131ms  (6.9x)
pandas (in-memory)   :  273ms  (14.3x)
R data.table 1.8.10  :  382ms  (20x)
PostgreSQL           :   4.7s  (247x)
SQLite               :    72s  (3800x)

Query 2
• Total contributions by candidate and state

SELECT cand_nm, contbr_st,
       sum(contb_receipt_amt) AS total
FROM fec
GROUP BY cand_nm, contbr_st

Query 2
• Total contributions by candidate and state

badger (in-memory)   :  269ms  (1x)
badger (from-disk)   :  391ms  (1.5x)
R data.table 1.8.10  :  500ms  (1.8x)
pandas (in-memory)   :  770ms  (2.9x)
PostgreSQL           :  5.96s  (23x)

Query 3
• Total contributions by candidate and state with 2 filter predicates

SELECT cand_nm,
       sum(contb_receipt_amt) AS total
FROM fec
WHERE contb_receipt_dt BETWEEN
        '2012-05-01' AND '2012-11-05'
  AND contb_receipt_amt BETWEEN
        0 AND 2500
GROUP BY cand_nm

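One way to express this query in pandas, sketched on made-up rows. The column names are the ones in the SQL, and this is not necessarily the code that was timed in the benchmark. `Series.between` includes both endpoints by default, which matches SQL `BETWEEN`.

```python
import pandas as pd

# Made-up rows using the column names from the query.
fec = pd.DataFrame({
    "cand_nm": ["A", "A", "B", "B", "B"],
    "contb_receipt_dt": pd.to_datetime(
        ["2012-04-30", "2012-06-01", "2012-05-01", "2012-11-05", "2012-12-01"]
    ),
    "contb_receipt_amt": [100.0, 2500.0, 50.0, 3000.0, 10.0],
})

# WHERE ... BETWEEN ... AND ...: both predicates are inclusive ranges.
in_dates = fec["contb_receipt_dt"].between(
    pd.Timestamp("2012-05-01"), pd.Timestamp("2012-11-05")
)
in_amounts = fec["contb_receipt_amt"].between(0, 2500)

# GROUP BY cand_nm with sum(contb_receipt_amt) AS total.
total = (
    fec[in_dates & in_amounts]
    .groupby("cand_nm")["contb_receipt_amt"]
    .sum()
    .rename("total")
)
print(total)
```

Note that the two boolean masks and the filtered copy are all built in memory before the aggregation starts.
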
Query 3
• Total contributions by candidate and state with 2 filter predicates

badger (in-memory)   :   96ms  (1x)
badger (from-disk)   :  275ms  (2.9x)
pandas (in-memory)   :  946ms  (9.8x)
PostgreSQL           :   6.2s  (65x)

Badger, the future
• Distributed in-memory analytics
• Multicore algorithms
• ETL job-building tools
• Open source in some form someday
• Looking for algorithms hackers to help

Thank you!
