SlideShare a Scribd company logo
Data Science at Scale:
Using Apache Spark for Data Science
at Bitly
Sarah Guido
Data Day Seattle 2015
Overview
• About me/Bitly
• Spark overview
• Using Spark for data science
• When it works, it’s great! When it works…
About me
• Data scientist at Bitly
• NYC Python/PyGotham co-organizer
• O’Reilly Media author
• @sarah_guido
About this talk
• This talk is:
– Description of my workflow
– Exploration of within-Spark tools
• This talk is not:
– In-depth exploration of algorithms
– Building new tools on top of Spark
– Any sort of ground truth for how you should be
using Spark
A bit of background
• Need for big data analysis tools
• MapReduce for exploratory data analysis == 
• Iterate/prototype quickly
• Overall goal: understand how people use not
only our app, but the Internet!
Bitly data!
• Legit big data
• 1 hour of decodes is 10 GB
• 1 day is 240 GB
• 1 month is ~7 TB
What is Spark?
• Large-scale distributed data processing tool
• SQL and streaming tools
• Faster than Hadoop
• Python API
How does Spark work?
• Partitions your data to operate over in parallel
– A partition by default is 64 MB
• Capability to add map/reduce features
• Lazy – only operates when method is called
– Ex. collect() or writing to a file
Why Spark?
• Fast. Really fast.
• SQL layer – kind of like Hive
• Distributed scientific tools
• Python! Sometimes.
• Cutting edge technology
Setting up the workflow
• Spark journey
– Hadoop server: 1.2
– EMR: 1.3
– EMR: 1.4
How do I use it?
• EMR!
• spark-submit on the cluster
• Can add script as a step to cluster launch
Creating a cluster
• aws emr create-cluster
• --bootstrap-action
• --steps
• --auto-terminate
Creating a cluster
• LIVE!!
Let’s set the stage…
• Understanding user behavior
• How do I extract, explore, and model a subset
of our data using Spark?
Data
{"a": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2)
AppleWebKit/600.4.10 (KHTML, like Gecko) Version/8.0.4
Safari/600.4.10",
"c": "US",
"nk": 0,
"tz": "America/Los_Angeles",
"g": "1HfTjh8",
"h": "1HfTjh7",
"u": "http://www.nytimes.com/2015/03/22/opinion/sunday/why-
health-care-tech-is-still-so-bad.html?smid=tw-share",
"t": 1427288425,
"cy": "Seattle"}
Data processing
• Problem: I want to retrieve NYT decodes
• Solution: well, there are two…
Data processing
Data processing
Data processing
• SparkSQL: 8 minutes
• Pure Spark: 4 minutes!!!
Data processing
Data processing
• Yes, we’re going to do a live demo of this!
Exploratory data analysis
• Problem: what’s going on with my decodes?
• Solution: DataFrames!
– Similar to Pandas: describe, drop, fill, aggregate
functions
– You can actually convert to a Pandas DataFrame!
Exploratory data analysis
• Get a sense of what’s going on in the data
• Look at distributions, frequencies
• Mostly categorical data here
Exploratory data analysis
• Yet another live demo
Topic modeling
• Problem: we have so many links but no way to
classify them into certain kinds of content
• Solution: LDA (latent Dirichlet allocation)
– Sort of – compare to other solutions
Topic modeling
• Oh, the JVM…
– LDA only in Scala
• Scala jar file
• Store script in S3
Topic modeling
• LDA in Spark
– Generative model
– Several different methods
– Term frequency vector as input
• “Note: LDA is a new feature with some missing
functionality...”
Topic modeling
Topic modeling
• Term frequency vector
TERM
DOCUMENT
python data hot dogs baseball zoo
doc_1 1 3 0 0 0
doc_2 0 0 4 1 0
doc_3 4 0 0 0 5
Topic modeling
Topic modeling
Topic modeling
• Why not??
– Means to an end
– Current large scale scraping inability
Architecture
• Right now: not in production
– Buy-in
• Streaming applications for parts of the app
• Python or Scala?
– Scala by force (LDA, GraphX)
Some issues
• Hadoop servers
• JVM
• gzip
• 1.4
• Resource allocation
• Really only got it to this stage very recently
Where to go next?
• Spark in production!
• Use for various parts of our app
• Use for R&D and prototyping purposes, with
the potential to expand into the product
Current/future projects
• Trend detection
• Device prediction
• User affinities
– GraphX!
• A/B testing
Resources
• spark.apache.org - documentation
• Databricks blog
• Cloudera blog
Thanks!!
@sarah_guido
Ad

Recommended

Spark: The Good, the Bad, and the Ugly
Spark: The Good, the Bad, and the Ugly
Sarah Guido
 
Data Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah Guido
Bitly
 
Analyzing Data With Python
Analyzing Data With Python
Sarah Guido
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
Amanda Casari
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
Big Data Spain
 
PyCon Singapore 2013 Keynote
PyCon Singapore 2013 Keynote
Wes McKinney
 
Cascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User Group
nathanmarz
 
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Databricks
 
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Jeff Magnusson
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
Krishna Sankar
 
Solr for Data Science
Solr for Data Science
Grant Ingersoll
 
Pandas/Data Analysis at Baypiggies
Pandas/Data Analysis at Baypiggies
Andy Hayden
 
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Databricks
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
MLconf
 
sparklyr - Jeff Allen
sparklyr - Jeff Allen
Sri Ambati
 
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Databricks
 
Hadoop for Data Science
Hadoop for Data Science
Donald Miner
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
From R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep Gill
Databricks
 
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Uwe Korn
 
Indexing big data in the cloud
Indexing big data in the cloud
OpenSource Connections
 
The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Mo...
The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Mo...
Spark Summit
 
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Thamme Gowda
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean Downes
Databricks
 
Congressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4j
William Lyon
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
Prasad Wagle
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Wes McKinney
 
Converging Big Data and Application Infrastructure by Steven Poutsy
Converging Big Data and Application Infrastructure by Steven Poutsy
Big Data Spain
 
Spark meetup v2.0.5
Spark meetup v2.0.5
Yan Zhou
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
Hakka Labs
 

More Related Content

What's hot (20)

Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Jeff Magnusson
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
Krishna Sankar
 
Solr for Data Science
Solr for Data Science
Grant Ingersoll
 
Pandas/Data Analysis at Baypiggies
Pandas/Data Analysis at Baypiggies
Andy Hayden
 
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Databricks
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
MLconf
 
sparklyr - Jeff Allen
sparklyr - Jeff Allen
Sri Ambati
 
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Databricks
 
Hadoop for Data Science
Hadoop for Data Science
Donald Miner
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
From R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep Gill
Databricks
 
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Uwe Korn
 
Indexing big data in the cloud
Indexing big data in the cloud
OpenSource Connections
 
The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Mo...
The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Mo...
Spark Summit
 
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Thamme Gowda
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean Downes
Databricks
 
Congressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4j
William Lyon
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
Prasad Wagle
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Wes McKinney
 
Converging Big Data and Application Infrastructure by Steven Poutsy
Converging Big Data and Application Infrastructure by Steven Poutsy
Big Data Spain
 
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Jeff Magnusson
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
Krishna Sankar
 
Pandas/Data Analysis at Baypiggies
Pandas/Data Analysis at Baypiggies
Andy Hayden
 
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Databricks
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
MLconf
 
sparklyr - Jeff Allen
sparklyr - Jeff Allen
Sri Ambati
 
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Databricks
 
Hadoop for Data Science
Hadoop for Data Science
Donald Miner
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
From R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep Gill
Databricks
 
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Uwe Korn
 
The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Mo...
The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Mo...
Spark Summit
 
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Thamme Gowda
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean Downes
Databricks
 
Congressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4j
William Lyon
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
Prasad Wagle
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Wes McKinney
 
Converging Big Data and Application Infrastructure by Steven Poutsy
Converging Big Data and Application Infrastructure by Steven Poutsy
Big Data Spain
 

Viewers also liked (20)

Spark meetup v2.0.5
Spark meetup v2.0.5
Yan Zhou
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
Hakka Labs
 
Spark Sql for Training
Spark Sql for Training
Bryan Yang
 
SparkSQL et Cassandra - Tool In Action Devoxx 2015
SparkSQL et Cassandra - Tool In Action Devoxx 2015
Alexander DEJANOVSKI
 
The SparkSQL things you maybe confuse
The SparkSQL things you maybe confuse
vito jeng
 
Getting started with SparkSQL - Desert Code Camp 2016
Getting started with SparkSQL - Desert Code Camp 2016
clairvoyantllc
 
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
vithakur
 
HBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the Art
Michael Stack
 
The DAP - Where YARN, HBase, Kafka and Spark go to Production
The DAP - Where YARN, HBase, Kafka and Spark go to Production
DataWorks Summit/Hadoop Summit
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
Miklos Christine
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
wqchen
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_features
AiougVizagChapter
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_Features
Alfredo Abate
 
Oracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA Technologies
NAYATech
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Spark Summit
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache HBase
Carol McDonald
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
Bryan Yang
 
Spark etl
Spark etl
Imran Rashid
 
Apache HBase Internals you hoped you Never Needed to Understand
Apache HBase Internals you hoped you Never Needed to Understand
Josh Elser
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data Benchmarking
Kristian Alexander
 
Spark meetup v2.0.5
Spark meetup v2.0.5
Yan Zhou
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
Hakka Labs
 
Spark Sql for Training
Spark Sql for Training
Bryan Yang
 
SparkSQL et Cassandra - Tool In Action Devoxx 2015
SparkSQL et Cassandra - Tool In Action Devoxx 2015
Alexander DEJANOVSKI
 
The SparkSQL things you maybe confuse
The SparkSQL things you maybe confuse
vito jeng
 
Getting started with SparkSQL - Desert Code Camp 2016
Getting started with SparkSQL - Desert Code Camp 2016
clairvoyantllc
 
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
vithakur
 
HBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the Art
Michael Stack
 
The DAP - Where YARN, HBase, Kafka and Spark go to Production
The DAP - Where YARN, HBase, Kafka and Spark go to Production
DataWorks Summit/Hadoop Summit
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
Miklos Christine
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
wqchen
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_features
AiougVizagChapter
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_Features
Alfredo Abate
 
Oracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA Technologies
NAYATech
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Spark Summit
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache HBase
Carol McDonald
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
Bryan Yang
 
Apache HBase Internals you hoped you Never Needed to Understand
Apache HBase Internals you hoped you Never Needed to Understand
Josh Elser
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data Benchmarking
Kristian Alexander
 
Ad

Similar to Data Science at Scale: Using Apache Spark for Data Science at Bitly (20)

Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
Spark Summit
 
Apache Spark in Industry
Apache Spark in Industry
Dorian Beganovic
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Jen Aman
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
Databricks
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
Jen Aman
 
Spark SQL
Spark SQL
Caserta
 
Apache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
mahchiev
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
RUHULAMINHAZARIKA
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
Adaryl "Bob" Wakefield, MBA
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer practice of apache io tdb
jixuan1989
 
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Hyderabad Scalability Meetup
 
Dec6 meetup spark presentation
Dec6 meetup spark presentation
Ramesh Mudunuri
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empower
Durga Gadiraju
 
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Domino Data Lab
 
Hadoop Ecosystem
Hadoop Ecosystem
Lior Sidi
 
Hadoop world overview trends and topics
Hadoop world overview trends and topics
Valentin Kropov
 
Spark Uber Development Kit
Spark Uber Development Kit
DataWorks Summit/Hadoop Summit
 
Machine Learning at Scale
Machine Learning at Scale
Madhukara Phatak
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
Spark Summit
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Jen Aman
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
Databricks
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
Jen Aman
 
Apache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
mahchiev
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
RUHULAMINHAZARIKA
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer practice of apache io tdb
jixuan1989
 
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Hyderabad Scalability Meetup
 
Dec6 meetup spark presentation
Dec6 meetup spark presentation
Ramesh Mudunuri
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empower
Durga Gadiraju
 
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Domino Data Lab
 
Hadoop Ecosystem
Hadoop Ecosystem
Lior Sidi
 
Hadoop world overview trends and topics
Hadoop world overview trends and topics
Valentin Kropov
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies
 
Ad

More from Sarah Guido (7)

Data Science Retrospective
Data Science Retrospective
Sarah Guido
 
The Wild West of Data Wrangling (PyTN)
The Wild West of Data Wrangling (PyTN)
Sarah Guido
 
The Wild West of Data Wrangling
The Wild West of Data Wrangling
Sarah Guido
 
The Importance of Community
The Importance of Community
Sarah Guido
 
Network theory - PyCon 2015
Network theory - PyCon 2015
Sarah Guido
 
K-means Clustering with Scikit-Learn
K-means Clustering with Scikit-Learn
Sarah Guido
 
A Beginner's Guide to Machine Learning with Scikit-Learn
A Beginner's Guide to Machine Learning with Scikit-Learn
Sarah Guido
 
Data Science Retrospective
Data Science Retrospective
Sarah Guido
 
The Wild West of Data Wrangling (PyTN)
The Wild West of Data Wrangling (PyTN)
Sarah Guido
 
The Wild West of Data Wrangling
The Wild West of Data Wrangling
Sarah Guido
 
The Importance of Community
The Importance of Community
Sarah Guido
 
Network theory - PyCon 2015
Network theory - PyCon 2015
Sarah Guido
 
K-means Clustering with Scikit-Learn
K-means Clustering with Scikit-Learn
Sarah Guido
 
A Beginner's Guide to Machine Learning with Scikit-Learn
A Beginner's Guide to Machine Learning with Scikit-Learn
Sarah Guido
 

Recently uploaded (20)

You are not excused! How to avoid security blind spots on the way to production
You are not excused! How to avoid security blind spots on the way to production
Michele Leroux Bustamante
 
Lessons Learned from Developing Secure AI Workflows.pdf
Lessons Learned from Developing Secure AI Workflows.pdf
Priyanka Aash
 
PyCon SG 25 - Firecracker Made Easy with Python.pdf
PyCon SG 25 - Firecracker Made Easy with Python.pdf
Muhammad Yuga Nugraha
 
The Future of Product Management in AI ERA.pdf
The Future of Product Management in AI ERA.pdf
Alyona Owens
 
Salesforce Summer '25 Release Frenchgathering.pptx.pdf
Salesforce Summer '25 Release Frenchgathering.pptx.pdf
yosra Saidani
 
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Priyanka Aash
 
OpenACC and Open Hackathons Monthly Highlights June 2025
OpenACC and Open Hackathons Monthly Highlights June 2025
OpenACC
 
Daily Lesson Log MATATAG ICT TEchnology 8
Daily Lesson Log MATATAG ICT TEchnology 8
LOIDAALMAZAN3
 
Cyber Defense Matrix Workshop - RSA Conference
Cyber Defense Matrix Workshop - RSA Conference
Priyanka Aash
 
"How to survive Black Friday: preparing e-commerce for a peak season", Yurii ...
"How to survive Black Friday: preparing e-commerce for a peak season", Yurii ...
Fwdays
 
Curietech AI in action - Accelerate MuleSoft development
Curietech AI in action - Accelerate MuleSoft development
shyamraj55
 
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
Earley Information Science
 
Enhance GitHub Copilot using MCP - Enterprise version.pdf
Enhance GitHub Copilot using MCP - Enterprise version.pdf
Nilesh Gule
 
From Manual to Auto Searching- FME in the Driver's Seat
From Manual to Auto Searching- FME in the Driver's Seat
Safe Software
 
Using the SQLExecutor for Data Quality Management: aka One man's love for the...
Using the SQLExecutor for Data Quality Management: aka One man's love for the...
Safe Software
 
cnc-processing-centers-centateq-p-110-en.pdf
cnc-processing-centers-centateq-p-110-en.pdf
AmirStern2
 
Tech-ASan: Two-stage check for Address Sanitizer - Yixuan Cao.pdf
Tech-ASan: Two-stage check for Address Sanitizer - Yixuan Cao.pdf
caoyixuan2019
 
9-1-1 Addressing: End-to-End Automation Using FME
9-1-1 Addressing: End-to-End Automation Using FME
Safe Software
 
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
Edge AI and Vision Alliance
 
Raman Bhaumik - Passionate Tech Enthusiast
Raman Bhaumik - Passionate Tech Enthusiast
Raman Bhaumik
 
You are not excused! How to avoid security blind spots on the way to production
You are not excused! How to avoid security blind spots on the way to production
Michele Leroux Bustamante
 
Lessons Learned from Developing Secure AI Workflows.pdf
Lessons Learned from Developing Secure AI Workflows.pdf
Priyanka Aash
 
PyCon SG 25 - Firecracker Made Easy with Python.pdf
PyCon SG 25 - Firecracker Made Easy with Python.pdf
Muhammad Yuga Nugraha
 
The Future of Product Management in AI ERA.pdf
The Future of Product Management in AI ERA.pdf
Alyona Owens
 
Salesforce Summer '25 Release Frenchgathering.pptx.pdf
Salesforce Summer '25 Release Frenchgathering.pptx.pdf
yosra Saidani
 
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Priyanka Aash
 
OpenACC and Open Hackathons Monthly Highlights June 2025
OpenACC and Open Hackathons Monthly Highlights June 2025
OpenACC
 
Daily Lesson Log MATATAG ICT TEchnology 8
Daily Lesson Log MATATAG ICT TEchnology 8
LOIDAALMAZAN3
 
Cyber Defense Matrix Workshop - RSA Conference
Cyber Defense Matrix Workshop - RSA Conference
Priyanka Aash
 
"How to survive Black Friday: preparing e-commerce for a peak season", Yurii ...
"How to survive Black Friday: preparing e-commerce for a peak season", Yurii ...
Fwdays
 
Curietech AI in action - Accelerate MuleSoft development
Curietech AI in action - Accelerate MuleSoft development
shyamraj55
 
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
Earley Information Science
 
Enhance GitHub Copilot using MCP - Enterprise version.pdf
Enhance GitHub Copilot using MCP - Enterprise version.pdf
Nilesh Gule
 
From Manual to Auto Searching- FME in the Driver's Seat
From Manual to Auto Searching- FME in the Driver's Seat
Safe Software
 
Using the SQLExecutor for Data Quality Management: aka One man's love for the...
Using the SQLExecutor for Data Quality Management: aka One man's love for the...
Safe Software
 
cnc-processing-centers-centateq-p-110-en.pdf
cnc-processing-centers-centateq-p-110-en.pdf
AmirStern2
 
Tech-ASan: Two-stage check for Address Sanitizer - Yixuan Cao.pdf
Tech-ASan: Two-stage check for Address Sanitizer - Yixuan Cao.pdf
caoyixuan2019
 
9-1-1 Addressing: End-to-End Automation Using FME
9-1-1 Addressing: End-to-End Automation Using FME
Safe Software
 
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
Edge AI and Vision Alliance
 
Raman Bhaumik - Passionate Tech Enthusiast
Raman Bhaumik - Passionate Tech Enthusiast
Raman Bhaumik
 

Data Science at Scale: Using Apache Spark for Data Science at Bitly

  • 1. Data Science at Scale: Using Apache Spark for Data Science at Bitly Sarah Guido Data Day Seattle 2015
  • 2. Overview • About me/Bitly • Spark overview • Using Spark for data science • When it works, it’s great! When it works…
  • 3. About me • Data scientist at Bitly • NYC Python/PyGotham co-organizer • O’Reilly Media author • @sarah_guido
  • 4. About this talk • This talk is: – Description of my workflow – Exploration of within-Spark tools • This talk is not: – In-depth exploration of algorithms – Building new tools on top of Spark – Any sort of ground truth for how you should be using Spark
  • 5. A bit of background • Need for big data analysis tools • MapReduce for exploratory data analysis ==  • Iterate/prototype quickly • Overall goal: understand how people use not only our app, but the Internet!
  • 6. Bitly data! • Legit big data • 1 hour of decodes is 10 GB • 1 day is 240 GB • 1 month is ~7 TB
  • 7. What is Spark? • Large-scale distributed data processing tool • SQL and streaming tools • Faster than Hadoop • Python API
  • 8. How does Spark work? • Partitions your data to operate over in parallel – A partition by default is 64 MB • Capability to add map/reduce features • Lazy – only operates when method is called – Ex. collect() or writing to a file
  • 9. Why Spark? • Fast. Really fast. • SQL layer – kind of like Hive • Distributed scientific tools • Python! Sometimes. • Cutting edge technology
  • 10. Setting up the workflow • Spark journey – Hadoop server: 1.2 – EMR: 1.3 – EMR: 1.4
  • 11. How do I use it? • EMR! • spark-submit on the cluster • Can add script as a step to cluster launch
  • 12. Creating a cluster • aws emr create-cluster • --bootstrap-action • --steps • --auto-terminate
  • 14. Let’s set the stage… • Understanding user behavior • How do I extract, explore, and model a subset of our data using Spark?
  • 15. Data {"a": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/600.4.10 (KHTML, like Gecko) Version/8.0.4 Safari/600.4.10", "c": "US", "nk": 0, "tz": "America/Los_Angeles", "g": "1HfTjh8", "h": "1HfTjh7", "u": "http://www.nytimes.com/2015/03/22/opinion/sunday/why- health-care-tech-is-still-so-bad.html?smid=tw-share", "t": 1427288425, "cy": "Seattle"}
  • 16. Data processing • Problem: I want to retrieve NYT decodes • Solution: well, there are two…
  • 19. Data processing • SparkSQL: 8 minutes • Pure Spark: 4 minutes!!!
  • 21. Data processing • Yes, we’re going to do a live demo of this!
  • 22. Exploratory data analysis • Problem: what’s going on with my decodes? • Solution: DataFrames! – Similar to Pandas: describe, drop, fill, aggregate functions – You can actually convert to a Pandas DataFrame!
  • 23. Exploratory data analysis • Get a sense of what’s going on in the data • Look at distributions, frequencies • Mostly categorical data here
  • 24. Exploratory data analysis • Yet another live demo
  • 25. Topic modeling • Problem: we have so many links but no way to classify them into certain kinds of content • Solution: LDA (latent Dirichlet allocation) – Sort of – compare to other solutions
  • 26. Topic modeling • Oh, the JVM… – LDA only in Scala • Scala jar file • Store script in S3
  • 27. Topic modeling • LDA in Spark – Generative model – Several different methods – Term frequency vector as input • “Note: LDA is a new feature with some missing functionality...”
  • 29. Topic modeling • Term frequency vector TERM DOCUMENT python data hot dogs baseball zoo doc_1 1 3 0 0 0 doc_2 0 0 4 1 0 doc_3 4 0 0 0 5
  • 32. Topic modeling • Why not?? – Means to an end – Current large scale scraping inability
  • 33. Architecture • Right now: not in production – Buy-in • Streaming applications for parts of the app • Python or Scala? – Scala by force (LDA, GraphX)
  • 34. Some issues • Hadoop servers • JVM • gzip • 1.4 • Resource allocation • Really only got it to this stage very recently
  • 35. Where to go next? • Spark in production! • Use for various parts of our app • Use for R&D and prototyping purposes, with the potential to expand into the product
  • 36. Current/future projects • Trend detection • Device prediction • User affinities – GraphX! • A/B testing
  • 37. Resources • spark.apache.org - documentation • Databricks blog • Cloudera blog