Kelvin Chu, Hadoop Platform, Uber
Gang Wu, Hadoop Platform, Uber
Spark Uber Development Kit
Spark Summit 2016
June 07, 2016
“Transportation as reliable as
running water, everywhere,
for everyone”
Uber Mission
About Us
● Hadoop team of Data Infrastructure at Uber
● Schema systems
● HDFS data lake
● Analytics engines on Hadoop
● Spark computing framework and toolings
Execution Environment
Complexity
Cluster Sizes
20 times
YARN Mesos
Docker JVM
Parquet ORC
Sequence Text
Home Built Services
Hive Kafka ELK
Consequence:
Pretty hard for beginners, sometimes hard for experienced users too.
Goals:
Multi-Platform: Abstract out environment
Self-Service: Create and run Spark jobs super easily
Reliability: Prevent harm to infrastructure systems
Engineers SRE
API
Tools
Easy
Self-Service
Multi-Platform
No Harm
Reliability
• SCBuilder
• Kafka dispersal
• SparkPlug
SCBuilder
Encapsulate cluster environment details
● Builder Pattern for SparkContext
● Incentive for users:
○ performance optimized (default can’t pass 100GB)
○ debug optimized (history server, event logs)
○ don’t need to ask around YARN, history servers, HDFS configs
● Best practices enforcement:
○ SRE approved CPU and memory settings
○ resource efficient serialization
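A minimal sketch of the builder idea above, in plain Python. The class name `SCBuilder` comes from the slides, but every method name, default value, and limit here is an assumption made for illustration; only the property keys (`spark.serializer`, `spark.eventLog.enabled`, `spark.eventLog.dir`, `spark.executor.memory`, `spark.executor.cores`, `spark.app.name`) are standard Spark settings.

```python
class SCBuilder:
    """Hypothetical builder that collects Spark settings behind safe defaults."""

    # Assumed platform defaults; keys are real Spark properties,
    # values are illustrative only.
    _DEFAULTS = {
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.eventLog.enabled": "true",
        "spark.executor.memory": "4g",
        "spark.executor.cores": "2",
    }
    # Assumed upper bounds standing in for "SRE approved" settings.
    _MAX_EXECUTOR_MEMORY_GB = 16
    _MAX_EXECUTOR_CORES = 4

    def __init__(self, app_name):
        self._conf = dict(self._DEFAULTS)
        self._conf["spark.app.name"] = app_name

    def executor_memory_gb(self, gb):
        # Reject values outside the approved range instead of passing them on.
        if not 1 <= gb <= self._MAX_EXECUTOR_MEMORY_GB:
            raise ValueError(f"executor memory must be 1..{self._MAX_EXECUTOR_MEMORY_GB} GB")
        self._conf["spark.executor.memory"] = f"{gb}g"
        return self

    def executor_cores(self, cores):
        if not 1 <= cores <= self._MAX_EXECUTOR_CORES:
            raise ValueError(f"executor cores must be 1..{self._MAX_EXECUTOR_CORES}")
        self._conf["spark.executor.cores"] = str(cores)
        return self

    def event_log_dir(self, path):
        # Where event logs go so the history server can find them.
        self._conf["spark.eventLog.dir"] = path
        return self

    def build_conf(self):
        # Return a copy so callers cannot mutate the builder's state.
        return dict(self._conf)


conf = SCBuilder("trip-report").executor_memory_gb(8).executor_cores(2).build_conf()
```

The chained calls return the builder itself, so users only state what differs from the defaults. With PySpark available, the resulting pairs could presumably be passed to `SparkConf().setAll(conf.items())` before creating the context.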
Kafka Dispersal
Kafka as data sink of RDD result
● Incentive for users:
○ RDD as first class citizen => parallelization
○ built-in HA
● Best practices enforcement:
○ rate limiting
○ message integrity by schema
○ bad messages tracking
publish(data: RDD, topic: String, schemaId: Int, appId: String)
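The practices listed above (rate limiting, schema checks, bad message tracking) can be sketched for a single partition of records in plain Python. Everything here is an assumption for illustration: the function name `publish_partition`, the `send` callback standing in for a Kafka producer, and the "required fields" check standing in for real schema validation by `schemaId`.

```python
import time


def publish_partition(records, topic, required_fields, send,
                      max_per_second=1000, sleep=time.sleep, clock=time.monotonic):
    """Send one partition's records with throttling; return (sent_count, bad_records)."""
    sent = 0
    bad = []
    window_start = clock()
    in_window = 0
    for record in records:
        # Message integrity: stand-in check that every schema field is present.
        if not isinstance(record, dict) or not required_fields.issubset(record):
            bad.append(record)  # track bad messages instead of failing the job
            continue
        # Rate limiting: after max_per_second sends, wait out the rest of the second.
        if in_window >= max_per_second:
            elapsed = clock() - window_start
            if elapsed < 1.0:
                sleep(1.0 - elapsed)
            window_start = clock()
            in_window = 0
        send(topic, record)
        sent += 1
        in_window += 1
    return sent, bad


outbox = []
sent, bad = publish_partition(
    [{"id": 1, "city": "sf"}, {"id": 2}, "junk"],
    "trips",
    {"id", "city"},
    lambda topic, record: outbox.append((topic, record)),
)
```

In a Spark job, a function like this would typically be applied to each partition (for example via `RDD.foreachPartition`), which is what lets the publish step run in parallel across executors.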
SparkPlug
Kickstart job development
● A collection of popular job templates
○ Two commands to run the first job in Dev
● One use case per template
○ e.g. Oozie + SparkSQL + Incremental processing
○ e.g. Incremental processing + Kafka dispersal
● Best Practices
○ built-in unit tests, test coverage, Jenkins
○ built-in Kafka, HDFS mocks
Example Spark Application Data Flow Chart
[Figure: data flow chart. Recoverable labels: Topic 1 to Topic 4, HIVE (Shared DB, Custom DB), App, Web UI, Query UI, ELK Search, Dashboards, Report.]
• Geo-spatial processing
• SCBuilder
• Kafka dispersal
• SparkPlug
Commonly used UDFs
GeoSpatial UDF
within(trip_location, city_shape): Find if a car is inside a city
contains(geofence, auto_location): Find all autos in one area
overlaps(trip1, trip2): Find trips that have similar routes
intersects(trip_location, gas_location): Find all gas stations a trip route has passed by
Objective: associate all trips with city_id for a single day.
SELECT trip.trip_id, city.city_id
FROM trip JOIN city
WHERE contains(city.city_shape, trip.start_location)
AND trip.datestr = '2016-06-07'
Spatial Join
Common query at Uber
Spatial Join
Problem
It takes nearly ONE WEEK to run at Uber’s data scale.
1. Spark does not have broadcast join optimization for non-equi joins.
2. Not scalable, only one executor is used for cartesian join.
Spatial Join
Build a UDF to broadcast geo-spatial index
Spatial Join
Runtime Index Generation
Index data is small but changes often (city table)
Get fields from geo tables (city_id and city_shape)
Build QuadTree or RTree index at Spark Driver
1. Build Index
Spatial Join
Executor Execution
UDF code is part of the Spark UDK jar.
⇒ get_city_id(location), returns city_id of a location
Use the broadcasted spatial index for fast spatial retrieval
2. Broadcast Index
Spatial Join
Runtime UDF Generation
SELECT
trip_id, get_city_id(start_location)
FROM
trip
WHERE
datestr = '2016-06-07'
3. Rewrite Query (2 mins only! compared to 1 week before)
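The three steps above (build an index, broadcast it, look up each location) can be illustrated with a small plain-Python sketch. Note the substitutions: the slides name a QuadTree or RTree built on the Spark driver, while this sketch uses a simpler uniform grid over polygon bounding boxes followed by an exact ray-casting check. The class `GridIndex`, its `cell_size` parameter, and the sample shapes are all assumptions; only the name `get_city_id` comes from the slides.

```python
import math
from collections import defaultdict


def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


class GridIndex:
    """Uniform grid: each cell lists the cities whose bounding box touches it."""

    def __init__(self, shapes, cell_size=1.0):
        self.cell_size = cell_size
        self.shapes = dict(shapes)  # city_id -> polygon
        self.cells = defaultdict(list)
        for city_id, polygon in self.shapes.items():
            xs = [p[0] for p in polygon]
            ys = [p[1] for p in polygon]
            for cx in range(self._cell(min(xs)), self._cell(max(xs)) + 1):
                for cy in range(self._cell(min(ys)), self._cell(max(ys)) + 1):
                    self.cells[(cx, cy)].append(city_id)

    def _cell(self, value):
        return math.floor(value / self.cell_size)

    def get_city_id(self, x, y):
        # Check only the few candidate cities in this cell, not every city.
        for city_id in self.cells.get((self._cell(x), self._cell(y)), ()):
            if point_in_polygon(x, y, self.shapes[city_id]):
                return city_id
        return None


index = GridIndex(
    {1: [(0, 0), (4, 0), (4, 4), (0, 4)], 2: [(10, 10), (12, 10), (12, 12), (10, 12)]},
    cell_size=2.0,
)
```

Because the index is small, it can be shipped to every executor once (in Spark, via a broadcast variable) and each trip row then needs only a local lookup, which avoids the cartesian join entirely.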
• Geo-spatial processing
• SCBuilder
• Kafka dispersal
• SparkChamber
• SparkPlug
Spark Debugging
1. Tons of local log files across many machines.
2. Overall file size is huge and difficult to handle on a single machine.
3. Painful to debug: which log is useful?
Spark Chamber
Distributed Log Debugger for Spark
Extend Spark Shell by Hooks.
Easy to adopt for Spark developers.
Interactive
[Figure: Spark Chamber session screenshot]
Spark Chamber
Distributed Log Debugger for Spark
1. Get all recent Spark Application IDs.
2. Get first exception, all exceptions grouped by types sorted by time, etc.
3. Display CPU, memory, I/O metrics.
4. Dive into a specific driver/executor/machine
5. Search
Features
Spark Chamber
Distributed Log Debugger for Spark
Developer mode: debug developer’s own Spark job.
SRE mode: view and check all users’ Spark job information.
Security
Spark Chamber
Enable Yarn Log Aggregation
● YARN aggregates log files on HDFS.
● Files are named after host names.
● All application IDs of the same user are under the same place.
● One machine has one log file, regardless of # executors on that machine.
[Figure: HDFS directory tree. Recoverable path labels: /tmp/logs/username/logs/ and /tmp/logs/username/]
Spark Chamber
Use Spark to debug Spark
Extend the Spark Shell by Hooks:
1. For ONE application Id, distribute log files to different executors.
2. Extract each line and save it into a DataFrame.
3. Sort log dataframe by time and hostname.
4. Retrieve target log via SparkSQL DataFrame APIs.
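The steps above can be sketched on in-memory lines with plain Python rather than Spark DataFrames. This is only an illustration under assumptions: the log line layout (`yy/MM/dd HH:mm:ss LEVEL Source: message`), the function names, and the sample hosts are hypothetical, and the real tool runs these steps distributed across executors.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed line layout: "16/06/07 10:15:02 ERROR Executor: message"
LINE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (\w+) ([^:]+): (.*)$")
# Matches qualified names ending in Exception or Error, e.g. java.io.IOException.
EXCEPTION = re.compile(r"\b([\w.]+(?:Exception|Error))\b")


def parse_logs(files):
    """files maps hostname -> list of raw lines; returns records sorted by time, host."""
    records = []
    for host, lines in files.items():
        for line in lines:
            match = LINE.match(line)
            if match is None:
                continue  # skip lines without a timestamp, e.g. stack-trace frames
            stamp, level, source, message = match.groups()
            records.append({
                "time": datetime.strptime(stamp, "%y/%m/%d %H:%M:%S"),
                "host": host,
                "level": level,
                "source": source,
                "message": message,
            })
    records.sort(key=lambda r: (r["time"], r["host"]))
    return records


def first_exception(records):
    """Earliest record whose message names an exception type, or None."""
    for record in records:
        if EXCEPTION.search(record["message"]):
            return record
    return None


def exceptions_by_type(records):
    """Group records by exception type; each group keeps the time-sorted order."""
    groups = defaultdict(list)
    for record in records:
        match = EXCEPTION.search(record["message"])
        if match:
            groups[match.group(1)].append(record)
    return dict(groups)


files = {
    "host-b": ["16/06/07 10:15:05 ERROR Executor: java.io.IOException: disk full",
               "\tat some.Frame(File.scala:1)"],
    "host-a": ["16/06/07 10:15:02 INFO Executor: Running task 0",
               "16/06/07 10:15:03 ERROR Executor: java.lang.OutOfMemoryError: Java heap space"],
}
records = parse_logs(files)
```

Sorting by time before searching is what makes "first exception" meaningful across machines: the earliest failure is often the root cause, and later exceptions on other hosts may only be symptoms.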
• Geo-spatial processing
• SCBuilder
• Kafka dispersal
• SparkChamber
• SparkPlug
• SparkChamber
Engineers SRE
API
Tools
Future Work
Spark Chamber
SRE version - Cluster wide insights
● Dimensions - Jobs
○ All
○ Single team
○ Single engineer
● Dimensions - Time
○ Last month, week, day
● Dimensions - Hardware
○ Specific rack, pod
Spark Chamber
SRE version - Analytics and Machine Learning
● Analytics
○ Resource auditing
○ Data access auditing
● Machine Learning
○ Failures diagnostics
○ Malicious jobs detection
○ Performance optimization
• Geo-spatial processing
• SCBuilder
• Kafka dispersal
• Hive table registration (Didn’t cover today)
• Incremental processing (Didn’t cover today)
• Debug logging
• Metrics
• Configurations
• Data Freshness
• Resource usage
• SparkChamber
• SparkPlug
• Unit testing (Didn’t cover today)
• Oozie integration (Didn’t cover today)
• SparkChamber
• Resource usage auditing
• Data access auditing
• Machine learning on jobs
Engineers SRE
API
Tools
Future Work
Today, Tuesday, June 7
4:50 PM – 5:20 PM
Room: Ballroom B
SPARK: INTERACTIVE TO PRODUCTION
Dara Adib, Uber
Tomorrow, Wednesday, June 8
5:25 PM – 5:55 PM
Room: Imperial
Locality Sensitive Hashing by Spark
Alain Rodriguez, Fraud Platform, Uber
Kelvin Chu, Hadoop Platform, Uber
Thank you
