Apache Spark
Lightning Fast Cluster Computing
Eric Mizell – Director, Solution Engineering
Page2 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
What is Apache Spark?
Apache Open Source Project
Distributed Compute Engine
for fast and expressive data processing
Designed for Iterative, In-Memory
computations and interactive data mining
Expressive Multi-Language APIs
for Java, Scala, Python, and R
Powerful Abstractions
Enable data workers to rapidly iterate over
data for:
• ETL, Machine Learning, SQL, Stream Processing,
and Graph Processing
[Diagram: the Spark stack: Scala, Java, and Python APIs; the Spark Core Engine; and the Spark SQL, Spark Streaming, MLlib, and GraphX libraries]
Why Spark?
Elegant Developer APIs
• Data Frames/SQL, Machine Learning, Graph algorithms and streaming
• Scala, Python, Java and R
• Single environment for pre-processing and Machine Learning
In-memory computation model
• Effective for iterative computations and machine learning
Machine Learning On Hadoop
• Implementation of distributed ML-algorithms
• Pipeline API (Spark ML)
Runs on Hadoop YARN, on Mesos, or standalone
Interactions with Spark
Command Line
• Scala shell – Scala/Java (./bin/spark-shell)
• Python shell (./bin/pyspark)
Notebooks
• Apache Zeppelin Notebook
• Jupyter/IPython Notebook
• IRuby Notebook
ODBC/JDBC (Spark SQL only via Thrift)
• Simba driver
• DataDirect driver
Introducing Apache Zeppelin: a web-based notebook for
interactive analytics
Features
Ad-hoc experimentation
Deeply integrated with Spark + Hadoop
Supports multiple language backends
Incubating at Apache
Use Case
Data exploration and discovery
Visualization
Interactive snippet-at-a-time experience
“Modern Data Science Studio”
Fundamental Abstraction: Resilient Distributed Datasets (RDDs)
Work with distributed collections as
primitives
RDD Properties
• Immutable collections of objects spread across
a cluster
• Built through parallel transformations (map,
filter, etc.)
• Automatically rebuilt on failure
• Controllable persistence (e.g. caching in RAM)
Multiple Languages
broad developer, partner and customer
engagement
[Diagram: a logical RDD is physically split into partitions (Partition 1, 2, 3), each on a worker node. The developer writes RDD code that runs in the Spark driver:]
sc = new SparkContext
rdd = sc.textFile("hdfs://…")
rdd.filter(…)
rdd.cache()
rdd.count()
rdd.map(…)
…
RDDs are collections of objects distributed across a cluster,
cached in RAM or on disk. They are built through parallel
transformations, automatically rebuilt on failure and immutable
(each transformation creates a new RDD).
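One way to picture "immutable" and "automatically rebuilt on failure" is a toy model in plain Python. This is only a sketch, not Spark code: the `ToyRDD` class, its `lose_cache` method, and the sample data are hypothetical names invented for illustration. Each transformation returns a new object that remembers its parent and the function applied (its lineage), so a lost in-memory copy can be recomputed from the source.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable, with recorded lineage."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source      # only set on the base dataset
        self._parent = parent      # lineage: the dataset this one derives from
        self._fn = fn              # lineage: how to derive it from the parent
        self._cache = None         # optional in-memory copy

    def map(self, f):
        # A transformation never mutates self; it returns a new ToyRDD.
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        # Eager here for simplicity; Spark's cache() is filled lazily,
        # on the first action that computes the data.
        self._cache = self._compute()
        return self

    def lose_cache(self):
        # Simulates a worker failure that drops the in-memory copy.
        self._cache = None

    def _compute(self):
        if self._cache is not None:
            return self._cache
        if self._parent is None:
            return list(self._source)
        return self._fn(self._parent._compute())

    def collect(self):
        return self._compute()


base = ToyRDD(source=[1, 2, 3, 4, 5, 6])
evens = base.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x).cache()

before = squares.collect()
squares.lose_cache()          # "failure": the cached data is gone
after = squares.collect()     # rebuilt by replaying the lineage
```

In this sketch `base` is never modified, and `after` equals `before` because the filter and map steps are replayed from the original source.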
What can developers do with RDDs?
RDD Operations
Transformations
• e.g. map, filter, groupBy, join
• Lazy operations to build RDDs from other
RDDs
Actions
• e.g. count, collect, save
• Return a result or write it to storage
Other primitives
• Accumulator
• Broadcast Variables
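The split between lazy transformations and actions can be sketched without Spark. The `LazyDataset` class and the `tag` function below are hypothetical, plain-Python stand-ins: transformations only extend a plan, and nothing runs until an action asks for a result.

```python
calls = []  # records every time the map function actually runs


def tag(x):
    calls.append(x)
    return x * 10


class LazyDataset:
    """Hypothetical toy: transformations build a plan, actions run it."""

    def __init__(self, data, plan=()):
        self._data = data
        self._plan = plan  # tuple of (kind, function) steps, not yet executed

    # Transformations: return a new dataset with a longer plan.
    def map(self, f):
        return LazyDataset(self._data, self._plan + (("map", f),))

    def filter(self, pred):
        return LazyDataset(self._data, self._plan + (("filter", pred),))

    # Actions: execute the plan and return a result to the caller.
    def collect(self):
        rows = iter(self._data)
        for kind, f in self._plan:
            rows = map(f, rows) if kind == "map" else filter(f, rows)
        return list(rows)

    def count(self):
        return len(self.collect())


ds = LazyDataset(range(1, 6)).map(tag).filter(lambda x: x > 20)
ran_before_action = len(calls)   # nothing has executed yet
result = ds.collect()            # the action triggers the whole plan
ran_after_action = len(calls)
```

Here `tag` is not called while the plan is being built; it runs once per input element only when `collect()` is invoked.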
[Diagram: the developer writes RDD operations: transformations, actions, accumulators, and broadcast variables]
Example: Mining Console Logs
Load error messages from a log into memory, then
interactively search for patterns

# `spark` here is a SparkContext (the pyspark shell exposes one as `sc`)
lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])         # transformed RDD: third tab-separated field
messages.cache()                                          # keep messages in memory for reuse
messages.filter(lambda s: "foo" in s).count()             # action
messages.filter(lambda s: "bar" in s).count()             # action
. . .

Result: full-text search of Wikipedia in <1 sec
(vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec
(vs 170 sec for on-disk data)
RDD Demo
Spark SQL: SQL Access and Data Frames
Spark SQL
Table Structure
integrated to work with tables and rows
Hive Queries via Spark
the Spark SQL context can connect to Hive and
run Hive queries
Bindings
to Python, Scala, Java, and R
Data Frames
a new abstraction that simplifies and speeds up SQL
processing
[Diagram: Spark SQL stack: Data Frame DSL, Spark SQL, Data Frame API, Data Source API, Spark Core Engine, YARN, HDFS]
What are Data Frames?
Data Frames represent data in RDDs as a table
RDD is a low-level abstraction
– Think of an RDD as bytecode and a DataFrame as the
Java program
Data Frame Properties
– Data Frames attach a schema to RDDs
– The schema lets Spark perform aggressive query
optimizations
– Brings the power of SQL to RDDs!
dept  name      age
Bio   H Smith   48
CS    A Turing  54
Bio   B Jones   43
Phys  E Witten  61
[Diagram: tuples, relational view, and storage formats: columnar storage (ORCFile, Parquet) and unstructured data (JSON, CSV, Text, Avro, Custom, Weblog)]
Data Frames are intuitive
Find average age by department?
dept  name      age
Bio   H Smith   48
CS    A Turing  54
Bio   B Jones   43
Phys  E Witten  61
[Code images: RDD Example and Equivalent Data Frame Example]
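As a rough illustration of the contrast this slide draws, here is the same question answered two ways in plain Python. Neither half uses the Spark API: `reduce_by_key` and `group_avg` are hypothetical helpers standing in for the RDD style (explicit key/value mechanics) and the Data Frame style (a declarative query over named columns).

```python
from collections import defaultdict
from functools import reduce

rows = [
    ("Bio", "H Smith", 48),
    ("CS", "A Turing", 54),
    ("Bio", "B Jones", 43),
    ("Phys", "E Witten", 61),
]

# RDD style: the developer spells out the mechanics with key/value pairs.
pairs = [(dept, (age, 1)) for dept, _name, age in rows]   # map step


def reduce_by_key(kv_pairs, combine):
    """Plain-Python stand-in for a reduce-by-key step."""
    grouped = defaultdict(list)
    for key, value in kv_pairs:
        grouped[key].append(value)
    return {key: reduce(combine, values) for key, values in grouped.items()}


sums = reduce_by_key(pairs, lambda a, b: (a[0] + b[0], a[1] + b[1]))
rdd_style = {dept: total / n for dept, (total, n) in sums.items()}


# Data Frame style: the data carries column names, so the query can be
# stated declaratively as "group by dept, average age".
def group_avg(records, by, column):
    """Hypothetical helper: average of `column` per distinct `by` value."""
    acc = defaultdict(list)
    for rec in records:
        acc[rec[by]].append(rec[column])
    return {key: sum(vals) / len(vals) for key, vals in acc.items()}


records = [dict(zip(("dept", "name", "age"), r)) for r in rows]
df_style = group_avg(records, by="dept", column="age")
```

Both approaches give the same averages; the second states what is wanted by column name and leaves the how to the helper, which is the part an engine with a schema can optimize.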
DataFrame Demo
MLlib: Machine Learning Library
What is Machine Learning?
Machine learning is the study of
algorithms that learn concepts from
data.
A key aspect of learning is
generalization: how well a learning
algorithm is able to predict on unseen
examples.
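Generalization is usually measured by fitting on one set of examples and scoring on examples the algorithm never saw. A minimal sketch, assuming a one-variable least-squares line and made-up data points (none of the numbers come from the deck):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, xs, ys):
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: training examples and held-out (unseen) examples.
train_x, train_y = [0, 1, 2, 3], [1.1, 2.9, 5.2, 6.8]
test_x, test_y = [4, 5], [9.0, 11.1]

model = fit_line(train_x, train_y)                          # learn from training data only
train_mse = mean_squared_error(model, train_x, train_y)
test_mse = mean_squared_error(model, test_x, test_y)        # error on unseen examples
```

The held-out error `test_mse` is the number that speaks to generalization; `train_mse` only shows how well the line fits the data it was given.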
Machine Learning Primitives
Unsupervised Learning
Clustering (K-means)
Recommendation
Collaborative Filtering
- alternating least squares
Dimensionality Reduction
- Principal component analysis (PCA) and singular
value decomposition (SVD)
Supervised Learning
Classification
- Naïve Bayes, Decision Tree, Random Forest,
Gradient Boosted Trees
Regression and linear models
- linear regression, logistic regression and linear Support Vector Machines
(SVMs)
ML Workflows are complex
[Diagram: Sponsored Search Advertising Pipeline, with stages: Ad Server, HDFS, Log Parsing and Cleaning, Feature Extraction (Q-Q and Q-A similarity, Ad category mapping, Query category mapping, Poly Exp (Q-A)), Features, Linear Solver, Model, train, test, Metrics]
Challenges:
-> specify pipeline
-> inspect and debug
-> tune hyperparameters
-> productionize
ML Pipeline makes ML workflows easier
Transformer
Transforms one dataset into another
Estimator
Fits model to data
Pipeline
Sequence of stages, consisting of estimators
or transformers
Parameters
Trait for components that take parameters
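The three roles above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the Spark ML API: `Scaler`, `MeanThresholdEstimator`, `MeanThresholdModel`, `Pipeline`, and `apply_stages` are hypothetical names, and the "model" is just a threshold at the mean of the feature.

```python
class Scaler:
    """Transformer: maps one dataset to another, no fitting needed."""

    def __init__(self, factor):
        self.factor = factor

    def transform(self, rows):
        return [(x * self.factor, y) for x, y in rows]


class MeanThresholdModel:
    """Transformer produced by fitting: appends a 0/1 prediction."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [(x, y, int(x > self.threshold)) for x, y in rows]


class MeanThresholdEstimator:
    """Estimator: fit() learns from data and returns a model (a Transformer)."""

    def fit(self, rows):
        xs = [x for x, _ in rows]
        return MeanThresholdModel(sum(xs) / len(xs))


class Pipeline:
    """Sequence of stages; fitting it yields a list of fitted transformers."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator: fit it on the current data
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)   # pass transformed data downstream
        return fitted


def apply_stages(fitted, rows):
    for stage in fitted:
        rows = stage.transform(rows)
    return rows


train = [(1.0, 0), (2.0, 0), (3.0, 1), (6.0, 1)]
fitted = Pipeline([Scaler(10), MeanThresholdEstimator()]).fit(train)
predictions = apply_stages(fitted, [(1.5, 0), (5.0, 1)])
```

The design point is that fitting a pipeline turns every estimator into a transformer, so the fitted result can be applied to new data as one unit. Spark returns a fitted pipeline model object for this; the toy returns a plain list.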
Spark Streaming: Real Time Stream Processing
Spark Streaming
• Spark Streaming is an extension of the core Spark API that supports scalable, high-throughput,
and fault-tolerant streaming applications.
• Data can be ingested from many sources, such as Kafka, Flume, Twitter, ZeroMQ, or
TCP sockets
• Data is processed using the now-familiar API: map, filter, reduce, join, and window
• Processed data can be pushed out to databases, filesystems, or live dashboards
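Spark Streaming handles a stream as a series of small batches, and a window operation combines the most recent ones. The following plain-Python sketch imitates that idea without Spark; the `windowed_counts` function, the window size, and the sample lines are assumptions made for illustration.

```python
from collections import Counter, deque


def windowed_counts(batches, window_size):
    """Yield word counts over the last `window_size` micro-batches."""
    window = deque(maxlen=window_size)   # oldest batch drops out automatically
    for batch in batches:
        # Per-batch processing: split lines into lowercase words.
        words = [w.lower() for line in batch for w in line.split()]
        window.append(Counter(words))
        total = Counter()
        for counts in window:            # reduce across the window
            total.update(counts)
        yield dict(total)


# Each inner list is one micro-batch of lines arriving from a source.
stream = [["spark streaming"], ["Spark SQL"], ["graphx"]]
results = list(windowed_counts(stream, window_size=2))
```

With a window of two batches, the counts emitted after the third batch no longer include words that appeared only in the first.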
GraphX: Graph Processing
Spark GraphX: Graph API on Spark
Seamlessly work with graphs and collections
Growing library of graph algorithms
• SVD++, Connected Components, Triangle
Count, …
Iterative Graph Computations using
Pregel
Implements Valiant’s Bulk Synchronous
Parallel (BSP) model for distributing graph
algorithms.
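The Pregel style of iteration can be sketched in plain Python: in each superstep every vertex reads its incoming messages, updates its own state, and sends messages that are delivered only in the next superstep. The example below computes connected components this way. It does not use the GraphX API; the function name and the small graph are hypothetical.

```python
def connected_components(vertices, edges):
    """Label each vertex with the smallest vertex id reachable from it."""
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    label = {v: v for v in vertices}          # initial vertex state
    # Superstep 0: every vertex sends its label to its neighbors.
    inbox = {v: [label[n] for n in neighbors[v]] for v in vertices}
    supersteps = 0

    while any(inbox.values()):                # stop when no messages are in flight
        supersteps += 1
        outbox = {v: [] for v in vertices}
        for v, messages in inbox.items():     # vertex program, run per vertex
            if not messages:
                continue
            best = min(messages)
            if best < label[v]:
                label[v] = best
                for n in neighbors[v]:        # notify neighbors of the change
                    outbox[n].append(best)
        inbox = outbox                        # barrier: messages delivered next step
    return label, supersteps


edges = [("A", "B"), ("B", "C"), ("D", "E")]
labels, steps = connected_components("ABCDEF", edges)
```

Vertices that end up with the same label belong to the same component; the loop halts once a superstep produces no new messages.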
Use Case
Social Media: Suggest new connections based
on existing relationships
Networking: Best routing through a given
network
Distributed Graphs as Tables (RDDs)
[Diagram: a property graph with vertices A to F is stored as three RDD tables, each split across Part. 1 and Part. 2 using a 2D vertex cut heuristic: a Vertex Table, a Routing Table, and an Edge Table]
Edge Table (RDD): A-B, A-C, C-D, B-C, A-E, A-F, E-F, E-D
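The table layout can be imitated with ordinary Python collections. In this sketch the edge list is taken from the slide, but the split of edges into two partitions (first four, last four) and the vertex properties are assumptions, and the routing table is simply derived from that split. It is not how GraphX stores data internally, only an illustration of the idea that a vertex cut partitions edges and records where each vertex is needed.

```python
from collections import defaultdict

# Vertex table: one row per vertex (the property values here are made up).
vertex_table = {v: {"name": v} for v in "ABCDEF"}

# Edge table, split into two partitions (assumed split, see the text above).
edge_partitions = {
    1: [("A", "B"), ("A", "C"), ("C", "D"), ("B", "C")],
    2: [("A", "E"), ("A", "F"), ("E", "F"), ("E", "D")],
}

# Routing table: for each vertex, which edge partitions reference it.
routing = defaultdict(set)
for part, edges in edge_partitions.items():
    for src, dst in edges:
        routing[src].add(part)
        routing[dst].add(part)
routing_table = {v: sorted(routing[v]) for v in sorted(vertex_table)}

# Vertices listed in more than one partition are the ones the cut replicates.
cut_vertices = [v for v, parts in routing_table.items() if len(parts) > 1]
```

Under this assumed split, only the vertices that appear in both edge partitions need their data shipped to more than one place.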
How to Get Started with Spark
Try Spark Today
Download the Hortonworks Sandbox
http://hortonworks.com/products/hortonworks-sandbox/
Go to the Apache Spark Website
http://spark.apache.org/
Learn Spark
Build a Proof of Concept
Test New Functionality
Thank You!
Eric Mizell - Director, Solutions Engineering
emizell@hortonworks.com

Editor's Notes

  • #6: TALK TRACK
    Ad-hoc experimentation: Spark, Hive, Shell, Flink, Tajo, Ignite, Lens, etc.
    Deeply integrated with Spark + Hadoop: can be managed via Ambari Stacks
    Supports multiple language backends: pluggable "Interpreters"
    Incubating at Apache: 100% open source and open community
  • #7, #8, #18, #24: TALK TRACK Ensuring Spark is well integrated with YARN, Ambari, and Ranger enables enterprises to deploy Spark apps with confidence, and since HDP is available across Windows, Linux, on-premises and cloud deployment environments, it is that much easier for enterprises to adopt it.
  • #7, #8: http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pd
  • #9: Key idea: add "variables" to the "functions" in functional programming
  • #14: Spark DataFrames represent tabular data
  • #24: [RESOURCES] A vertex is an entity that can carry a bag of data (generally small). An edge connects vertices and can also own a bag of data. https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf
  • #28: Takeaways: change the order of the interoperability slide; flesh out the no-lock-in slide to talk about "proprietary open source"