APACHE SPARK: HANDS ON
Andy Grove, Chief Architect
Dan Lynn, CEO
FOLLOW ALONG!
• Download IntelliJ Community Edition
• http://tiny.cc/get-intellij
• Snag our example code
• http://tiny.cc/agildata-spark
• git clone git@github.com:codefutures/apache-spark-examples.git
Andy Grove
Co-Founder & Chief Architect
Co-Founder @ Orbware Technologies (acquired 2000)
Inventor of Firestorm/DAO



andy@agildata.com
• Providers of dbShards
• Relational Database Scaling

• Big Data Consulting
• Data Strategy
• Data Architecture Reviews
• Big Data Training
• Solution Implementation
• Distributed over 6 states!
• Headquartered in Broomfield, CO
www.agildata.com
Dan Lynn
CEO
Co-Founder @ FullContact
15 years building software
Techstars 2011



dan@agildata.com
AGENDA
• Part 1 - Overview of Spark
• Motivation, APIs, Ecosystem, Simple Example
• Part 2 - Hands On
• Work through a real data problem
PART 1: AN OVERVIEW OF SPARK
A BRIEF HISTORY LESSON
• First there was Hadoop
• Goal: Process petabytes of constantly-growing data
• “Move the processing to the data”
• But MapReduce was difficult to program
• So they made Pig, Hive, Cascading, etc…
A BRIEF HISTORY LESSON
• MapReduce was also very reliable
• But it performed poorly on iterative tasks like machine learning.
• So in 2009, UC Berkeley started on a new approach
• Keeping data in memory as much as possible.
A BRIEF HISTORY LESSON
• They called it “Spark”
• After strong community adoption, it entered the Apache Incubator in 2013 and became a top-level Apache project in 2014.
• Since then, it has gained mainstream acceptance.
• “Potentially the Most Significant Open Source Project of the Next Decade” - IBM, June 15, 2015
A BRIEF HISTORY LESSON
• Huge ecosystem
• Machine learning: MLlib, Mahout
• Graph processing: GraphX
• Read from / write to anything that Hadoop can
• Tons of community contributions: spark-packages.org
• Zeppelin: Python-style interactive notebooks
CONCEPTS
CONCEPTS - RDD
RDD aka “Resilient Distributed Dataset”
your_data          <-- an RDD
f(your_data)       <-- also an RDD
g(f(your_data))    <-- so is this
RDD - SECRET INTERNALS!!!11
/**
 * Tells the Spark framework *where* the data is.
 */
protected Partition[] getPartitions();

/**
 * Iterates through the data for a given partition.
 */
Iterator<T> compute(Partition split, TaskContext context);
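To make those two methods concrete, here is a minimal plain-Java sketch. It is not Spark's real class hierarchy: ToyRdd, ListRdd and MappedRdd are hypothetical names, and a partition is reduced to an integer index. The point it tries to show is that a derived dataset only needs to know its partitions and how to iterate one of them by pulling from its parent.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for an RDD: NOT Spark's real class hierarchy.
abstract class ToyRdd<T> {
    // Where the data is: here just a count of partitions (indices 0..n-1).
    abstract int numPartitions();

    // Iterates through the data for a given partition.
    abstract Iterator<T> compute(int partition);

    // A transformation: returns a new ToyRdd without touching any data yet.
    <R> ToyRdd<R> map(Function<T, R> f) {
        return new MappedRdd<>(this, f);
    }

    // An action: walks every partition and gathers the results.
    List<T> collect() {
        List<T> out = new ArrayList<>();
        for (int p = 0; p < numPartitions(); p++) {
            compute(p).forEachRemaining(out::add);
        }
        return out;
    }
}

// Source dataset backed by in-memory lists, one list per partition.
class ListRdd<T> extends ToyRdd<T> {
    private final List<List<T>> partitions;

    ListRdd(List<List<T>> partitions) {
        this.partitions = partitions;
    }

    @Override
    int numPartitions() {
        return partitions.size();
    }

    @Override
    Iterator<T> compute(int partition) {
        return partitions.get(partition).iterator();
    }
}

// Derived dataset: same partitioning as its parent, elements mapped on demand.
class MappedRdd<T, R> extends ToyRdd<R> {
    private final ToyRdd<T> parent;
    private final Function<T, R> f;

    MappedRdd(ToyRdd<T> parent, Function<T, R> f) {
        this.parent = parent;
        this.f = f;
    }

    @Override
    int numPartitions() {
        return parent.numPartitions();
    }

    @Override
    Iterator<R> compute(int partition) {
        Iterator<T> it = parent.compute(partition);
        return new Iterator<R>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public R next() { return f.apply(it.next()); }
        };
    }
}
```

Calling map(...) on a ListRdd only wraps the parent; nothing is iterated until collect() asks each partition for its data.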
RDD - PUBLIC API
• Transformations
• Make new RDDs by applying transformation functions. Evaluated lazily.
• Actions
• Trigger the computation: write to HDFS, write to databases, yield an answer, etc…
Two Options
RDD - PUBLIC API
• Transformations
• .map(func) .filter(func) .flatMap(func) .sample(…)
• Actions
• .collect() .saveAsTextFile(path) .reduce(func) .take(n)
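The transformation / action split behaves much like the intermediate / terminal split in Java 8 streams. The sketch below uses only java.util.stream, not Spark, so treat it as an analogy; LazyDemo and its call counter are hypothetical names introduced for illustration. It shows that building the pipeline runs nothing, and that work only happens once a result is requested.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyDemo {
    // Counts how many times the map function actually runs.
    static final AtomicInteger calls = new AtomicInteger();

    // "Transformations": builds the pipeline but processes nothing yet.
    static Stream<Integer> pipeline(List<Integer> input) {
        return input.stream()
                .map(x -> { calls.incrementAndGet(); return x * x; })
                .filter(x -> x % 2 == 1);
    }

    // "Action": the terminal operation forces the pipeline to run.
    static List<Integer> run(Stream<Integer> pipeline) {
        return pipeline.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Stream<Integer> p = pipeline(Arrays.asList(1, 2, 3, 4, 5));
        System.out.println("map calls before action: " + calls.get());
        System.out.println("result: " + run(p));
        System.out.println("map calls after action: " + calls.get());
    }
}
```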
EXECUTION MODEL
SPARK EXECUTION MODEL
https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
What’s this?
SPARK EXECUTION MODEL
• Cluster Managers
• Apache Mesos
• YARN (the resource manager introduced in Hadoop 2.0)
• Spark’s native cluster manager
NEW(ER) SPARK APIs
SPARK SQL / DATAFRAME API
• New in Spark 1.3. The core engine behind Spark SQL
• If RDDs are transformations that apply to JVM objects…
• Schema (i.e. the class) is passed along with each datum
• Serialization pain. GC pain.
• …then DataFrames are transformations that apply to data
• Schema is defined for the entire set
• Data is transmitted independent of schema. JVM data access incurs much less GC overhead
• DataFrames have more optimized execution logic, i.e. a query planner (the Catalyst optimizer)
DATASET API
• New in Spark 1.6
• Addressed specific deficiencies in DataFrames
• DataFrames lack compile-time type-checking.
• Datasets look like RDDs, but perform like DataFrames
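The compile-time type-checking point can be illustrated without Spark. In the plain-Java sketch below, a Map stands in for an untyped row whose columns are looked up by name, and a small class stands in for a typed element. Person, TypeCheckDemo and the sample values are hypothetical, chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Typed element, loosely analogous to the element type of a Dataset (hypothetical class).
final class Person {
    final String name;
    final int age;

    Person(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

class TypeCheckDemo {
    // Untyped row, loosely analogous to a DataFrame row: columns are looked up
    // by name at runtime, so a misspelled column is only found when the code runs.
    static Object lookup(Map<String, Object> row, String column) {
        if (!row.containsKey(column)) {
            throw new IllegalArgumentException("no such column: " + column);
        }
        return row.get(column);
    }

    // Typed access: a misspelled field (person.agee) would not compile at all.
    static int ageOf(Person person) {
        return person.age;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("name", "Ada");
        row.put("age", 36);

        System.out.println(ageOf(new Person("Ada", 36)));
        try {
            lookup(row, "agee"); // the typo compiles fine and fails at runtime
        } catch (IllegalArgumentException e) {
            System.out.println("runtime failure: " + e.getMessage());
        }
    }
}
```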
SPARK API CHOICES
Java Scala
RDD
DataFrame sketchy…
Spark SQL
Dataset exciting, but very new exciting, but very new
QUICK EXAMPLE
• Let’s count Shakespeare’s favorite words!
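The deck's own word-count code lives in the example repository and is not reproduced here. As a stand-in, this plain Java 8 streams sketch (no Spark involved) shows the usual shape of the job: split lines into words, normalize them, then count per word. WordCount and count are hypothetical names, and splitting on whitespace is an assumption about how words are delimited.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class WordCount {
    // Splits each line into words, lowercases them, and counts occurrences.
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+"))) // line -> words
                .map(String::toLowerCase)
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("To be or not to be", "that is the question");
        System.out.println(count(lines));
    }
}
```

The flatMap step mirrors the RDD operation of the same name: one input line yields many output words.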
PART 2: HANDS ON
PART 2: HANDS ON
• The problem: Rank Colorado counties by gender ratio.
• The data: US census data from 2010
• The approach:
• RDD API (in both Java 8 and Scala)
• DataFrame API / Spark SQL
• Dataset API
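Independent of which Spark API is used, the core of the problem is a ratio per county followed by a sort. A minimal plain-Java sketch of that logic follows. CountyRow and GenderRatio are hypothetical names, the row shape is an assumption rather than the census file layout, and the numbers are invented placeholders, not census figures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical row shape; the real census files have their own layout.
final class CountyRow {
    final String county;
    final long males;
    final long females;

    CountyRow(String county, long males, long females) {
        this.county = county;
        this.males = males;
        this.females = females;
    }

    // Males per female; assumes females > 0.
    double ratio() {
        return (double) males / females;
    }
}

class GenderRatio {
    // Returns a copy of the rows sorted by male-to-female ratio, highest first.
    static List<CountyRow> rank(List<CountyRow> rows) {
        List<CountyRow> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingDouble(CountyRow::ratio).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only, NOT real census data.
        List<CountyRow> rows = Arrays.asList(
                new CountyRow("County A", 500, 500),
                new CountyRow("County B", 600, 400),
                new CountyRow("County C", 450, 550));
        for (CountyRow row : rank(rows)) {
            System.out.printf("%s %.2f%n", row.county, row.ratio());
        }
    }
}
```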
REFERENCES
• http://spark.apache.org/research.html
• http://tiny.cc/agildata-spark
• http://spark-packages.org
Andy Grove
Co-Founder & Chief Architect
andy@agildata.com
@andygrove73
www.agildata.com
Dan Lynn
CEO
dan@agildata.com
@danklynn
Thanks!
