Navid Kalaei
Fatemeh Jamali
Dr. Esmaeili
Winter 2018
Contents:
• Spark
• Frameworks
• Ecosystem
• Resilient Distributed Datasets (RDD)
• A Simplified Data Flow
• Executors
• Iterative Operations
• Fault-tolerance
• Comparisons
• Who uses Spark?!
• Datasets
• DataFrame
• Scala
• Practices
• Pi Estimation
• Spark Stream
• Practice
• Compile and Deploy
• Spark SQL
• PageView
• References
Spark
• Apache Spark™ is a unified analytics engine for large-scale data processing.
• Created at UC Berkeley's AMPLab; its original developers went on to found Databricks
• Written in Scala
• Licensed under the Apache License 2.0
• Hosted on GitHub
Frameworks
Ecosystem
Hadoop          Spark
Hive            Spark SQL
Apache Mahout   MLlib
Impala          Spark SQL
Apache Giraph   GraphX
Apache Storm    Spark Streaming
Resilient Distributed Datasets
• The RDD is Spark's fundamental data structure; its partitions can be kept in memory.
• It is an immutable distributed collection of objects.
• Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
Resilient Distributed Datasets
• RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
• Formally, an RDD is a read-only, partitioned collection of records.
• An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
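The "read-only, partitioned collection" idea can be sketched in plain Python. This is not the Spark API: `partition` and `map_partitions` are hypothetical helper names, tuples stand in for immutable partitions, and a thread pool stands in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, Tuple


def partition(records: Sequence[int], num_partitions: int) -> Tuple[Tuple[int, ...], ...]:
    """Split records into read-only (tuple) partitions of near-equal size."""
    size, extra = divmod(len(records), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        parts.append(tuple(records[start:end]))
        start = end
    return tuple(parts)


def map_partitions(parts, fn: Callable[[int], int]):
    """Apply fn to every record, one worker per partition; returns NEW partitions."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        return tuple(pool.map(lambda p: tuple(fn(x) for x in p), parts))


numbers = partition(range(10), num_partitions=3)
squares = map_partitions(numbers, lambda x: x * x)
print(numbers)  # the original partitions are left unchanged
print(squares)
```

Because a transformation returns a new collection instead of modifying the old one, the partitions can be processed independently and in any order.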
A Simplified Data Flow
Executors
Iterative Operations
[Figure: MapReduce vs. Spark]
Fault-tolerance
• An RDD remembers the sequence of operations that created it from the original fault-tolerant input data
• Batches of input data are replicated in the memory of multiple worker nodes and are therefore fault-tolerant
• Data lost due to worker failure, can be
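The first bullet, remembering the sequence of operations that produced a dataset, can be illustrated with a toy class. This is only a sketch: `LineageDataset`, `lose_cache` and the other names are hypothetical, and the real mechanism in Spark works per partition across a cluster.

```python
from typing import Callable, List, Optional


class LineageDataset:
    """Toy dataset that records how it was derived instead of only storing results."""

    def __init__(self, source: List[int], ops: Optional[List[Callable[[int], int]]] = None):
        self.source = source      # stands in for the fault-tolerant input data
        self.ops = ops or []      # lineage: the sequence of operations applied
        self.cache: Optional[List[int]] = None

    def map(self, fn: Callable[[int], int]) -> "LineageDataset":
        # A transformation returns a new dataset with a longer lineage.
        return LineageDataset(self.source, self.ops + [fn])

    def compute(self) -> List[int]:
        # Replay the lineage over the source if no cached result is available.
        if self.cache is None:
            data = list(self.source)
            for fn in self.ops:
                data = [fn(x) for x in data]
            self.cache = data
        return self.cache

    def lose_cache(self) -> None:
        # Simulates a worker failure that drops the in-memory result.
        self.cache = None


ds = LineageDataset([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 10)
before = ds.compute()
ds.lose_cache()
after = ds.compute()  # rebuilt by replaying the two recorded operations
print(before, after)
```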
Breaking the Records!
Metric                        Hadoop MR Record               Spark Record                      Spark 1 PB
Data Size                     102.5 TB                       100 TB                            1000 TB
Elapsed Time                  72 mins                        23 mins                           234 mins
# Nodes                       2100                           206                               190
# Cores                       50400 physical                 6592 virtualized                  6080 virtualized
Cluster disk throughput       3150 GB/s (est.)               618 GB/s                          570 GB/s
Sort Benchmark Daytona Rules  Yes                            Yes                               No
Network                       dedicated data center, 10Gbps  virtualized (EC2) 10Gbps network  virtualized (EC2) 10Gbps network
Sort rate                     1.42 TB/min                    4.27 TB/min                       4.27 TB/min
Sort rate/node                0.67 GB/min                    20.7 GB/min                       22.5 GB/min
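The two derived rows (sort rate and sort rate per node) follow from data size, elapsed time and node count. The short calculation below recomputes them, assuming 1 TB = 1000 GB; `sort_rates` is a hypothetical helper name. Results computed from the whole-minute times differ slightly from some table entries, presumably because the elapsed times shown are rounded.

```python
# (label, data size in TB, elapsed minutes, nodes), copied from the table
runs = [
    ("Hadoop MR Record", 102.5, 72, 2100),
    ("Spark Record", 100.0, 23, 206),
    ("Spark 1 PB", 1000.0, 234, 190),
]


def sort_rates(size_tb: float, minutes: float, nodes: int):
    """Return (TB/min for the whole cluster, GB/min per node)."""
    rate_tb_min = size_tb / minutes
    return rate_tb_min, rate_tb_min * 1000 / nodes


for label, size_tb, minutes, nodes in runs:
    total, per_node = sort_rates(size_tb, minutes, nodes)
    print(f"{label}: {total:.2f} TB/min, {per_node:.2f} GB/min per node")
```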
Performance
[Figure: WordCount]
Performance
[Figure: PageRank]
Salary Comparison
Who uses Spark?!
Spark SQL and Dataset
• Spark SQL is a Spark module for structured data processing
• Its interfaces give Spark more information about the structure of the data and of the computation; Spark SQL uses this extra information to perform extra optimizations
• Dataset is an interface (added in Spark 1.6) that combines the benefits of RDDs with the benefits of Spark SQL’s optimized execution engine
DataFrame
• A DataFrame is a Dataset organized into named columns
• It is conceptually equivalent to a table in a relational database or a data frame in R/Python
• It benefits from richer optimizations under the hood
Scala
• Scala combines object-oriented and functional programming in one concise high-level language.
• Scala's static types help avoid bugs in complex applications
• Its JVM and JavaScript runtimes let you build high-performance systems
• It offers easy access to huge ecosystems of libraries.
Practices
• WordCount
• Pi Estimation
• Text Search
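WordCount and Text Search both reduce to a few collection operations. The sketch below shows their logic in plain Python on a tiny in-memory sample; it is not Spark code, and the function names and sample lines are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def word_count(lines: Iterable[str]) -> Counter:
    """Split every line into words (the map side) and sum the counts per word (the reduce side)."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def text_search(lines: Iterable[str], keyword: str) -> List[str]:
    """Keep only the lines that contain the keyword (a filter)."""
    return [line for line in lines if keyword in line]


sample = ["spark is fast", "hadoop is batch", "spark streaming is micro batch"]
print(word_count(sample))
print(text_search(sample, "spark"))
```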
Pi Estimation
[Figure: labels π/4, 1, 1]
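Assuming the slide refers to the usual Monte Carlo approach, the estimate rests on one ratio: a quarter circle of radius 1 covers π/4 of the unit square, so the fraction of uniformly random points that land inside it approaches π/4. A minimal single-machine sketch (the function name and seed are illustrative):

```python
import random


def estimate_pi(num_samples: int, seed: int = 0) -> float:
    """Monte Carlo estimate: 4 times the fraction of random points in the unit
    square that fall inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples


print(estimate_pi(200_000))
```

Each sample is independent of the others, which is why the loop is easy to split across many workers and sum at the end.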
What is streaming?
Spark Streaming
• Framework for large-scale stream processing
• Scales to 100s of nodes
• Provides a simple batch-like API for implementing complex algorithms
Stream Processing
• Run a streaming computation as a series of very small, deterministic batch jobs
• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as an RDD and processes it using RDD operations
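The micro-batch idea can be sketched without Spark: group timestamped events into windows of X seconds, then run the same deterministic job on every window. All names below are hypothetical, and the hard-coded list stands in for a live stream.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, text line)


def chop_into_batches(events: Iterable[Event], batch_seconds: float) -> Dict[int, List[str]]:
    """Group events into consecutive windows of batch_seconds, keyed by batch index."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for timestamp, line in events:
        batches[int(timestamp // batch_seconds)].append(line)
    return dict(batches)


def process_batch(lines: List[str]) -> Counter:
    """A deterministic batch job: count the words of one batch only."""
    return Counter(word for line in lines for word in line.split())


stream = [(0.2, "a b"), (0.9, "a"), (1.1, "b b"), (2.5, "c")]
batches = chop_into_batches(stream, batch_seconds=1.0)
results = {index: process_batch(lines) for index, lines in sorted(batches.items())}
print(results)
```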
Practices
• Run “Stateless NetworkWordCount”
• Compile and deploy a Java file
• Run “Stateful NetworkWordCount”
• Run “PageView” and its generator
• Execute simple data operations with Spark SQL
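The difference between the stateless and stateful word counts can be shown on a list of batches: the stateless version counts each batch on its own, while the stateful version carries a running total from batch to batch. This is a plain-Python sketch with hypothetical names; it reads from a list, not a network socket.

```python
from collections import Counter
from typing import Iterable, List


def stateless_counts(batches: Iterable[List[str]]) -> List[Counter]:
    """Each batch is counted on its own; nothing carries over."""
    return [Counter(batch) for batch in batches]


def stateful_counts(batches: Iterable[List[str]]) -> List[Counter]:
    """A running total is kept as state and updated by every new batch."""
    state: Counter = Counter()
    snapshots = []
    for batch in batches:
        state.update(batch)
        snapshots.append(state.copy())
    return snapshots


word_batches = [["spark", "spark"], ["hadoop"], ["spark"]]
print(stateless_counts(word_batches)[-1])
print(stateful_counts(word_batches)[-1])
```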
Compile and Deploy
Compile
1. Generate project with Maven
2. Copy the Java file to the end point
3. Edit the pom.xml
4. Package the project
Deploy
1. Submit the .jar file to Spark
Apache Maven
• Maven is a tool for building and managing any Java-based project. Its goals include:
• Making the build process easy
• Providing a uniform build system
• Providing quality project information
• Providing guidelines for best-practices development
• Allowing transparent migration to new features
pom.xml
• Contains project identifiers
• Defines source and target versions of Java to be used
• Declares dependencies, which are resolved from Maven’s repository
Let’s Compile and Deploy!
PageView
• The aim is to analyze the hit and miss ratio of a website
• The generator simulates 100 users from 2 regions on 10 threads
• Hit rate: 95%
• Users: 100
• Zip codes: 94709, 94117
• Pages: /index (0.7), /news (0.2), /contact (0.1)
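A generator with these parameters can be approximated in a few lines. The sketch is single-threaded and makes assumptions the slide does not state: hits and misses are modelled as HTTP status 200 and 404, and users are split evenly between the two zip codes. The names are hypothetical.

```python
import random
from collections import Counter

PAGES = {"/index": 0.7, "/news": 0.2, "/contact": 0.1}  # from the slide
ZIP_CODES = [94709, 94117]                              # from the slide
HIT_RATE = 0.95                                         # from the slide
NUM_USERS = 100                                         # from the slide


def generate_views(n: int, seed: int = 0):
    """Yield n simulated page views as (page, status, zip_code, user_id) tuples."""
    rng = random.Random(seed)
    pages, weights = zip(*PAGES.items())
    for _ in range(n):
        page = rng.choices(pages, weights=weights)[0]
        status = 200 if rng.random() < HIT_RATE else 404  # status codes are an assumption
        zip_code = rng.choice(ZIP_CODES)                  # even split is an assumption
        user_id = rng.randint(1, NUM_USERS)
        yield page, status, zip_code, user_id


views = list(generate_views(20_000))
hit_ratio = sum(1 for _, status, _, _ in views if status == 200) / len(views)
page_share = {page: count / len(views) for page, count in Counter(v[0] for v in views).items()}
print(f"hit ratio: {hit_ratio:.3f}", page_share)
```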
Spark SQL
• The Spark master node connects to the relational database and loads data from a specific table or with a specific SQL query
• The Spark master node distributes the data to worker nodes for transformation
• Each worker node connects to the relational database and writes its data
• The user can choose between row-by-row insertion and bulk insert
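The choice between row-by-row insertion and bulk insert can be illustrated with Python's built-in `sqlite3` module. An in-memory SQLite database stands in for the relational database here, and the table names are hypothetical; in the setup described above, each worker would write its own share of the data.

```python
import sqlite3

rows = [(i, f"user{i}") for i in range(1000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE row_by_row (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE bulk (id INTEGER, name TEXT)")

# Row-by-row: one INSERT call per record.
for row in rows:
    conn.execute("INSERT INTO row_by_row VALUES (?, ?)", row)

# Bulk: hand the whole batch to the driver in a single call.
conn.executemany("INSERT INTO bulk VALUES (?, ?)", rows)
conn.commit()

count_row_by_row = conn.execute("SELECT COUNT(*) FROM row_by_row").fetchone()[0]
count_bulk = conn.execute("SELECT COUNT(*) FROM bulk").fetchone()[0]
print(count_row_by_row, count_bulk)
```

Both paths store the same rows; the bulk path typically needs fewer round trips to the database, which matters more when the database is reached over a network.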
References
• [1] Apache Spark officially sets a new record in large-scale sorting
• [2] 2014 Data Science Salary Survey
• [3] The Performance Comparison of Hadoop and Spark
• [4] Apache Maven Project
• [5] The Scala Programming Language

