Apache Spark
Concepts - Spark SQL, GraphX, Streaming
Petr Zapletal Cake Solutions
Apache Spark and Big Data
1) History and market overview
2) Installation
3) MLlib and Machine Learning on Spark
4) Porting R code to Scala and Spark
5) Concepts - Spark SQL, GraphX, Streaming
6) Spark’s distributed programming model
7) Deployment
Table of contents
● Resilient Distributed Datasets
● Spark SQL
● GraphX
● Spark Streaming
● Q & A
Spark Modules
Resilient Distributed Datasets
● Immutable, distributed collection of records
● Lazy evaluation; can be cached or persisted
● Supports a number of transformations and actions
● Can be created from data in storage or from another RDD
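The RDD properties above (immutability, lazy evaluation, optional caching) can be illustrated with a small plain-Python model. This is only a sketch: `LazyDataset` and `load` are hypothetical names invented here, not part of the Spark API, and nothing is distributed.

```python
from typing import Callable, Iterable, List, Optional


class LazyDataset:
    """Toy stand-in for an RDD: immutable, lazily evaluated, optionally cached."""

    def __init__(self, source: Callable[[], Iterable]):
        self._source = source                 # recipe that (re)computes the records
        self._cache: Optional[List] = None
        self._use_cache = False

    def map(self, f):
        # Transformation: returns a NEW dataset; nothing is computed yet.
        return LazyDataset(lambda: (f(x) for x in self._iterate()))

    def filter(self, pred):
        # Transformation: also lazy, also returns a new dataset.
        return LazyDataset(lambda: (x for x in self._iterate() if pred(x)))

    def cache(self):
        # Mark the dataset so the first computation is kept in memory.
        self._use_cache = True
        return self

    def _iterate(self):
        if self._use_cache:
            if self._cache is None:
                self._cache = list(self._source())
            return iter(self._cache)
        return iter(self._source())

    def collect(self):
        # Action: this is the point where the chain is actually evaluated.
        return list(self._iterate())


reads = []  # records every time the underlying "storage" is read


def load():
    reads.append("read")
    return [1, 2, 3, 4, 5]


base = LazyDataset(load)
even_squares = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()
reads_before_action = len(reads)   # still 0: only transformations so far
first = even_squares.collect()     # triggers the read and the computation
second = even_squares.collect()    # served from the cache, no second read
```

The design point is that transformations only build up a recipe; the source is read once, when the first action runs, and the cached result is reused afterwards.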
Spark SQL
● Spark’s interface to work with structured or semistructured data
● Structured data
o known set of fields for each record - schema
● Main capabilities
o load data from a variety of structured sources
o query the data with SQL
o integration between Spark (Java, Scala and Python API) and SQL
(joining RDDs and SQL tables, using SQL functionality)
More than SQL
● Unified interface for structured data
SchemaRDD
● RDD of row objects, each representing a record
● Known schema (i.e. data fields) of its rows
● Behaves like a regular RDD, but is stored more efficiently
● Adds new operations, especially running SQL queries
● Can be created from
o external data sources
o results of queries
o regular RDD
● Used in ML Pipeline API
Getting Started
● Entry points:
o HiveContext
 superset of SQLContext functionality, adds Hive support
o SQLContext
● Loads input JSON file into SchemaRDD
● Uses context to execute query
Query Example
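The code shown on this slide did not survive extraction. The two steps it names (load JSON records into a table with a known schema, then run a SQL query through a context) can be sketched with the Python standard library. Note the assumptions: `sqlite3` stands in for the Spark SQL engine, and the `tweets` table, its fields, and the sample records are hypothetical.

```python
import json
import sqlite3

# Hypothetical JSON-lines input; each line is one record.
json_lines = [
    '{"author": "alice", "body": "spark sql", "retweets": 12}',
    '{"author": "bob", "body": "graphx", "retweets": 3}',
    '{"author": "carol", "body": "streaming", "retweets": 7}',
]
records = [json.loads(line) for line in json_lines]

# Structured data means a known set of fields per record; here the schema
# is simply inferred from the keys of the first record.
columns = sorted(records[0].keys())

# The in-memory connection plays the role of the SQL context.
conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE tweets ({', '.join(columns)})")
conn.executemany(
    f"INSERT INTO tweets VALUES ({', '.join('?' for _ in columns)})",
    [tuple(record[c] for c in columns) for record in records],
)

# Query the registered table with plain SQL.
top_tweets = conn.execute(
    "SELECT author, retweets FROM tweets ORDER BY retweets DESC LIMIT 2"
).fetchall()
```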
Loading and Saving Data
● Supports a number of structured data sources
o Apache Hive
 data warehouse infrastructure on top of Hadoop
 summarization, querying (SQL-like interface) and analysis
o Parquet
 column-oriented storage format in Hadoop ecosystem
 efficient storage of records with nested fields
o JSON
o RDDs
o JDBC/ODBC Server
 connecting Business Intelligence tools
 remote access to Spark cluster
GraphX
● New Spark API for graphs and graph-parallel computation
● Resilient Distributed Property Graph (RDPG, extends RDD)
o directed multigraph (parallel edges between the same pair of vertices are allowed)
o properties attached to each vertex and edge
● Common graph operations (subgraph computation, joining vertices, ...)
● Growing collection of graph algorithms
Motivation
● Growing scale and importance of graph data
● Application of data-parallel algorithms to graph computation is inefficient
● Graph-parallel systems (Pregel, PowerGraph, ...) designed for efficient
execution of graph algorithms
o do not address graph construction & transformation
o limited fault tolerance & data mining support
Performance Comparison
Property Graph
● Directed multigraph with user-defined objects attached to each vertex and edge
Triplet View
● Logical join of vertex and edge properties
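The triplet view can be pictured as a join of every edge with the properties of its two endpoints. A minimal plain-Python sketch follows; the `Triplet` type, the `triplets` helper, and the sample graph are illustrative assumptions, not GraphX code.

```python
from typing import Dict, List, NamedTuple, Tuple


class Triplet(NamedTuple):
    src_id: int
    src_attr: str
    dst_id: int
    dst_attr: str
    edge_attr: str


# Property graph: attributes on every vertex and every edge.
vertices: Dict[int, str] = {1: "alice", 2: "bob", 3: "carol"}
# Two parallel edges from 1 to 2, which a multigraph permits.
edges: List[Tuple[int, int, str]] = [
    (1, 2, "follows"),
    (2, 3, "follows"),
    (1, 2, "likes"),
]


def triplets(vertices, edges):
    """Join each edge with the attributes of its source and destination vertex."""
    return [
        Triplet(src, vertices[src], dst, vertices[dst], attr)
        for src, dst, attr in edges
    ]


triplet_view = triplets(vertices, edges)
```

Each edge yields exactly one triplet, so parallel edges show up as separate triplets with the same endpoints.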
Graph Operations
● Basic information (numEdges, numVertices, inDegrees, ...)
● Views (vertices, edges, triplets)
● Caching (persist, cache, ...)
● Transformation (mapVertices, mapEdges, ...)
● Structure modification (reverse, subgraph, ...)
● Neighbour aggregation (collectNeighbors, aggregateMessages, ...)
● Pregel API
● Graph builders (various I/O operations)
● ...
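A few of the operations listed above can be modelled on plain Python collections to show what they compute. This is a sketch under assumptions: the function names mirror the slide, but the signatures and the sample graph are invented here and differ from the real GraphX API.

```python
from collections import Counter

vertices = {1: "alice", 2: "bob", 3: "carol"}
edges = [
    (1, 2, "follows"),
    (2, 3, "follows"),
    (1, 2, "likes"),
    (3, 1, "follows"),
]


def in_degrees(edges):
    """Basic information: number of incoming edges per vertex.

    In this sketch, vertices without incoming edges are absent from the result.
    """
    return dict(Counter(dst for _, dst, _ in edges))


def reverse(edges):
    """Structure modification: flip the direction of every edge."""
    return [(dst, src, attr) for src, dst, attr in edges]


def subgraph(vertices, edges, vertex_pred):
    """Structure modification: keep matching vertices and the edges between them."""
    kept = {vid: attr for vid, attr in vertices.items() if vertex_pred(vid, attr)}
    kept_edges = [(s, d, a) for s, d, a in edges if s in kept and d in kept]
    return kept, kept_edges


degrees = in_degrees(edges)
reversed_edges = reverse(edges)
sub_vertices, sub_edges = subgraph(vertices, edges, lambda vid, attr: vid != 3)
```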
Graph Algorithms
● Built-in algorithms
o PageRank, Connected Components, Triangle Count, ...
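To give a feel for what one of the built-in algorithms computes, here is a textbook PageRank by power iteration in plain Python. It is an illustration only: the normalisation, the damping factor of 0.85, the handling of dangling vertices, and the sample graph are assumptions of this sketch and do not describe how GraphX implements the algorithm.

```python
def pagerank(edges, num_iter=50, damping=0.85):
    """Textbook PageRank on a list of (src, dst) edges; ranks sum to 1."""
    nodes = sorted({n for edge in edges for n in edge})
    out_degree = {n: 0 for n in nodes}
    for src, _ in edges:
        out_degree[src] += 1

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(num_iter):
        incoming = {n: 0.0 for n in nodes}
        # Every vertex splits its current rank evenly over its outgoing edges.
        for src, dst in edges:
            incoming[dst] += rank[src] / out_degree[src]
        # Rank held by vertices without outgoing edges is spread uniformly.
        dangling = sum(rank[n] for n in nodes if out_degree[n] == 0)
        rank = {
            n: (1 - damping) / len(nodes)
            + damping * (incoming[n] + dangling / len(nodes))
            for n in nodes
        }
    return rank


# Vertices 1 and 2 both link to 3; vertex 3 links back to 1.
ranks = pagerank([(1, 3), (2, 3), (3, 1)])
```

On this small graph, vertex 3 ends up with the highest rank because it is the only vertex with two incoming edges, and vertex 2 with the lowest because nothing links to it.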
Demo
Spark Streaming
● Scalable, high-throughput, fault-tolerant stream processing
Architecture
● Streams are chopped up into batches
● Each batch is processed in Spark
● Results pushed out in batches
Streaming Word Count
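The code from this slide was not preserved. The idea it illustrates (chop the stream into batches, then run an ordinary word count on each batch) can be sketched in plain Python. The `discretize` and `word_count` helpers, the one-second batch interval, and the sample events are hypothetical and not part of the Spark Streaming API.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def discretize(
    events: Iterable[Tuple[float, str]], batch_interval: float
) -> Iterator[List[str]]:
    """Chop a time-ordered stream of (timestamp, line) events into batches.

    Timestamps are seconds since the stream started. Quiet intervals
    produce empty batches.
    """
    batch, batch_end = [], batch_interval
    for timestamp, line in events:
        while timestamp >= batch_end:   # close every batch that has ended
            yield batch
            batch, batch_end = [], batch_end + batch_interval
        batch.append(line)
    yield batch                          # final, possibly partial batch


def word_count(batch: List[str]) -> Counter:
    """The per-batch job: an ordinary batch word count."""
    return Counter(word for line in batch for word in line.split())


events = [(0.2, "to be or"), (0.7, "not to be"), (1.1, "to do"), (3.4, "be")]
batches = list(discretize(events, batch_interval=1.0))
counts_per_batch = [dict(word_count(batch)) for batch in batches]
```

Results come out one batch at a time, which matches the architecture described above: the streaming job is a sequence of small batch jobs.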
StreamingContext
● Entry point for all streaming functionality
o define input sources
o stream transformations
o output operations to DStreams
o starts & stops streaming process
● Limitations
o once started, no new computations can be added
o once stopped, it cannot be restarted
o only one StreamingContext can be active per JVM
Discretized Streams
● DStream (discretized stream): the basic abstraction, represents a continuous stream of data
● Implemented as a series of RDDs
Stateless Transformations
● Processing of each batch does not depend on previous batches
● Transformation is separately applied to every batch
o map, flatMap, filter, reduce, groupByKey, …
● Combining data from multiple DStreams
o join, cogroup, union, ...
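The combining operations follow the definitions given in editor's note #29. Their per-batch behaviour can be sketched on plain Python lists of key-value pairs; the function bodies, the return shapes, and the sample data are assumptions of this sketch rather than the DStream implementation.

```python
from collections import defaultdict


def union(batch_a, batch_b):
    """All elements of both batches."""
    return batch_a + batch_b


def cogroup(batch_a, batch_b):
    """For (K, V) and (K, W) pairs: K -> (list of V, list of W)."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in batch_a:
        grouped[key][0].append(value)
    for key, value in batch_b:
        grouped[key][1].append(value)
    return dict(grouped)


def join(batch_a, batch_b):
    """For (K, V) and (K, W) pairs: every (K, (V, W)) combination per key."""
    return [
        (key, (v, w))
        for key, (vs, ws) in cogroup(batch_a, batch_b).items()
        for v in vs
        for w in ws
    ]


# One batch from each of two hypothetical streams covering the same interval.
clicks = [("u1", "home"), ("u1", "cart"), ("u2", "home")]
purchases = [("u1", 30), ("u3", 5)]

grouped = cogroup(clicks, purchases)
joined = join(clicks, purchases)
merged = union(clicks, purchases)
```

Because the operations are stateless, each pair of batches is processed on its own; keys seen in earlier batches have no influence on the result.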
Stateful Transformations
● Use data or intermediate results from previous batches to compute the
result of the current batch
● Windowed operations
o act over a sliding window of time periods
● updateStateByKey
o maintain state while continuously updating it with new information
● Require checkpointing
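Both kinds of stateful transformation can be modelled over a list of batches. In this plain-Python sketch the window length and slide interval are expressed in numbers of batches, and the helper names, signatures, and sample data are hypothetical. The update function takes the new values for a key and its previous state, and returning `None` drops the key.

```python
from collections import Counter, defaultdict, deque


def windowed_counts(batches, window_length, slide_interval):
    """Count items over a sliding window of the last window_length batches."""
    window = deque(maxlen=window_length)   # old batches fall out automatically
    results = []
    for i, batch in enumerate(batches, start=1):
        window.append(Counter(batch))
        if i % slide_interval == 0:        # emit a result once per slide
            total = Counter()
            for counts in window:
                total.update(counts)
            results.append(dict(total))
    return results


def update_state_by_key(batches, update):
    """Carry per-key state across batches; update(new_values, old_state)."""
    state = {}
    snapshots = []
    for batch in batches:
        new_values = defaultdict(list)
        for key, value in batch:
            new_values[key].append(value)
        # Every known key is updated, even if it has no new values in this batch.
        for key in set(state) | set(new_values):
            updated = update(new_values.get(key, []), state.get(key))
            if updated is None:
                state.pop(key, None)
            else:
                state[key] = updated
        snapshots.append(dict(state))
    return snapshots


word_batches = [["a", "b"], ["a"], ["b", "b"], ["c"]]
window_results = windowed_counts(word_batches, window_length=2, slide_interval=1)

pair_batches = [[("a", 1), ("b", 1)], [("a", 1)], [("b", 2)]]
running_totals = update_state_by_key(
    pair_batches, lambda new, old: (old or 0) + sum(new)
)
```

The state dictionary here lives only in memory. A real streaming job has to be able to recover such state after a failure, which is why these transformations require checkpointing.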
Output Operations
● Specify what needs to be done with the final transformed data
● Pushing to external DB, printing, …
● If not performed, DStream is not evaluated
Input Sources
● Built-in support for a number of different data sources
● Often in additional libraries (e.g. spark-streaming-kafka)
● HDFS
● Akka Actor Stream
● Apache Kafka
● Apache Flume
● Twitter Stream
● Kinesis
● Custom Sources
● ...
Demo
Conclusion
● RDD recap
● Spark Modules Overview
o Spark SQL
o GraphX
o Spark Streaming
Questions


Editor's Notes

  • #11: Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix. Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.
  • #16: Connected Components and PageRank algorithms. Source: https://amplab.cs.berkeley.edu/wp-content/uploads/2014/09/graphx.pdf. Quoted from the paper: "For Spark we implemented the algorithms both using idiomatic dataflow operators (Naive Spark, as described in Section 3.2) and using an optimized implementation (Optimized Spark) that eliminates movement of edge data by pre-partitioning the edges to match the partitioning adopted by GraphX. We have excluded Giraph and Optimized Spark from Figure 7c because they were unable to scale to the larger web-graph in the allotted memory of the cluster. While the basic Spark implementation did not crash, it was forced to re-compute blocks from disk and exceeded 8000 seconds per iteration. We attribute the increased memory overhead to the use of edge-cut partitioning and the need to store bi-directed edges and messages for the connected components algorithm."
  • #17: https://spark.apache.org/docs/latest/graphx-programming-guide.html
  • #29: cogroup - When called on DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples. join - When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key. union - Return a new DStream that contains the union of the elements in the source DStream and otherDStream.