Spark Concepts - Spark SQL, Graphx, Streaming

Apache Spark
Concepts - Spark SQL, GraphX, Streaming
Petr Zapletal Cake Solutions

Apache Spark and Big Data
1) History and market overview
2) Installation
3) MLlib and Machine Learning on Spark
4) Porting R code to Scala and Spark
5) Concepts - Spark SQL, GraphX, Streaming
6) Spark’s distributed programming model
7) Deployment

Table of contents
● Resilient Distributed Datasets
● Spark SQL
● GraphX
● Spark Streaming
● Q & A

Resilient Distributed Datasets
● Immutable, distributed collection of records
● Lazy evaluation, caching option, can be persisted
● Number of operations & transformations
● Can be created from data storage or different RDD

Spark SQL
● Spark’s interface to work with structured or semistructured data
● Structured data
o known set of fields for each record - schema
● Main capabilities
o load data from variety of structured sources
o query the data with SQL
o integration between Spark (Java, Scala and Python API) and SQL
(joining RDDs and SQL tables, using SQL functionality)

More than SQL
● Unified interface for structured data

SchemaRDD
● RDD of row objects, each representing a record
● Known schema (i.e. data fields) of its rows
● Behaves like regular RDD, stored in more efficient manner
● Adds new operations, especially running SQL queries
● Can be created from
o external data sources
o results of queries
o regular RDD
● Used in ML Pipeline API

Getting Started
● Entry points:
o HiveContext
 superset functionality, Hive related
o SQLContext

● Loads input JSON file into SchemaRDD
● Uses context to execute query
Query Example

Loading and Saving Data
● Supports number of structured data sources
o Apache Hive
 data warehouse infrastructure on top of Hadoop
 summarization, querying (SQL-like interface) and analysis
o Parquet
 column-oriented storage format in Hadoop ecosystem
 efficient storage of records with nested fields
o JSON
o RDDs
o JDBC/ODBC Server
 connecting Business Intelligence tools
 remote access to Spark cluster

GraphX
● New Spark API for graphs and graph-parallel computation
● Resilient Distributed Property Graph (RDPG, extends RDD)
o directed multigraph ( -> parallel edges)
o properties attached to each vertex and edge
● Common graph operations (subgraph computation, joining vertices, ...)
● Growing collection of graph algorithms

Motivation
● Growing scale and importance of graph data
● Application of data-parallel algorithms to graph computation is inefficient
● Graph-parallel systems (Pregel, PowerGraph, ...) designed for efficient
execution of graph algorithms
o do not address graph construction & transformation
o limited fault tolerance & data mining support

Property Graph
● Directed multigraph with user defined objects to each vertex and edge

Triplet View
● Logical join of vertex and edge properties

Graph Operations
● Basic information (numEdges, numVertices, inDegrees, ...)
● Views (vertices, edges, triplets)
● Caching (persist, cache, ...)
● Transformation (mapVertices, mapEdges, ...)
● Structure modification (reverse, subgraph, ...)
● Neighbour aggregation (collectNeighbours, aggregations, ...)
● Pregel API
● Graph builders (various I/O operations)
● ...

Graph Algorithms
● Built-in algorithms
o PageRank, Connected Components, Triangle Count, ...

Spark Streaming
● Scalable, high-throughput, fault-tolerant stream processing

Architecture
● Streams are chopped up into batches
● Each batch is processed in Spark
● Results pushed out in batches

StreamingContext
● Entry point for all streaming functionality
o define input sources
o stream transformations
o output operations to DStreams
o starts & stops streaming process
● Limitations
o once started, computations cannot be added
o cannot be restarted
o one active per JVM

Discretized Streams
● Basic abstraction, represents a continuous stream of data
● DStreams
● Implemented as series of RDDs

Stateless Transformations
● Processing of each batch does not depend on previous batches
● Transformation is separately applied to every batch
o Map, flatMap, filter, reduce, groupBy, …
● Combining data from multiple DStreams
o Join, cogroup, union, ...

Stateful Transformations
● Use data or intermediate results from previous batches to compute the
result of the current batch
● Windowed operations
o act over a sliding window of time periods
● UpdateStateByKey
o maintain state while continuously updating it with new information
● Require checkpointing

Output Operations
● Specify what needs to be done with the final transformed data
● Pushing to external DB, printing, …
● If not performed, DStream is not evaluated

Input Sources
● Built-in support for a number of different data sources
● Often in additional libraries (i.e. spark-streaming-kafka)
● HDFS
● Akka Actor Stream
● Apache Kafka
● Apache Flume
● Twitter Stream
● Kinesis
● Custom Sources
● ...

Conclusion
● RDD repetition
● Spark Modules Overview
o Spark SQL
o GraphX
o Spark Streaming

Spark Concepts - Spark SQL, Graphx, Streaming

More Related Content

What's hot (20)

Viewers also liked (20)

Similar to Spark Concepts - Spark SQL, Graphx, Streaming (20)

More from Petr Zapletal (11)

Recently uploaded (20)

Spark Concepts - Spark SQL, Graphx, Streaming

Editor's Notes