Introduction to Spark 2.0

Next Step in Spark Journey
https://github.com/phatak-dev/spark2.0-examples

● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com

Agenda
● Major focus in Spark 2.0
● Dataset abstraction
● Spark Session
● Dataset wordcount
● RDD to Dataset
● Dataset vs DataFrame APIs
● Time window
● Custom Optimizations

Major focus of Spark 2.0
● Standardizing on the Dataset abstraction
● Moving all Spark libraries to work well with the Dataset
abstraction
● Making all APIs available in all languages
● Planting seeds for future 2.x directions such as Structured
Streaming
● Performance improvements on the order of 10x

Dataset Abstraction
● A Dataset is a strongly typed collection of domain-specific
objects that can be transformed in parallel using
functional or relational operations. Each Dataset also
has an untyped view called a DataFrame, which is a
Dataset of Row
● An RDD, by contrast, represents an immutable, partitioned
collection of elements that can be operated on in parallel
● Unlike an RDD, a Dataset has its own DSL and runs on custom
memory management
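The typed and untyped views described above can be sketched as follows. This is a minimal illustration, not the deck's own example code: the `Person` class, the sample data, and the local master setting are all assumptions made for the sketch.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical domain class, used only for illustration.
case class Person(name: String, age: Int)

object DatasetAbstractionSketch {
  // Functional style: the lambda receives strongly typed Person objects.
  def adults(people: Dataset[Person]): Dataset[Person] =
    people.filter(p => p.age >= 18)

  // Relational style on the untyped view (DataFrame = Dataset[Row]).
  def adultNames(people: Dataset[Person]): DataFrame =
    people.toDF().where(col("age") >= 18).select("name")

  def main(args: Array[String]): Unit = {
    // Local master is an assumption for running on a single machine.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-abstraction-sketch")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Asha", 30), Person("Ravi", 12)).toDS()
    adults(people).show()
    adultNames(people).show()
    spark.stop()
  }
}
```

Both functions express the same filter; the first is checked against the `Person` type by the compiler, the second resolves the column name `age` only when the query is analyzed.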

Spark Session API
● New entry point in Spark for creating Datasets
● Replaces SQLContext and HiveContext; Structured Streaming also
starts from it, while the DStream API still uses StreamingContext
● Most programs only need to create a SparkSession, with no
separate SparkContext
● The move from SparkContext to SparkSession signifies a
move away from RDDs
● Ex : SparkSessionExample.scala
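As a rough sketch of the points above (not the contents of SparkSessionExample.scala), a program can build one SparkSession and reach everything else through it. The master URL and application name are assumptions for a local run.

```scala
import org.apache.spark.sql.SparkSession

object SparkSessionSketch {
  def main(args: Array[String]): Unit = {
    // Single entry point. Calling .enableHiveSupport() on the builder would
    // add Hive integration, provided the Hive classes are on the classpath.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("spark-session-sketch")
      .getOrCreate()

    // The session creates Datasets directly.
    val numbers = spark.range(0, 5)
    println(numbers.count())

    // The underlying SparkContext is still reachable when RDD code needs it.
    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))
    println(rdd.sum())

    spark.stop()
  }
}
```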

Mandatory Dataset WordCount
● Dataset provides a DSL very similar to that of RDD
● It combines the best of RDD and DataFrame in a single API
● DataFrame is now an alias for Dataset[Row]
● One of the big changes from the RDD API is the move away
from a key/value pair based API to a more SQL-like API
● Dataset signifies a departure from the well-known Map/Reduce-like
API towards a more optimized data handling DSL
● Ex : DataSetWordCount.scala
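A word count in the Dataset DSL might look like the sketch below. It is an illustration under assumed input, not the code in DataSetWordCount.scala; note that grouping uses `groupByKey` followed by `count` rather than the `reduceByKey` on key/value pairs familiar from RDDs.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object DatasetWordCountSketch {
  def wordCount(spark: SparkSession, lines: Dataset[String]): Dataset[(String, Long)] = {
    import spark.implicits._
    lines
      .flatMap(line => line.split("\\s+")) // split each line into words
      .filter(word => word.nonEmpty)       // drop empty tokens
      .groupByKey(word => word)            // group by the word itself
      .count()                             // yields (word, count) pairs
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-wordcount-sketch")
      .getOrCreate()
    import spark.implicits._

    // In-memory sample lines stand in for a text file (assumption).
    val lines = Seq("spark makes datasets", "datasets replace rdds").toDS()
    wordCount(spark, lines).show()
    spark.stop()
  }
}
```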

RDD to Dataset
● DataFrame lacked the functional programming aspects of
RDD, which made moving code from RDD to DataFrame more
challenging
● With Dataset, most RDD expressions can be expressed
just as easily and often more elegantly
● Though both are DSLs, they differ largely in
implementation
● Most Dataset operations run through code
generation and custom serialization
● Ex : RDDToDataset.scala
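To illustrate how closely RDD expressions carry over, the sketch below runs the same map and filter lambdas on an RDD and on a Dataset built from it. The numbers and function names are assumptions for illustration and are not taken from RDDToDataset.scala.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RddToDatasetSketch {
  // RDD version: plain Scala closures over serialized JVM objects.
  def doubledLargeRdd(numbers: RDD[Int]): Array[Int] =
    numbers.map(_ * 2).filter(_ > 4).collect()

  // Dataset version: the same lambdas, but values pass through Encoders.
  def doubledLargeDs(spark: SparkSession, numbers: RDD[Int]): Array[Int] = {
    import spark.implicits._
    numbers.toDS().map(_ * 2).filter(_ > 4).collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-to-dataset-sketch")
      .getOrCreate()

    val numbers = spark.sparkContext.parallelize(Seq(1, 2, 3, 4))
    println(doubledLargeRdd(numbers).mkString(","))
    println(doubledLargeDs(spark, numbers).mkString(","))
    spark.stop()
  }
}
```

The source code is nearly identical; the difference lies in how each version is executed underneath.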

Dataframe vs Dataset
● Most of the logical plans and optimizations of DataFrame
have now moved into Dataset
● DataFrame is now an untyped Dataset (Dataset[Row]), with no compile-time type for its rows
● One difference between Dataset and DataFrame is that Dataset
adds a step for serialization and for checking that the data
matches the expected schema
● This serialization differs from Java serialization and Kryo. It
uses Encoders, which rely on code generation
● Ex : DatasetVsDataframe.scala
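The schema check can be seen in a small sketch like the one below, which is an illustration rather than the code in DatasetVsDataframe.scala. The `Sale` class and its data are hypothetical. Converting a DataFrame with `as[Sale]` succeeds when the columns match the class, and fails when a required column is missing.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.util.Try

// Hypothetical record type, used only for illustration.
case class Sale(item: String, amount: Double)

object DatasetVsDataFrameSketch {
  def sales(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("pen", 2.5), ("book", 10.0)).toDF("item", "amount")
  }

  // Untyped: column names are strings, resolved when the query is analyzed.
  def bigItemsUntyped(df: DataFrame): Array[String] =
    df.where("amount > 5").select("item").collect().map(_.getString(0))

  // Typed: field access is checked by the Scala compiler.
  def bigItemsTyped(spark: SparkSession, df: DataFrame): Array[String] = {
    import spark.implicits._
    df.as[Sale].filter(sale => sale.amount > 5).map(sale => sale.item).collect()
  }

  // A DataFrame without the "amount" column cannot be read as Sale.
  def mismatchedSchemaFails(spark: SparkSession, df: DataFrame): Boolean = {
    import spark.implicits._
    Try(df.select("item").as[Sale].collect()).isFailure
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-vs-dataframe-sketch")
      .getOrCreate()
    val df = sales(spark)
    println(bigItemsUntyped(df).mkString(","))
    println(bigItemsTyped(spark, df).mkString(","))
    println(mismatchedSchemaFails(spark, df))
    spark.stop()
  }
}
```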

Catalog API
● In keeping with the theme of support for structured data, the
Catalog API brings support for managing external metastores from
Spark
● Highly useful for interactive programs like Zeppelin and
other notebooks
● Integrates well with Hive metastore
● Primarily used for DDL operations
● API is built on Dataset abstraction
● Ex : CatalogueExample.scala
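A few catalog calls are sketched below against a temporary view, so that no Hive metastore is needed. This is an illustration, not the code in CatalogueExample.scala, and the view name `people` and its data are assumptions. Because `listTables` returns a Dataset, the result can be collected or queried like any other.

```scala
import org.apache.spark.sql.SparkSession

object CatalogSketch {
  // Names of all temporary views currently registered in the session.
  def tempViewNames(spark: SparkSession): Seq[String] =
    spark.catalog.listTables().collect().filter(_.isTemporary).map(_.name).toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("catalog-sketch")
      .getOrCreate()
    import spark.implicits._

    // Register a hypothetical view so the catalog has something to list.
    Seq(("Asha", 30)).toDF("name", "age").createOrReplaceTempView("people")

    spark.catalog.listDatabases().show()
    println(tempViewNames(spark))

    // Cache management is also exposed through the catalog.
    spark.catalog.cacheTable("people")
    println(spark.catalog.isCached("people"))
    spark.catalog.uncacheTable("people")

    spark.catalog.dropTempView("people")
    spark.stop()
  }
}
```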

Time analysis
● One of the important parts of any data manipulation is
handling time effectively
● In earlier versions of Spark, only Spark Streaming
supported a notion of time
● As Spark 2.0 is trying to merge Spark Streaming with the
Dataset abstraction, time is now part of Spark SQL as well
● This is more powerful, as we can use the same time
abstraction in both batch and streaming operations
● Ex: TimeWindowExample.scala
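The time abstraction in Spark SQL is exposed through the `window` function. The sketch below groups a small batch of events into 10-minute tumbling windows; the event data and column names are assumptions, and this is not the code in TimeWindowExample.scala.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum, window}

object TimeWindowSketch {
  // Sum "value" per 10-minute tumbling window over the "time" column.
  def totalsPerWindow(events: DataFrame): DataFrame =
    events
      .groupBy(window(col("time"), "10 minutes"))
      .agg(sum("value").as("total"))
      .orderBy(col("window.start"))

  // Hypothetical events: two fall in one window, the third in the next.
  def sampleEvents(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (Timestamp.valueOf("2016-01-01 10:00:00"), 1),
      (Timestamp.valueOf("2016-01-01 10:04:00"), 2),
      (Timestamp.valueOf("2016-01-01 10:12:00"), 4)
    ).toDF("time", "value")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("time-window-sketch")
      .getOrCreate()
    totalsPerWindow(sampleEvents(spark)).show(truncate = false)
    spark.stop()
  }
}
```

The same `window` expression can be applied to a streaming Dataset, which is what makes the abstraction shared between batch and streaming.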

Plugging custom optimisations
● The Dataset abstraction runs on the same Catalyst optimizer as
DataFrame
● As Dataset becomes the platform-level abstraction, the
ability to control this optimizer becomes very important
● In earlier versions, one needed to change the Spark source
code to inject custom optimizations
● From Spark 2.0, the framework provides a public API to
inject custom optimizations without changing the source
code
● Ex : CustomOptimizationExample.scala
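One way to plug in a rule is through `spark.experimental.extraOptimizations`, sketched below with a hypothetical rule that rewrites `x * 1` to `x`. The rule name and the query are assumptions for illustration, not the code in CustomOptimizationExample.scala. The rule relies on Catalyst classes, which are internal and may change between Spark versions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal, Multiply}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: drop multiplications whose right operand is the literal 1.
object RemoveMultiplyByOne extends Rule[LogicalPlan] {
  private def isLiteralOne(e: Expression): Boolean = e match {
    case Literal(v: Long, _)   => v == 1L
    case Literal(v: Int, _)    => v == 1
    case Literal(v: Double, _) => v == 1.0
    case _                     => false
  }

  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case m: Multiply if isLiteralOne(m.right) => m.left
  }
}

object CustomOptimizationSketch {
  // True if any node of the plan still contains a Multiply expression.
  def hasMultiply(plan: LogicalPlan): Boolean =
    plan.find(_.expressions.exists(_.find(_.isInstanceOf[Multiply]).isDefined)).isDefined

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("custom-optimization-sketch")
      .getOrCreate()
    import spark.implicits._

    // Register the rule; it runs in addition to the built-in optimizer rules.
    spark.experimental.extraOptimizations = Seq(RemoveMultiplyByOne)

    val df = spark.range(3).select(($"id" * 1L).as("same"))
    println(df.queryExecution.analyzed)      // still contains the multiplication
    println(df.queryExecution.optimizedPlan) // multiplication removed
    spark.stop()
  }
}
```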

References
● http://blog.madhukaraphatak.com/categories/spark-two/
● https://www.brighttalk.com/webcast/12891/202021
● https://spark-summit.org/2016/schedule/
