Introduction to Spark 2.0
Next Step in Spark Journey
https://github.com/phatak-dev/spark2.0-examples
● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at
datamantra.io
● Consult in Hadoop, Spark
and Scala
● www.madhukaraphatak.com
Agenda
● Major focus in Spark 2.0
● Dataset abstraction
● Spark Session
● Dataset wordcount
● RDD to Dataset
● Dataset vs DataFrame APIs
● Time window
● Custom Optimizations
Major focus of Spark 2.0
● Standardizing on Dataset abstraction
● Moving all Spark libraries to work well with the Dataset
abstraction
● Making all APIs available in all languages
● Planting seeds for future 2.x directions such as Structured
Streaming
● Performance improvements on the order of 10x
Dataset Abstraction
● A Dataset is a strongly typed collection of domain-specific
objects that can be transformed in parallel using
functional or relational operations. Each Dataset also
has an untyped view called a DataFrame, which is a
Dataset of Row
● By comparison, an RDD represents an immutable, partitioned
collection of elements that can be operated on in parallel
● A Dataset has its own DSL and runs on custom memory
management
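To make the definition above concrete, here is a minimal sketch of the functional and relational styles on one Dataset, plus its untyped DataFrame view. It assumes Spark 2.x on the classpath and a local master; the `Sale` class and the helper names are hypothetical and are not taken from the author's example repository.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical domain class, used only for illustration.
case class Sale(item: String, amount: Double)

object DatasetAbstractionSketch {
  // Functional operation: a typed lambda over Sale objects.
  def largeSales(sales: Dataset[Sale]): Dataset[Sale] =
    sales.filter(s => s.amount > 8.0)

  // Relational operation: columns are referred to by name.
  def totalsByItem(sales: Dataset[Sale]): DataFrame =
    sales.groupBy("item").sum("amount")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run
      .appName("dataset-abstraction-sketch")
      .getOrCreate()
    import spark.implicits._

    val sales = Seq(Sale("pen", 10.0), Sale("book", 250.0), Sale("pen", 5.0)).toDS()

    largeSales(sales).show()
    totalsByItem(sales).show()

    // The untyped view of the same data: a Dataset of Row.
    val untyped: DataFrame = sales.toDF()
    untyped.printSchema()

    spark.stop()
  }
}
```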
Spark Session API
● New entry point in Spark for creating Datasets
● Subsumes SQLContext and HiveContext; StreamingContext is
still needed for the DStream API
● Most programs only need to create a SparkSession, with no
separate SparkContext
● The move from SparkContext to SparkSession signifies a
move away from RDD
● Ex : SparkSessionExample.scala
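As a rough illustration of the entry point described above (a sketch, not the contents of SparkSessionExample.scala), the following builds a session and creates a Dataset from it. The `buildSession` helper, the local master and the application name are assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession

object SparkSessionSketch {
  // Hypothetical helper: one builder call covers what previously needed
  // a SparkContext plus a SQLContext or HiveContext.
  def buildSession(appName: String): SparkSession =
    SparkSession.builder()
      .master("local[*]") // assumption: local run
      .appName(appName)
      // .enableHiveSupport() // takes the place of HiveContext; needs Hive classes on the classpath
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = buildSession("spark-session-sketch")

    // Datasets are created directly from the session.
    val numbers = spark.range(0, 5)
    println(numbers.count())

    // The underlying SparkContext is still reachable when RDD code needs it.
    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))
    println(rdd.count())

    spark.stop()
  }
}
```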
Mandatory Dataset WordCount
● Dataset provides a DSL very similar to that of RDD
● It combines the best of RDD and DataFrame in a single API
● DataFrame is now an alias for Dataset[Row]
● One big change from the RDD API is the move away
from a key/value pair based API to a more SQL-like API
● Dataset signifies a departure from the well-known Map/Reduce-like
API towards a more optimized data-handling DSL
● Ex : DataSetWordCount.scala
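The shift away from key/value pairs can be seen in a word count written only with Dataset operations. This is a minimal sketch assuming Spark 2.x; the `wordCount` helper and the in-memory input are hypothetical and stand in for reading a text file.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object DatasetWordCountSketch {
  // Counts words in the given lines using only Dataset operations.
  def wordCount(spark: SparkSession, lines: Seq[String]): Dataset[(String, Long)] = {
    import spark.implicits._
    val words: Dataset[String] = lines.toDS()
      .flatMap(line => line.split("\\s+"))
      .filter(word => word.nonEmpty)
    // groupByKey + count replaces the RDD-style map to (word, 1) followed by reduceByKey.
    words.groupByKey(identity).count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run
      .appName("dataset-wordcount-sketch")
      .getOrCreate()

    wordCount(spark, Seq("to be or", "not to be")).show()

    spark.stop()
  }
}
```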
RDD to Dataset
● DataFrame lacked the functional programming aspects of
RDD, which made moving code from RDD to DataFrame
challenging
● With Dataset, most RDD expressions can be
expressed easily and more elegantly
● Though both are DSLs, they differ largely in
implementation
● Most Dataset operations run through code
generation and custom serialization
● Ex : RDDToDataset.scala
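To show how closely the two DSLs line up, the sketch below writes the same filter-and-map pipeline once against an RDD and once against a Dataset, and converts an RDD into a Dataset. It assumes Spark 2.x; the helper names are hypothetical and this is not the author's RDDToDataset.scala.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object RddToDatasetSketch {
  // RDD version: runs the closures on plain JVM objects.
  def doubledEvensRdd(spark: SparkSession, xs: Seq[Int]): Seq[Int] =
    spark.sparkContext.parallelize(xs)
      .filter(x => x % 2 == 0)
      .map(x => x * 2)
      .collect().toSeq

  // Dataset version: same shape, but planned by Catalyst and backed by encoders.
  def doubledEvensDs(spark: SparkSession, xs: Seq[Int]): Seq[Int] = {
    import spark.implicits._
    xs.toDS()
      .filter(x => x % 2 == 0)
      .map(x => x * 2)
      .collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run
      .appName("rdd-to-dataset-sketch")
      .getOrCreate()
    import spark.implicits._

    val xs = Seq(1, 2, 3, 4, 5, 6)
    println(doubledEvensRdd(spark, xs))
    println(doubledEvensDs(spark, xs))

    // An existing RDD can be turned into a Dataset, and back again.
    val ds: Dataset[Int] = spark.sparkContext.parallelize(xs).toDS()
    println(ds.rdd.count())

    spark.stop()
  }
}
```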
Dataframe vs Dataset
● Most of the logical plans and optimizations of DataFrame
have moved into Dataset
● DataFrame is now an untyped Dataset (Dataset[Row]); its
schema is checked only at runtime
● One difference between Dataset and DataFrame is that Dataset
adds an additional step for serialization and for checking
that the data matches the expected schema
● This serialization is different from Java and Kryo serialization.
It is Spark's own Encoder-based serialization framework
● Ex : DatasetVsDataframe.scala
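The extra schema check can be illustrated with a small sketch, assuming Spark 2.x. The `Person` class, column names and helper functions are hypothetical. On a DataFrame a misspelled column only fails when the query is analyzed at runtime, whereas the typed view is tied to the case class through an Encoder.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}
import scala.util.Try

// Hypothetical domain class, used only for illustration.
case class Person(name: String, age: Long)

object DatasetVsDataFrameSketch {
  // Untyped API: "agee" is just a string, so the mistake surfaces at runtime.
  def misspelledColumnFails(df: DataFrame): Boolean =
    Try(df.select("agee")).isFailure

  // Typed API: the Encoder ties the rows to Person; p.agee would not compile.
  def adults(df: DataFrame): Dataset[Person] =
    df.as(Encoders.product[Person]).filter(p => p.age >= 18)

  // If the data lacks a field the class needs, the typed view cannot be used.
  def missingFieldFails(df: DataFrame): Boolean =
    Try(df.select("name").as(Encoders.product[Person]).collect()).isFailure

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run
      .appName("dataset-vs-dataframe-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("asha", 31L), ("ravi", 17L)).toDF("name", "age")

    println(misspelledColumnFails(df))
    adults(df).show()
    println(missingFieldFails(df))

    spark.stop()
  }
}
```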
Catalogue API
● In keeping with the theme of supporting structured data, the
catalogue API adds support for managing external metastores from
Spark
● Highly useful for interactive programs like Zeppelin and
other notebooks
● Integrates well with Hive metastore
● Primarily used for DDL operations
● The API is built on the Dataset abstraction
● Ex : CatalogueExample.scala
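A short sketch of what working with the catalogue looks like, assuming Spark 2.x, where the API is exposed as `spark.catalog`. The view name `sales` and the `tempViewNames` helper are hypothetical. Because the listing calls return Datasets, they can be inspected with the usual operations.

```scala
import org.apache.spark.sql.SparkSession

object CatalogSketch {
  // Names of the temporary views currently registered in this session.
  def tempViewNames(spark: SparkSession): Seq[String] =
    spark.catalog.listTables().collect().filter(_.isTemporary).map(_.name).toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run
      .appName("catalog-sketch")
      .getOrCreate()
    import spark.implicits._

    Seq(("pen", 10.0), ("book", 250.0)).toDF("item", "amount")
      .createOrReplaceTempView("sales")

    // Listing calls return Datasets, so show/filter/collect all work.
    spark.catalog.listDatabases().show()
    spark.catalog.listTables().show()
    println(tempViewNames(spark))

    // Cache management and cleanup also go through the catalog.
    spark.catalog.cacheTable("sales")
    println(spark.catalog.isCached("sales"))
    spark.catalog.dropTempView("sales")

    spark.stop()
  }
}
```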
Time analysis
● One of the important parts of any data manipulation is
handling time effectively
● In earlier versions of Spark, only Spark Streaming
supported a notion of time
● As Spark 2.0 is trying to merge Spark Streaming with the
Dataset abstraction, time is now part of Spark SQL as well
● This is more powerful, as the same time abstraction can be
used in both batch and streaming operations
● Ex: TimeWindowExample.scala
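As a sketch of time handling in batch mode, the following groups records into one-hour tumbling windows with the `window` function from `org.apache.spark.sql.functions`. It assumes Spark 2.x; the `Reading` class, the sample data and the `hourlyStats` helper are hypothetical. Window boundaries depend on the time zone in effect, so no expected output is shown.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{avg, col, count, window}

// Hypothetical record type, used only for illustration.
case class Reading(time: Timestamp, value: Double)

object TimeWindowSketch {
  // Groups readings into one-hour tumbling windows and aggregates each window.
  def hourlyStats(readings: Dataset[Reading]): DataFrame =
    readings
      .groupBy(window(col("time"), "1 hour"))
      .agg(count("value").as("n"), avg("value").as("avg_value"))
      .select(col("window.start").as("start"), col("window.end").as("end"),
        col("n"), col("avg_value"))
      .orderBy("start")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run
      .appName("time-window-sketch")
      .getOrCreate()
    import spark.implicits._

    val readings = Seq(
      Reading(Timestamp.valueOf("2016-07-01 10:05:00"), 10.0),
      Reading(Timestamp.valueOf("2016-07-01 10:40:00"), 20.0),
      Reading(Timestamp.valueOf("2016-07-01 12:15:00"), 30.0)
    ).toDS()

    hourlyStats(readings).show(false)

    spark.stop()
  }
}
```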
Plugging custom optimisations
● The Dataset abstraction runs on the same Catalyst optimiser as
DataFrame
● As Dataset is becoming the platform-level abstraction, the
ability to control this optimiser becomes very important
● In earlier versions, one needed to change the Spark source
code to inject custom optimizations
● From Spark 2.0, the framework provides a public API to
inject custom optimizations without changing the source
code
● Ex : CustomOptimizationExample.scala
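A minimal sketch of plugging in a rule, assuming Spark 2.x, where extra rules are registered through `spark.experimental.extraOptimizations`. The `MultiplyByOneRule` below is a hypothetical rule written for illustration: it rewrites `expr * 1.0` to `expr`. It depends on Catalyst's internal expression classes, which are not a stable API and may differ between Spark versions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal, Multiply}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.DoubleType

// Hypothetical rule: multiplying by the double literal 1.0 is a no-op.
object MultiplyByOneRule extends Rule[LogicalPlan] {
  private def isOne(e: Expression): Boolean = e match {
    case Literal(v: Double, DoubleType) => v == 1.0
    case _ => false
  }

  def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
    case m: Multiply if isOne(m.right) => m.left
  }
}

object CustomOptimizationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run
      .appName("custom-optimization-sketch")
      .getOrCreate()
    import spark.implicits._

    // Register the rule without touching the Spark source code.
    spark.experimental.extraOptimizations = Seq(MultiplyByOneRule)

    val df = Seq(1.5, 2.5).toDF("amount")
      .select((col("amount") * lit(1.0)).as("scaled"))

    // Prints the analyzed, optimized and physical plans for inspection.
    df.explain(true)

    spark.stop()
  }
}
```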
References
● http://blog.madhukaraphatak.com/categories/spark-two/
● https://www.brighttalk.com/webcast/12891/202021
● https://spark-summit.org/2016/schedule/
