Structuring Apache Spark
SQL, DataFrames, Datasets, and Streaming
Michael Armbrust (@michaelarmbrust)
Spark Summit 2016
Background: What is in an RDD?
• Dependencies
• Partitions (with optional locality info)
• Compute function: Partition => Iterator[T]

To Spark, the compute function is opaque computation, and the records it produces are opaque data: the engine cannot look inside either one to optimize.
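The three ingredients above can be sketched in plain Python. This is a hypothetical toy (`SimpleRDD` is not a Spark class), just to show why the engine is stuck: all it can do is call `compute`, a black box.

```python
from typing import Callable, Iterator, List

class Partition:
    """One slice of the data (locality info omitted in this sketch)."""
    def __init__(self, index: int, data: list):
        self.index = index
        self.data = data

class SimpleRDD:
    """Hypothetical minimal RDD: partitions + an opaque compute function."""
    def __init__(self, partitions: List[Partition],
                 compute: Callable[[Partition], Iterator]):
        self.partitions = partitions      # Partitions (with optional locality)
        self.compute = compute            # Compute function: Partition => Iterator[T]
        self.dependencies: list = []      # parent RDDs, if any

    def collect(self) -> list:
        # The engine can only invoke compute(); it cannot inspect it.
        out = []
        for p in self.partitions:
            out.extend(self.compute(p))
        return out

parts = [Partition(0, [1, 2]), Partition(1, [3, 4])]
rdd = SimpleRDD(parts, lambda p: (x * 10 for x in p.data))
```

Because the lambda is opaque bytecode, no optimizer could rewrite or push work into it; that limitation is what the structured APIs remove.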
Struc·ture [ˈstrək(t)SHər], verb
1. construct or arrange according to a plan; give a pattern or organization to.
Why structure?
• By definition, structure will limit what can be expressed.
• In practice, we can accommodate the vast majority of computations.

Limiting the space of what can be expressed enables optimizations.
Structured APIs In Spark

                  SQL        DataFrames     Datasets
Syntax errors     Runtime    Compile time   Compile time
Analysis errors   Runtime    Runtime        Compile time

Analysis errors are reported before a distributed job starts.
Datasets API
Type-safe: operate on domain objects with compiled lambda functions.
val df = spark.read.json("people.json")

// Convert data to domain objects.
case class Person(name: String, age: Int)
val ds: Dataset[Person] = df.as[Person]
ds.filter(_.age > 30)

// Compute histogram of age by name.
val hist = ds.groupBy(_.name).mapGroups {
  case (name, people: Iterator[Person]) =>
    val buckets = new Array[Int](10)
    people.map(_.age).foreach { a =>
      buckets(a / 10) += 1
    }
    (name, buckets)
}
DataFrame = Dataset[Row]
•Spark 2.0 unifies these APIs
•Stringly-typed methods will downcast to
generic Row objects
•Ask Spark SQL to enforce types on
generic rows using df.as[MyClass]
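The idea behind `df.as[MyClass]` can be sketched in plain Python. The `as_class` helper below is hypothetical (not a Spark API): it plays the role of the encoder check, matching each generic row's columns to the class's fields by name and type and failing fast instead of at first use.

```python
from dataclasses import dataclass, fields

@dataclass
class Person:
    name: str
    age: int

def as_class(rows, cls):
    """Hypothetical analogue of df.as[MyClass]: enforce field names/types."""
    typed = []
    for row in rows:                       # row: a generic dict-like Row
        kwargs = {}
        for f in fields(cls):
            value = row[f.name]            # KeyError if the column is missing
            if not isinstance(value, f.type):
                raise TypeError(f"column {f.name!r} is not {f.type.__name__}")
            kwargs[f.name] = value
        typed.append(cls(**kwargs))
    return typed

people = as_class([{"name": "Michael", "age": 35}], Person)
```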
What about ?
Some of the goals of the Dataset API have always been available!

df.map(lambda x: x.name)
df.map(x => x(0).asInstanceOf[String])
Shared Optimization & Execution

DataFrames, Datasets and SQL share the same optimization/execution pipeline:

SQL AST / DataFrame / Dataset
  -> Unresolved Logical Plan
  -> (Analysis, using the Catalog)   -> Logical Plan
  -> (Logical Optimization)          -> Optimized Logical Plan
  -> (Physical Planning)             -> Physical Plans
  -> (Cost Model selects one)        -> Selected Physical Plan
  -> (Code Generation)               -> RDDs
Structuring Computation

Columns
A new value, computed based on input values.

col("x") === 1                 (DSL)
df("x") === 1                  (DSL)
expr("x = 1")                  (SQL parser)
sql("SELECT … WHERE x = 1")    (SQL parser)
• 100+ native functions with optimized codegen implementations
  – String manipulation – concat, format_string, lower, lpad
  – Date/Time – current_timestamp, date_format, date_add, …
  – Math – sqrt, randn, …
  – Other – monotonicallyIncreasingId, sparkPartitionId, …
Complex Columns With Functions

# Python
from pyspark.sql.functions import *
yesterday = date_sub(current_date(), 1)
df2 = df.filter(df.created_at > yesterday)

// Scala
import org.apache.spark.sql.functions._
val yesterday = date_sub(current_date(), 1)
val df2 = df.filter(df("created_at") > yesterday)
Functions

You type:  (x: Int) => x == 1
Spark sees an opaque closure:
  class $anonfun$1 {
    def apply(Int): Boolean
  }

Columns

You type:  col("x") === 1
Spark sees an inspectable expression:
  EqualTo(x, Lit(1))
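How a column DSL yields an inspectable tree can be sketched in plain Python (these `Col`/`Lit`/`EqualTo` classes are hypothetical stand-ins for Catalyst's, not Spark code): overloading the comparison operator builds an expression node instead of evaluating anything.

```python
class Expr:
    pass

class Col(Expr):
    """A column reference; comparing it builds an expression tree."""
    def __init__(self, name):
        self.name = name
    def __eq__(self, other):
        rhs = other if isinstance(other, Expr) else Lit(other)
        return EqualTo(self, rhs)
    __hash__ = None  # __eq__ no longer returns bool, so disable hashing

class Lit(Expr):
    """A literal value lifted into the expression language."""
    def __init__(self, value):
        self.value = value

class EqualTo(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right

# Spark sees EqualTo(Col(x), Lit(1)), not compiled bytecode.
pred = Col("x") == 1
```

(The Scala API uses `===` for the same reason: plain `==` is already taken by object equality.)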
Columns: Predicate pushdown

You write:
spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "people")
  .load()
  .where($"name" === "michael")

Spark translates, for Postgres:
SELECT * FROM people WHERE name = 'michael'
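The translation step can be sketched in plain Python. This is a hypothetical miniature of what a JDBC source does (not Spark's actual code): because the filter is an expression tree rather than a closure, it can be printed back out as a WHERE clause and executed by the database.

```python
class Col:
    def __init__(self, name):
        self.name = name

class Lit:
    def __init__(self, value):
        self.value = value

class EqualTo:
    def __init__(self, left, right):
        self.left, self.right = left, right

def to_sql(expr) -> str:
    """Render an expression tree as SQL text (quoting is simplified)."""
    if isinstance(expr, Col):
        return expr.name
    if isinstance(expr, Lit):
        v = expr.value
        return f"'{v}'" if isinstance(v, str) else str(v)
    if isinstance(expr, EqualTo):
        return f"{to_sql(expr.left)} = {to_sql(expr.right)}"
    raise ValueError("cannot push down this expression")

query = "SELECT * FROM people WHERE " + to_sql(EqualTo(Col("name"), Lit("michael")))
```

An opaque lambda offers no such tree, so the source would have to ship every row back to Spark and filter there.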
Columns: Efficient Joins

// Join on a column expression: Spark can pick a SortMergeJoin,
// O(n log n), because equal values sort to the same place.
df1.join(df2, col("x") == col("y"))

// Join on an opaque UDF: Spark must fall back to a Cartesian
// product followed by a filter, O(n^2).
myUDF = udf(lambda x, y: x == y)
df1.join(df2, myUDF(col("x"), col("y")))
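The gap between the two plans can be sketched in plain Python (hypothetical toy implementations, not Spark's): a sort-merge equi-join walks two sorted lists once, while an opaque predicate forces every pair to be tested.

```python
def sort_merge_join(left, right):
    """left/right: lists of (key, value). O(n log n): sort, then one merge pass."""
    left, right = sorted(left), sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Equal values sorted to the same place: emit all pairs for this key.
            j2 = j
            while j2 < len(right) and right[j2][0] == lk:
                out.append((left[i][1], right[j2][1]))
                j2 += 1
            i += 1
    return out

def udf_join(left, right, udf):
    """Opaque predicate: nothing to sort on, so Cartesian product + filter, O(n^2)."""
    return [(lv, rv) for _, lv in left for _, rv in right if udf(lv, rv)]
```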
Structuring Data

Spark's Structured Data Model
• Primitives: Byte, Short, Integer, Long, Float, Double, Decimal, String, Binary, Boolean, Timestamp, Date
• Array[Type]: variable-length collection
• Struct: fixed number of nested columns with fixed types
• Map[Type, Type]: variable-length association
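A schema in this model is itself just data, which is what lets the engine plan against it. A minimal plain-Python sketch (hypothetical classes, loosely echoing Spark's `StructType`/`StructField` names):

```python
class ArrayType:
    """Variable-length collection of one element type."""
    def __init__(self, element):
        self.element = element

class MapType:
    """Variable-length association from key type to value type."""
    def __init__(self, key, value):
        self.key, self.value = key, value

class StructField:
    def __init__(self, name, dtype):
        self.name, self.dtype = name, dtype

class StructType:
    """Fixed number of named, typed columns."""
    def __init__(self, fields):
        self.fields = fields
    def column_names(self):
        return [f.name for f in self.fields]

person = StructType([
    StructField("name", str),                  # primitive
    StructField("languages", ArrayType(str)),  # variable-length collection
    StructField("zip", int),
])
```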
Tungsten's Compact Encoding

The row (123, "data", "bricks") is laid out as:

  0x0 | 123 | 32L | 48L | 4 "data" | 6 "bricks"

i.e. a null bitmap (0x0), the fixed-width value 123 stored inline, offsets to the variable-length data (32L, 48L), then each field's length and bytes.
Encoders

Encoders translate between domain objects and Spark's internal format:

  JVM object:              MyClass(123, "data", "bricks")
  Internal representation: 0x0 | 123 | 32L | 48L | 4 "data" | 6 "bricks"
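A loose plain-Python sketch of the encoding side, using `struct` to pack a row into bytes. This is a simplification, not the real UnsafeRow format (which 8-byte-aligns the variable region and handles nulls per field); here strings are stored as one 64-bit word holding `offset << 32 | length`.

```python
import struct

def encode_row(fields):
    """Pack a row: 8-byte null bitmap, one 8-byte word per field,
    then the variable-length bytes. Simplified sketch of Tungsten's idea."""
    n = len(fields)
    fixed = 8 + 8 * n                      # bitmap + one word per field
    words, var = [], b""
    for f in fields:
        if isinstance(f, int):
            words.append(f)                # fixed-width value stored inline
        else:
            data = f.encode("utf-8")
            offset = fixed + len(var)      # where this field's bytes live
            words.append((offset << 32) | len(data))
            var += data
    # bitmap = 0: all fields non-null in this sketch
    return struct.pack(f"<q{n}q", 0, *words) + var

row = encode_row([123, "data", "bricks"])
```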
Bridge Objects with Data Sources

// JSON record:
{
  "name": "Michael",
  "zip": "94709",
  "languages": ["scala"]
}

// Domain object:
case class Person(
  name: String,
  languages: Seq[String],
  zip: Int)

Encoders map columns to fields by name, across sources such as JSON and JDBC.
Space Efficiency
[benchmark chart]

Serialization performance
[benchmark chart]
Operate Directly On Serialized Data

DataFrame code / SQL:
df.where(df("year") > 2015)

Catalyst expression:
GreaterThan(year#234, Literal(2015))

Generated low-level bytecode:
bool filter(Object baseObject) {
  int offset = baseOffset + bitSetWidthInBytes + 3*8L;
  int value = Platform.getInt(baseObject, offset);
  return value > 2015;
}

Platform.getInt(baseObject, offset) is a JVM intrinsic, JIT-ed to pointer arithmetic.
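The same trick can be sketched in plain Python (a hypothetical toy, not Spark's generated code): evaluate the comparison by reading one integer at a known offset in the serialized bytes, never materializing the row as an object. For simplicity this sketch uses 64-bit slots throughout.

```python
import struct

BITMAP = 8                     # null-bitmap width in bytes

def make_row(*ints):
    """Serialize a row of ints: 8-byte bitmap, then one 8-byte word each."""
    return struct.pack(f"<q{len(ints)}q", 0, *ints)

def year_filter(row_bytes, field_index):
    """Analogue of the generated filter: read the field in place, compare."""
    offset = BITMAP + 8 * field_index          # baseOffset + bitSet + i*8L
    (value,) = struct.unpack_from("<q", row_bytes, offset)
    return value > 2015

rows = [make_row(2014, 1), make_row(2016, 2), make_row(2017, 3)]
kept = [r for r in rows if year_filter(r, 0)]
```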
Structured Streaming

The simplest way to perform streaming analytics is not having to reason about streaming.

Apache Spark 1.3: Static DataFrames
Apache Spark 2.0: Continuous DataFrames
Single API!
Structured Streaming
• High-level streaming API built on the Apache Spark SQL engine
• Runs the same queries on DataFrames
• Event time, windowing, sessions, sources & sinks
• Unifies streaming, interactive and batch queries
  • Aggregate data in a stream, then serve using JDBC
  • Change queries at runtime
  • Build and apply ML models
Example: Batch Aggregation

logs = spark.read.format("json").load("s3://logs")

logs.groupBy(logs.user_id).agg(sum(logs.time))
    .write.format("jdbc")
    .save("jdbc:mysql://...")

Example: Continuous Aggregation

logs = spark.read.format("json").stream("s3://logs")

logs.groupBy(logs.user_id).agg(sum(logs.time))
    .write.format("jdbc")
    .stream("jdbc:mysql://...")
Logically: DataFrame operations on static data (i.e. as easy to understand as batch).

Physically: Spark automatically runs the query in streaming fashion (i.e. incrementally and continuously).

DataFrame -> Logical Plan -> Catalyst optimizer -> continuous, incremental execution
Incrementalized By Spark

Batch:       Scan Files     -> Aggregate          -> Write to MySQL
Continuous:  Scan New Files -> Stateful Aggregate -> Update MySQL

The transformation requires information about the structure of the query.
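The "Stateful Aggregate" step can be sketched in plain Python (a hypothetical toy, not Spark's state store): keep running totals per key and update them as each new batch of records arrives, instead of rescanning everything.

```python
class StatefulSum:
    """Incremental groupBy(user_id).agg(sum(time)): state survives batches."""
    def __init__(self):
        self.totals = {}                  # user_id -> running sum of time

    def update(self, batch):
        """batch: iterable of (user_id, time) from newly scanned files.
        Returns the updated totals (what would be upserted into MySQL)."""
        for user_id, t in batch:
            self.totals[user_id] = self.totals.get(user_id, 0) + t
        return dict(self.totals)

agg = StatefulSum()
agg.update([("u1", 10), ("u2", 5)])       # first batch of scanned files
result = agg.update([("u1", 3)])          # only the new file is scanned
```

The engine can derive this incremental form only because the query is a structured aggregation, not an opaque function over the whole input.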
What's Coming?
• Apache Spark 2.0
  • Unification of the DataFrame/Dataset & *Context APIs
  • Basic streaming API
  • Event-time aggregations
• Apache Spark 2.1+
  • Other streaming sources / sinks
  • Machine learning
  • Watermarks
  • Structure in other libraries: MLlib, GraphFrames
Questions?
@michaelarmbrust
