Ioana Delaney, Jia Li
Spark Technology Center, IBM
Extending Spark SQL Data Sources APIs with Join Push Down
#EUdev7
About the speakers
• Ioana Delaney
– Spark Technology Center, IBM
– DB2 Optimizer developer working in the areas of query semantics, query rewrite, and optimization
– Worked on various releases of DB2 LUW and DB2 with BLU Acceleration
– Apache Spark SQL Contributor
• Jia Li
– Spark Technology Center, IBM
– Apache Spark SQL Contributor
– Worked on various releases of IBM BigInsights and IBM Optim Query Workload
Tuner
IBM Spark Technology Center
• Founded in 2015
• Location: 505 Howard St., San Francisco
• Web: http://spark.tc
• Twitter: @apachespark_tc
• Mission:
– Contribute intellectual and technical capital to
the Apache Spark community.
– Make the core technology enterprise and
cloud-ready.
– Build data science skills to drive intelligence
into business applications
http://bigdatauniversity.com
Motivation
• Spark applications often directly query external data sources such as relational
databases or files
• Spark provides Data Sources APIs for accessing structured data through Spark SQL
• Support optimizations such as Filter push down and Column pruning, i.e. a subset of the query functionality can be pushed down to some data sources
• This work extends the Data Sources APIs with join push down
• Conceptually, join push down is nothing more than Selection and Projection push down
• It significantly improves query performance by reducing the amount of data transferred and by exploiting the capabilities of the data source, such as index access
Data Sources APIs
• Mechanism for accessing external data sources
• Built-in libraries (Hive, Parquet, JSON, JDBC, etc.) and external, third-party libraries available through Spark Packages
• Users can read/save data through the generic load()/save() functions, or as relational SQL tables, i.e. DataFrames with persistent metadata and names (see the sketch after this slide)
• Developers can build new libraries for various data sources, i.e. to create a new data source one needs to specify the schema and how to get the data
• Tightly integrated with Spark’s Catalyst Optimizer by allowing optimizations to be
pushed down to the data source
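As a minimal sketch of the existing read path, the snippet below (Scala) loads a JDBC table through the Data Sources APIs and shows column pruning and filter push down in the physical plan. It assumes an active SparkSession named spark; the connection URL, credentials, and table name are illustrative.

import org.apache.spark.sql.functions.col

val storeSales = spark.read
  .format("jdbc")
  .option("url", "jdbc:db2://dbhost:50000/TPCDS")   // illustrative connection details
  .option("dbtable", "TPCDS.STORE_SALES")
  .option("user", "user")
  .option("password", "password")
  .load()

// Only the selected columns are read and the filter is evaluated at the
// data source: the plan shows a pruned ReadSchema and PushedFilters.
storeSales
  .select("ss_store_sk", "ss_net_profit")
  .filter(col("ss_net_profit") > 0)
  .explain()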
Use Case 1: Enterprise-wide data pipeline using Spark
• As data gets distributed from different data
sources throughout an organization, there is
a need to create a pipeline to connect these
sources.
• Spark framework extracts the records from
the backing store, caches the DataFrame
results, and then performs joins with the
updates coming from the stream
• e.g. Spark at Bloomberg – Summit 2016
Diagram: backing data store → cached DataFrame, joined with real-time stream updates U1, U2, U3, … (Avro de-serialize)
Use Case 2: SQL Acceleration with Spark
• e.g. RDBMS MPP cluster nodes are overlaid
with local Spark executor processes
• The data frames in Spark are derived from
the existing data partitions of the database
cluster.
• The co-location of executors with the
database engine processes minimizes the
latency of accessing the data
• IBM’s dashDB and other vendors provide
SQL acceleration through co-location or
other Spark integrated architectures
Use Case 3: ETL from legacy data sources
• Common pattern in Big Data
space is offloading relational
databases and files into
Hadoop and other data sources
Diagram: JDBC data source → Extract, Transform, Load → Hadoop files, Cassandra
Spark as a Unified Analytics Platform for Data Federation
• Ingests data from disparate data sources and performs fast in-memory analytics on their combined data
• Supports a wide variety of structured and unstructured data sources
• Applies analytics everywhere without the need for data centralization
Challenges: Network speed
• e.g. large tables in DBMS and selective join in Spark
• Spark reads the entire table, applies some predicates at the data source, and
executes the join locally
• Instead, push the join execution to the data source
– Reduce the amount of data transfer
– The query runs 40x faster!
Challenges: Exploit data source capabilities
• e.g. DBMS has indexes
• Join execution in Spark ~ O(|L|), i.e. proportional to the size of the large table
• Join execution in DBMS ~ O(|S|), i.e. proportional to the size of the small table
• Instead, push the join execution to the data source
– Efficient join execution using index access
– Reduce the data transfer
– The query runs 100x faster!
Join push down to the data source
• Alternative execution strategy in Spark
• Nothing more than Selection and Projection push down
• May reduce the amount of data transfer
• Provides Spark with the functions of the underlying data source
• Small API changes with significant performance improvement
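For contrast, the sketch below (Scala) shows the same pattern with today's APIs, where each table is read through its own JDBC relation and the join executes in Spark; with join push down, the join would instead be evaluated by the database and only the join result transferred. It assumes an active SparkSession named spark, and the connection details and table names are illustrative.

val connOpts = Map(
  "url" -> "jdbc:db2://dbhost:50000/TPCDS",   // illustrative
  "user" -> "user",
  "password" -> "password")

val storeSales = spark.read.format("jdbc")
  .options(connOpts + ("dbtable" -> "TPCDS.STORE_SALES")).load()
val dateDim = spark.read.format("jdbc")
  .options(connOpts + ("dbtable" -> "TPCDS.DATE_DIM")).load()

// Today: the filters on date_dim are pushed down, but every qualifying
// store_sales row is transferred and the join runs in Spark.
val storeProfit = storeSales
  .join(dateDim, storeSales("ss_sold_date_sk") === dateDim("d_date_sk"))
  .where(dateDim("d_year") === 2001 && dateDim("d_moy") === 4)
  .groupBy(storeSales("ss_store_sk"))
  .sum("ss_net_profit")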
Push down based on cost vs. heuristics
• Acceptable performance is the most significant concern about data federation
• Query Optimizer determines the best execution of a SQL query e.g. implementation
of relational operators, order of execution, etc.
• In a federated environment, the optimizer must also decide whether the different
operations should be done by the federated server, e.g. Spark, or by the data source
– Needs knowledge of what each data source can do (e.g. file system vs. RDBMS), and how
much it costs (e.g. statistics from data source, network speed, etc.)
• Spark’s Catalyst Optimizer uses a combination of heuristics and cost model
• Cost model is an evolving feature in Spark
• Until federated cost model is fully implemented, use safe heuristics
Minimize data transfer
Table diagrams: store_sales joined with date_dim and store in N:1 fact-to-dimension relationships (filtering joins, e.g. star-joins); and the same schema with store_returns added in an M:N relationship (expanding joins).
select s_store_id, s_store_name, sum(ss_net_profit) as store_sales_profit
from store_sales, date_dim, store
where d_moy = 4 and d_year = 2001 and
      d_date_sk = ss_sold_date_sk and
      s_store_sk = ss_store_sk
group by s_store_id, s_store_name
order by s_store_id, s_store_name
limit 100

select s_store_id, s_store_name, sum(ss_net_profit) as store_sales_profit, sum(ss_net_loss) as store_loss
from store_sales, date_dim, store, store_returns
where d_moy = 4 and d_year = 2001 and
      d_date_sk = ss_sold_date_sk and
      s_store_sk = ss_store_sk and
      ss_ticket_number = sr_ticket_number and
      ss_item_sk = sr_item_sk
group by s_store_id, s_store_name
order by s_store_id, s_store_name
limit 100
Star-schema joins
• Joins among tables in a star-schema relationship
• Star-schema is the simplest form of a data warehouse schema
• Star-schema model consists of a fact table referencing a number of dimension tables
• Fact and dimension tables are in a primary key – foreign key relationship.
• Star joins are filters applied to the fact table
• Catalyst recognizes star-schema joins
• We use this information to detect and push down star joins to the data source
Join re-ordering based on data source
• Maximize the amount of functionality that can be pushed down to the data source
• Extends Catalyst’s join enumeration rules to re-order joins based on data source
• Important alternative execution plan for global federated optimization
select s_store_id, s_store_name, sum(ss_net_profit) as store_sales_profit
from store_sales, customer, customer_address, date_dim, store
where d_moy = 4 and d_year = 2001 and
      d_date_sk = ss_sold_date_sk and
      s_store_sk = ss_store_sk
group by s_store_id, s_store_name order by s_store_id, s_store_name
limit 100
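For reference, Spark 2.2's Catalyst already exposes star-schema detection and cost-based join re-ordering behind configuration flags; the data-source-aware re-ordering described here extends that join enumeration. The settings below are existing Apache Spark options, not part of this proposal, and assume table statistics have been collected.

// Existing Spark 2.2 settings (Scala), shown only as context for this work:
spark.conf.set("spark.sql.cbo.enabled", "true")              // cost-based optimization
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")  // CBO join re-ordering
spark.conf.set("spark.sql.cbo.starSchemaDetection", "true")  // star-schema aware join enumeration
spark.sql("ANALYZE TABLE store_sales COMPUTE STATISTICS")    // collect statistics per table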
Managing parallelism: Single vs. multiple partitions join
• Partitions are the basic unit of parallelism in Spark
• JDBC options to specify data partitioning: partitionColumn, lowerBound, upperBound, and numPartitions
• Spark splits the table read across numPartitions tasks, with a stride derived from lowerBound, upperBound, and numPartitions
CREATE TABLE STORE_SALES
USING org.apache.spark.sql.jdbc
OPTIONS (. . .
  dbtable 'TPCDS.STORE_SALES',
  partitionColumn 'ss_sold_date_sk',
  lowerBound '2415022',
  upperBound '2488020',
  numPartitions '10'
  . . . )
select d_moy, sum(ss_net_profit) as store_sales_profit
from store_sales, date_dim
where d_year = 2001 and
d_date_sk = ss_sold_date_sk
group by d_moy
order by d_moy
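The same partitioned read can be expressed through the DataFrameReader API; the sketch below (Scala) uses the standard jdbc() overload that takes the partitioning column, bounds, and number of partitions. It assumes an active SparkSession named spark; the connection URL and credentials are illustrative.

import java.util.Properties

val props = new Properties()
props.setProperty("user", "user")          // illustrative credentials
props.setProperty("password", "password")

// Spark issues numPartitions parallel queries, each covering a stride of
// ss_sold_date_sk derived from lowerBound, upperBound, and numPartitions.
val storeSales = spark.read.jdbc(
  url = "jdbc:db2://dbhost:50000/TPCDS",
  table = "TPCDS.STORE_SALES",
  columnName = "ss_sold_date_sk",
  lowerBound = 2415022L,
  upperBound = 2488020L,
  numPartitions = 10,
  connectionProperties = props)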
How to partition when joins are pushed down?
1) Push down partitioned, co-located joins to
data source
– A co-located join occurs locally on the
database partition on which the data
resides.
– Partitioning in Spark aligns with the data
source partitioning
– Hard to achieve for multi-way joins
– Cost based decision
How to partition when joins are pushed down?
2) Perform partitioning in Spark i.e. choose a join method that favors parallelism
– Broadcast Hash Join is the preferred method when one of the tables is small
– If the large table comes from a single partition JDBC connection, the execution is serialized
on that single partition
– In such cases, Shuffle Hash Join and Sort Merge Join may outperform Broadcast Hash Join
Plan diagrams: Broadcast Hash Join into a single partition vs. Shuffle Hash Join vs. Sort Merge Join
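When the join stays in Spark, the join method can be steered with existing mechanisms; the sketch below (Scala) shows the standard broadcast hint and the setting that disables automatic broadcast so that a shuffle-based join is chosen instead. It reuses the illustrative storeSales and dateDim DataFrames from the earlier sketches.

import org.apache.spark.sql.functions.broadcast

// Broadcast Hash Join: preferred when the dimension table is small.
val bhj = storeSales.join(broadcast(dateDim),
  storeSales("ss_sold_date_sk") === dateDim("d_date_sk"))

// If storeSales comes from a single-partition JDBC read, the broadcast join
// probes that one partition serially. Disabling automatic broadcast forces a
// shuffle, so both sides are repartitioned and joined in parallel.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
val smj = storeSales.join(dateDim,
  storeSales("ss_sold_date_sk") === dateDim("d_date_sk"))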
1 TB TPC-DS Performance Results
• Proxy of a real data warehouse
• Retail product supplier e.g. retail sales, web, catalog, etc.
• Ad-hoc, reporting, and data mining type of queries
• Mix of two data sources: IBM DB2/JDBC and Parquet
Cluster: 4-node cluster, each node having:
12 x 2 TB disks,
Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 128 GB RAM
Number of cores: 48
Apache Hadoop 2.7.3, Apache Spark 2.2 main (August, 2017)
Database info:
Schema: TPCDS
Scale factor: 1TB total space
Mix of Parquet and DB2/JDBC data sources
DB2 DPF info: 4-node cluster, each node having:
10 x 2 TB disks,
Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 128 GB RAM
Number of cores: 48
TPC-DS Query | spark-2.2 (mins) | spark-2.2-jpd (mins) | Speedup
Q8           | 32               | 4                    | 8x
Q13          | 121              | 5                    | 25x
Q15          | 4                | 2                    | 2x
Q17          | 77               | 7                    | 11x
Q19          | 42               | 4                    | 11x
Q25          | 153              | 7                    | 21x
Q29          | 81               | 7                    | 11x
Q42          | 31               | 3                    | 10x
Q45          | 14               | 3                    | 4x
Q46          | 61               | 5                    | 12x
Q48          | 155              | 5                    | 31x
Q52          | 31               | 5                    | 6x
Q55          | 31               | 4                    | 8x
Q68          | 69               | 4                    | 17x
Q74          | 47               | 23                   | 2x
Q79          | 63               | 4                    | 15x
Q85          | 22               | 2                    | 11x
Q89          | 55               | 4                    | 14x
Data Sources APIs for reading the data
• BaseRelation: The abstraction of a collection of tuples read from the data
source. It provides the schema of the data.
• TableScan: Reads all the data in the data source.
• PrunedScan: Eliminates unneeded columns at the data source.
• PrunedFilteredScan: Applies predicates at the data source.
• PushDownScan: New API that applies complex operations such as joins at the data source.
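As a minimal sketch of the existing interfaces, the relation below (Scala) implements PrunedFilteredScan, so Catalyst hands it the required columns and the filters it was able to push down; the class name, schema, and empty result are made up for illustration.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

class ExampleRelation(override val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  // Schema of the tuples produced by this data source.
  override def schema: StructType = StructType(Seq(
    StructField("id", IntegerType),
    StructField("name", StringType)))

  // A real implementation would translate `filters` into the source's query
  // language and read only `requiredColumns`; here it just returns no rows.
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] =
    sqlContext.sparkContext.emptyRDD[Row]
}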
PushDownScan APIs
• Trait for a BaseRelation
• Used with data sources that support complex functionality such as joins
• Extends PrunedFilteredScan with DataSourceCapabilities
• Methods that need to be overridden:
– def buildScan(columns: Array[String],
filters: Array[Filter],
tables: Seq[BaseRelation]):RDD[Row]
– def getDataSource(): String
• DataSourceCapabilities is a trait that models data source characteristics, e.g. the join types supported
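PushDownScan and DataSourceCapabilities are the APIs proposed in this talk and are not part of Apache Spark, so the following Scala sketch is purely hypothetical: the trait shapes mirror the slide, while the supportedJoinTypes member, the ExampleJdbcRelation class, and the use of the connection URL as the data source identifier are illustrative assumptions.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical capability descriptor, e.g. which join types the source supports.
trait DataSourceCapabilities {
  def supportedJoinTypes: Seq[String] = Seq("inner")
}

// Hypothetical trait mirroring the proposed API: extends PrunedFilteredScan
// with DataSourceCapabilities, as described on the slide.
trait PushDownScan extends PrunedFilteredScan with DataSourceCapabilities {
  def buildScan(columns: Array[String],
                filters: Array[Filter],
                tables: Seq[BaseRelation]): RDD[Row]
  def getDataSource(): String
}

class ExampleJdbcRelation(override val sqlContext: SQLContext, url: String, table: String)
    extends BaseRelation with PushDownScan {

  override def schema: StructType = StructType(Seq(StructField("id", IntegerType)))

  // Relations reporting the same data source can have their join pushed down.
  override def getDataSource(): String = url

  // PrunedFilteredScan entry point: no other relations participate in the scan.
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] =
    buildScan(requiredColumns, filters, Seq.empty)

  // A real implementation would generate a single query joining `table` with
  // the other relations in `tables` and return its result; here: no rows.
  override def buildScan(columns: Array[String],
                         filters: Array[Filter],
                         tables: Seq[BaseRelation]): RDD[Row] =
    sqlContext.sparkContext.emptyRDD[Row]
}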
Future work: Cost Model for Data Source APIs
• Transform Catalyst into a global optimizer
• Global optimizer generates an optimal execution plan across all data sources
• Determines where an operation should be evaluated based on:
1. The cost to execute the operation.
2. The cost to transfer data between Spark and the data sources
• Key factors that affect global optimization:
– Remote table statistics (e.g. number of rows, number of distinct values in each column, etc.)
– Data source characteristics (e.g. CPU speed, I/O rate, network speed, etc.)
• Extend Data Source APIs with data source characteristics
• Retrieve/compute data source table statistics
• Integrate data source cost model into Catalyst
Thank You.
Ioana Delaney ursu@us.ibm.com
Jia Li jiali@us.ibm.com
Visit http://spark.tc