© 2016 IBM Corporation
Using your DB2 skills with Hadoop and Spark
Presented to TRIDEX DB2 Users Group, June 2017
C. M. Saracco, IBM Silicon Valley Lab
https://www.slideshare.net/CynthiaSaracco/presentations
Executive summary
§ About Apache Hadoop and Spark
− Popular open source technologies for working with Big Data
• Clustered computing > scalability
• Varied data > no pre-set structure or schema requirements
− Hadoop: distributed file system (storage), MapReduce API, . . .
− Spark: in-memory data processing (speed), built-in libraries, . . .
§ About Big SQL
− DB2-compatible query engine for Hadoop data (IBM or Hortonworks distributions)
− Based on decades of IBM R&D investment in RDBMS technology, including database
parallelism and query optimization. Strong runtime performance for analytical workloads.
§ Some ways to leverage DB2 SQL skills
− Create / manage / query “local” or distributed tables in Hadoop
− Query / join Hadoop data with DB2, Oracle, Teradata, etc. data via query federation
− Leverage Spark to query and manipulate Big SQL or DB2 data
− Leverage Big SQL to initiate Spark jobs and analyze results
Agenda
§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities
§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation
§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL
§ Performance: 100TB benchmark summary
§ Summary
Information is at the center of a new wave of opportunity… and organizations need deeper insights
§ 1 in 3 business leaders frequently make decisions based on information they don’t trust, or don’t have
§ 1 in 2 business leaders say they don’t have access to the information they need to do their jobs
§ 83% of CIOs cited “Business intelligence and analytics” as part of their visionary plans to enhance competitiveness
§ 60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions
§ Data volumes keep growing (1 ZB = 1 billion TB)
− 4 million “likes” per minute
− 300,000 tweets per minute
− 150 million emails per minute
− 2.78 million video views per minute
− 2.5 TB per day per A350 plane
− > 1 PB per day from gas turbines
Big Data adoption (study results)
§ Educate: Learning about big data capabilities
− 2012 to 2014: 24%-26%; 2015: 10% (250% decrease)
§ Explore: Exploring internal use cases and developing a strategy
− 2012 to 2014: 43%-47%; 2015: 53% (125% increase)
§ Engage: Implementing infrastructure and running pilot activities
− 2012 to 2014: 22%-27%; 2015: 25% (0% change)
§ Execute: Using big data and analytics pervasively across the enterprise
− 2012 to 2014: 5%-6%; 2015: 13% (210% increase)
2015 IBV study “Analytics: The Upside of Disruption” (ibm.biz/w3_2015analytics)
Return on investment period for big data and analytics projects
as reported by respondents
Big Data ROI often < 18 months
2015 IBV study “Analytics: The Upside of Disruption” (ibm.biz/w3_2015analytics)
About Hadoop and Spark
§ Both open source Apache projects
− Exploit distributed computing environments
− Enable processing of large volumes of varied data
§ Hadoop
− Inspired by Google technologies (MapReduce, GFS)
− Originally designed for batch-oriented, read-intensive applications
− “Core” consists of distributed file system, MapReduce, job scheduler, utilities
− Complementary projects span data warehousing, workflow management, columnar data storage, activity monitoring, . . .
§ Spark
− Began as a UC Berkeley project
− Fast, general-purpose engine for working with Big Data in memory
− Popular built-in libraries for machine learning, streaming data, query (SQL), . . .
− No built-in storage; interfaces to Hadoop and other stores
IBM contributions: Hadoop and Spark
Snapshots taken Jan. 2017.
Latest content available online
via Apache dashboards.
IOP relates to Hadoop; STC
relates to Spark.
What is Big SQL?
[Architecture: SQL-based applications connect through the IBM data server client to the Big SQL engine (SQL MPP run-time), which reads data stored in HDFS on IBM Open Platform or Hortonworks Data Platform]
§ Comprehensive, standard SQL for Hadoop
– SELECT: joins, unions, aggregates, subqueries . . .
– UPDATE/DELETE (HBase-managed tables)
– GRANT/REVOKE, INSERT … INTO
– SQL procedural logic (SQL PL)
– Stored procs, user-defined functions
– IBM data server JDBC and ODBC drivers
§ Optimization and performance
– IBM MPP engine (C++) replaces Java MapReduce layer
– Continuously running daemons (no start-up latency)
– Message passing allows data to flow between nodes without persisting intermediate results
– In-memory operations with ability to spill to disk (useful for aggregations, sorts that exceed available RAM)
– Cost-based query optimization with 140+ rewrite rules
§ Various storage formats supported
– Text (delimited), Sequence, RCFile, ORC, Avro, Parquet
– Data persisted in DFS, Hive, HBase
– No IBM proprietary format required
§ Integration with RDBMSs via LOAD, query federation
About Hadoop and Big SQL
§ Big SQL
− Easy on-ramp to Hadoop for DB2 SQL professionals
− Create a query-ready data lake
− Offload “cold” RDBMS warehouse data to Hadoop
− . . . .
§ Some ways to use Big SQL . . .
− Create tables
− Load / insert data
− Execute complex queries
− Exploit various DB2 features: UDFs, EXPLAIN, workload management, Oracle / Netezza SQL compatibility, . . .
− Exploit various Hadoop features: Hive, HBase, SerDes, . . .
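In practice, the on-ramp is just familiar SQL. A minimal end-to-end sketch (table, path, and column names are hypothetical; each statement is covered in detail in the following charts):

```sql
-- create a distributed table stored as Parquet in HDFS
CREATE HADOOP TABLE sales_stage
  ( id INT, amount DECIMAL(10,2) )
  STORED AS parquetfile;

-- bulk-load delimited data from a (hypothetical) file
LOAD HADOOP USING FILE URL '/tmp/sales.csv'
  WITH SOURCE PROPERTIES ('field.delimiter'=',')
  INTO TABLE sales_stage OVERWRITE;

-- query with ordinary SQL
SELECT id, SUM(amount) AS total FROM sales_stage GROUP BY id;
```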
Invocation options
§ Command-line interface: Java SQL Shell (JSqsh)
§ Web tooling (Data Server Manager)
§ Tools that support IBM JDBC/ODBC driver
Creating a Big SQL table
§ Standard CREATE TABLE DDL with extensions
create hadoop table users
(
id int not null primary key,
office_id int null,
fname varchar(30) not null,
lname varchar(30) not null)
row format delimited
fields terminated by '|'
stored as textfile;
Worth noting:
• HADOOP keyword creates the table in HDFS
• Row format delimited and textfile are the default formats
• Constraints are not enforced (but are useful for query optimization)
• Examples in these charts focus on HDFS storage, within or external to the Hive
warehouse. HBase examples provided separately
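When the data already sits in a DFS directory, the table can be layered over it in place with the EXTERNAL keyword and a LOCATION clause. A sketch, assuming a hypothetical directory /user/biadmin/users_data containing '|'-delimited files:

```sql
-- define a table over existing files instead of loading them
CREATE EXTERNAL HADOOP TABLE users_ext
  ( id INT, office_id INT, fname VARCHAR(30), lname VARCHAR(30) )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
  STORED AS TEXTFILE
  LOCATION '/user/biadmin/users_data';
```

Dropping an external table removes only the catalog definition, not the underlying files.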
CREATE VIEW
§ Standard SQL syntax
create view my_users as
select fname, lname from biadmin.users where id > 100;
Populating tables via LOAD
§ Typically best runtime performance
§ Load data from local or remote file system
load hadoop using file url
'sftp://myID:myPassword@myServer.ibm.com:22/install-dir/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE gosalesdw.GO_REGION_DIM overwrite;
§ Load data from an RDBMS (DB2, Netezza, Teradata, Oracle, MS-SQL,
Informix) via JDBC connection
load hadoop
using jdbc connection url 'jdbc:db2://some.host.com:portNum/sampledb'
with parameters (user='myID', password='myPassword')
from table MEDIA columns (ID, NAME)
where 'CONTACTDATE < ''2012-02-01'''
into table media_db2table_jan overwrite
with load properties ('num.map.tasks' = 10);
Populating tables via INSERT
§ INSERT INTO . . . SELECT FROM . . .
− Parallel read and write operations
CREATE HADOOP TABLE IF NOT EXISTS big_sales_parquet
( product_key INT NOT NULL, product_name VARCHAR(150),
Quantity INT, order_method_en VARCHAR(90) )
STORED AS parquetfile;
-- source tables do not need to be in Parquet format
insert into big_sales_parquet
SELECT sales.product_key, pnumb.product_name, sales.quantity, meth.order_method_en
FROM sls_sales_fact sales, sls_product_dim prod,sls_product_lookup pnumb,
sls_order_method_dim meth
WHERE
pnumb.product_language='EN'
AND sales.product_key=prod.product_key
AND prod.product_number=pnumb.product_number
AND meth.order_method_key=sales.order_method_key
and sales.quantity > 5500;
§ INSERT INTO . . . VALUES( . . . )
− Not parallelized. 1 file per INSERT. Not recommended except for quick tests
CREATE HADOOP TABLE foo (col1 int, col2 varchar(10));
INSERT INTO foo VALUES (1, 'hello');
CREATE . . . TABLE . . . AS SELECT . . .
§ Create a Big SQL table based on contents of other table(s)
§ Source tables can be in different file formats or use different
underlying storage mechanisms
-- source tables in this example are external (just DFS files)
CREATE HADOOP TABLE IF NOT EXISTS sls_product_flat
( product_key INT NOT NULL
, product_line_code INT NOT NULL
, product_type_key INT NOT NULL
, product_type_code INT NOT NULL
, product_line_en VARCHAR(90)
, product_line_de VARCHAR(90)
)
as select product_key, d.product_line_code, product_type_key,
product_type_code, product_line_en, product_line_de
from extern.sls_product_dim d, extern.sls_product_line_lookup l
where d.product_line_code = l.product_line_code;
SQL capability highlights
§ Query operations
− Projections, restrictions
− UNION, INTERSECT, EXCEPT
− Wide range of built-in functions (e.g. OLAP)
− Various Oracle, Netezza compatibility items
§ Full support for subqueries
− In SELECT, FROM, WHERE and
HAVING clauses
− Correlated and uncorrelated
− Equality, non-equality subqueries
− EXISTS, NOT EXISTS, IN, ANY,
SOME, etc.
§ All standard join operations
− Standard and ANSI join syntax
− Inner, outer, and full outer joins
− Equality, non-equality, cross join support
− Multi-value join
§ Stored procedures, user-defined
functions, user-defined aggregates
SELECT
s_name,
count(*) AS numwait
FROM
supplier,
lineitem l1,
orders,
nation
WHERE
s_suppkey = l1.l_suppkey
AND o_orderkey = l1.l_orderkey
AND o_orderstatus = 'F'
AND l1.l_receiptdate > l1.l_commitdate
AND EXISTS (
SELECT
*
FROM
lineitem l2
WHERE
l2.l_orderkey = l1.l_orderkey
AND l2.l_suppkey <> l1.l_suppkey
)
AND NOT EXISTS (
SELECT
*
FROM
lineitem l3
WHERE
l3.l_orderkey = l1.l_orderkey
AND l3.l_suppkey <> l1.l_suppkey
AND l3.l_receiptdate > l3.l_commitdate
)
AND s_nationkey = n_nationkey
AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name;
Power of standard SQL
§ Big SQL executes all 22 TPC-H queries without modification
§ Big SQL executes all 99 TPC-DS queries without modification
§ Big SQL leverages DB2 query rewrite technology for efficient optimization
Original query:
SELECT s_name, count(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
AND o_orderkey = l1.l_orderkey
AND o_orderstatus = 'F'
AND l1.l_receiptdate > l1.l_commitdate
AND EXISTS (
SELECT *
FROM lineitem l2
WHERE l2.l_orderkey = l1.l_orderkey
AND l2.l_suppkey <> l1.l_suppkey)
AND NOT EXISTS (
SELECT *
FROM lineitem l3
WHERE l3.l_orderkey = l1.l_orderkey
AND l3.l_suppkey <> l1.l_suppkey
AND l3.l_receiptdate > l3.l_commitdate)
AND s_nationkey = n_nationkey
AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name
Re-written query:
SELECT s_name, count(1) AS numwait
FROM
(SELECT s_name FROM
(SELECT s_name, t2.l_orderkey, l_suppkey,
count_suppkey, max_suppkey
FROM
(SELECT l_orderkey,
count(distinct l_suppkey) as count_suppkey,
max(l_suppkey) as max_suppkey
FROM lineitem
WHERE l_receiptdate > l_commitdate
GROUP BY l_orderkey) t2
RIGHT OUTER JOIN
(SELECT s_name, l_orderkey, l_suppkey
FROM
(SELECT s_name, t1.l_orderkey, l_suppkey,
count_suppkey, max_suppkey
FROM
(SELECT l_orderkey,
count(distinct l_suppkey) as count_suppkey,
max(l_suppkey) as max_suppkey
FROM lineitem
GROUP BY l_orderkey) t1
JOIN
(SELECT s_name, l_orderkey, l_suppkey
FROM orders o
JOIN
(SELECT s_name, l_orderkey, l_suppkey
FROM nation n
JOIN supplier s
ON s.s_nationkey = n.n_nationkey
AND n.n_name = 'INDONESIA'
JOIN lineitem l
ON s.s_suppkey = l.l_suppkey
WHERE l.l_receiptdate > l.l_commitdate) l1
ON o.o_orderkey = l1.l_orderkey
AND o.o_orderstatus = 'F') l2
ON l2.l_orderkey = t1.l_orderkey) a
WHERE (count_suppkey > 1) or ((count_suppkey=1)
AND (l_suppkey <> max_suppkey))) l3
ON l3.l_orderkey = t2.l_orderkey) b
WHERE (count_suppkey is null)
OR ((count_suppkey=1) AND (l_suppkey = max_suppkey))) c
GROUP BY s_name
ORDER BY numwait DESC, s_name
Query federation = virtualized data access
Transparent
§ Appears to be one source
§ Programmers don’t need to know how /
where data is stored
Heterogeneous
§ Accesses data from diverse sources
High Function
§ Full query support against all data
§ Capabilities of sources as well
Autonomous
§ Non-disruptive to data sources, existing
applications, systems.
High Performance
§ Optimization of distributed queries
[Diagram: SQL tools and applications access virtualized data spanning multiple data sources]
Federation in practice
§ Admin enables
federation
§ Apps connect to Big
SQL database
§ Nicknames look like
tables to the app
§ Big SQL optimizer
creates global data
access plan with cost
analysis, query push
down
§ Query fragments
executed remotely
[Diagram: apps connect to the Big SQL federation server; the cost-based optimizer builds local + remote execution plans and sends query fragments, in the source's native dialect, through wrappers / client libraries to remote sources, which appear locally as nicknames alongside ordinary tables]
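Under the covers this is standard DB2 federation DDL. A minimal setup sketch — wrapper, server name, host, port, credentials, and the remote table are all hypothetical:

```sql
-- one-time setup by an administrator after enabling federation
CREATE WRAPPER drda;   -- wrapper for DB2-family sources

CREATE SERVER mydb2 TYPE db2/udb VERSION 10.5 WRAPPER drda
  AUTHORIZATION "myid" PASSWORD "myPassword"
  OPTIONS (HOST 'db2host.example.com', PORT '50000', DBNAME 'SAMPLE');

CREATE USER MAPPING FOR bigsql SERVER mydb2
  OPTIONS (REMOTE_AUTHID 'myid', REMOTE_PASSWORD 'myPassword');

-- the nickname looks like a local table to applications
CREATE NICKNAME db2_products FOR mydb2.myschema.products;

SELECT * FROM db2_products FETCH FIRST 5 ROWS ONLY;
```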
Joining data across sources
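The chart for this slide showed a cross-source join. A minimal sketch, assuming the Big SQL table big_sales_parquet created earlier and a hypothetical nickname db2_products defined over a remote DB2 table:

```sql
-- join Hadoop data with remote DB2 data in a single statement
SELECT s.product_key, SUM(s.quantity) AS total_qty
FROM big_sales_parquet s   -- Big SQL table in HDFS
JOIN db2_products p        -- nickname over a remote DB2 table
  ON s.product_key = p.product_key
GROUP BY s.product_key;
```

The optimizer decides which query fragments to push down to the remote source and which to evaluate locally.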
About Spark and Big SQL
§ Easy to query Big SQL (or DB2 LUW) tables through Spark SQL
− See link to self-study lab in “Resources” section
§ Follow typical Spark SQL JDBC data source pattern
− Identify JDBC driver and connection properties
− Load table contents into DataFrame, Spark SQL temporary view
− Execute Spark SQL queries
− Applies to Big SQL tables in Hive warehouse, HBase, or arbitrary HDFS
directory
− Query results can be manipulated via other Spark libraries
§ Technical preview: Launch Spark jobs from Big SQL via UDF
Accessing Big SQL data from Spark shell
// based on BigInsights tech preview release that includes Spark 2.1
// Launch shell with --driver-class-path pointing to JDBC driver .jar
// read data from Big SQL table “t1” and load into a DataFrame
val sampleDF = spark.read.format("jdbc")
.option("url", "jdbc:db2://yourHost.com:32051/BIGSQL")
.option("dbtable", "yourSchema.t1")
.option("user", "yourID").option("password", "yourPassword")
.load()
// display full contents
sampleDF.show()
// create a Spark SQL temporary view to query
sampleDF.createOrReplaceTempView("v1")
// query the view and display the results
sql("select col1, col3 from v1 where col2 > 100 limit 15").show()
Technical preview: launch Spark jobs from Big SQL
§ Spark jobs can be invoked from Big SQL using a table UDF
abstraction
§ Example: Call the SYSHADOOP.EXECSPARK built-in UDF to kick
off a Spark job that reads a JSON file stored on HDFS
SELECT *
FROM TABLE(SYSHADOOP.EXECSPARK(
  language => 'scala',
  class => 'com.ibm.biginsights.bigsql.examples.ReadJsonFile',
  uri => 'hdfs://host.port.com:8020/user/bigsql/demo.json',
  card => 100000)) AS doc
WHERE doc.country IS NOT NULL
What is TPC-DS?
§ TPC = Transaction Processing Performance Council
− Non-profit corporation (vendor independent)
− Defines various industry-driven database benchmarks . . . DS = Decision Support
− Models a multi-domain data warehouse environment for a hypothetical retailer
• Subject areas: Retail Sales, Web Sales, Inventory, Demographics, Promotions
• 99 pre-defined queries; query classes: Reporting, Ad Hoc, Iterative OLAP, Data Mining
• Multiple scale factors: 100GB, 300GB, 1TB, 3TB, 10TB, 30TB and 100TB
100TB TPC-DS is BIG data
Benchmark Environment: IBM “F1” Spark SQL Cluster
§ 28 nodes total (Lenovo x3640 M5), each configured as:
− 2 sockets (18 cores/socket)
− 1.5 TB RAM
− 8x 2TB SSD
§ 2 racks (42U), 20x 2U servers per rack
§ 1 switch: Mellanox SN2700, 100GbE, 32 ports
Spark SQL 2.1 Hadoop-DS @ 100TB: at a glance
§ Performance: Spark SQL completes more TPC-DS queries than any other open source SQL engine for Hadoop @ 100TB scale
§ Compression: 60% space saved with Parquet
Query Compliance Through the Scale Factors
§ SQL compliance is important because Business Intelligence tools generate standard SQL
− Rewriting queries is painful and impacts productivity
§ Spark SQL
− Spark SQL 2.1 can run all 99 TPC-DS queries, but only at lower scale factors
− Spark SQL failures @ 100 TB: 12 runtime errors, 4 timeouts (> 10 hours)
§ Big SQL
− Big SQL has been successfully executing all 99 queries since Oct 2014
− IBM is the only vendor that has proven SQL compatibility at scale factors up to 100TB
Big SQL is 3.2X faster than Spark 2.1
(4 Concurrent Streams)
Big SQL @ 99 queries still outperforms Spark SQL @ 83 queries
Hadoop-DS @ 100TB: at a glance
§ Performance: Big SQL 3.2x faster than Spark
§ CPU (vs Spark): Big SQL uses 3.7x less CPU; average CPU usage 76.4%
§ I/O (vs Spark): Big SQL reads 12x less data and writes 30x less data; max I/O throughput: read 4.4 GB/sec, write 2.8 GB/sec
§ Compression: 60% space saved with Parquet
Recommendation: Right Tool for the Right Job
§ Big SQL — ideal tool for BI data analysts and production workloads
− Migrating existing workloads to Hadoop
− Security
− Many concurrent users
− Best performance
§ Spark SQL — ideal tool for data scientists and discovery
− Machine learning
− Simpler SQL
− Good performance
§ Not mutually exclusive: Big SQL & Spark SQL can co-exist in the cluster
Summary
§ Big SQL = easy path for DB2 professionals to work with Big Data
§ Runs on popular Hadoop platforms from IBM, Hortonworks
§ Integrates with Spark
§ Compatible with DB2 and ISO SQL
§ Brings high-performance, enterprise-grade query engine to popular
open source Big Data platforms
Want to learn more?
§ Hadoop Dev
https://developer.ibm.com/hadoop/
§ Labs: Big SQL intro, Spark / Big SQL, . . .
https://developer.ibm.com/hadoop/docs/getting-started/tutorials/big-sql-hadoop-tutorial/
§ 100TB benchmark
https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/
§ This presentation
https://www.slideshare.net/CynthiaSaracco/presentations
Supplemental
Big SQL architecture
§ Head (coordinator / management) node
− Listens for JDBC/ODBC connections
− Compiles and optimizes the query
− Optionally stores user data in a DB2-compatible table (single node only); useful for some reference data
§ Big SQL worker processes reside on compute nodes (some or all)
§ Worker nodes stream data between each other as needed
§ Workers can spill large data sets to local disk if needed
− Allows Big SQL to work with data sets larger than available memory
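As a quick illustration of the head-node table option: omitting the HADOOP keyword creates a single-node, DB2-compatible table rather than a distributed one (table and column names here are illustrative):

```sql
-- small reference table stored on the head node, not in HDFS
CREATE TABLE offices
  ( office_id INT NOT NULL PRIMARY KEY,
    city VARCHAR(40) );

-- joins freely with distributed Hadoop tables such as users
SELECT u.fname, u.lname, o.city
FROM users u JOIN offices o ON u.office_id = o.office_id;
```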
CPU Profile for Big SQL vs. Spark SQL
Hadoop-DS @ 100TB, 4 Concurrent Streams
Spark SQL uses almost 3x more system CPU. These are wasted CPU cycles.
§ Big SQL average CPU utilization: 76.4%
§ Spark SQL average CPU utilization: 88.2%
I/O Profile for Big SQL vs. Spark SQL
Hadoop-DS @ 100TB, 4 Concurrent Streams
§ Spark SQL required 3.6x more reads and 9.5x more writes
§ Big SQL can drive peak I/O nearly 2x higher

More Related Content

PDF
Big Data: Working with Big SQL data from Spark
PDF
Big Data: HBase and Big SQL self-study lab
PDF
Big Data: Getting started with Big SQL self-study guide
PDF
Big Data: Big SQL web tooling (Data Server Manager) self-study lab
PDF
Big SQL 3.0 - Toronto Meetup -- May 2014
PDF
Big Data: Get started with SQL on Hadoop self-study lab
PDF
Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!
PDF
Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics
Big Data: Working with Big SQL data from Spark
Big Data: HBase and Big SQL self-study lab
Big Data: Getting started with Big SQL self-study guide
Big Data: Big SQL web tooling (Data Server Manager) self-study lab
Big SQL 3.0 - Toronto Meetup -- May 2014
Big Data: Get started with SQL on Hadoop self-study lab
Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!
Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics

What's hot (17)

PDF
Big Data: SQL query federation for Hadoop and RDBMS data
PDF
Big Data: Querying complex JSON data with BigInsights and Hadoop
PDF
Hands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
PDF
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL
PDF
Big Data: Big SQL and HBase
PDF
Big Data: SQL on Hadoop from IBM
PDF
Big Data: Explore Hadoop and BigInsights self-study lab
PDF
Big SQL Competitive Summary - Vendor Landscape
PDF
Taming Big Data with Big SQL 3.0
PPT
Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
PDF
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
PDF
Getting started with Hadoop on the Cloud with Bluemix
PDF
Big SQL 3.0 - Fast and easy SQL on Hadoop
PPTX
Hadoop Innovation Summit 2014
PDF
SQL on Hadoop
PDF
Running Cognos on Hadoop
PDF
Advanced Security In Hadoop Cluster
Big Data: SQL query federation for Hadoop and RDBMS data
Big Data: Querying complex JSON data with BigInsights and Hadoop
Hands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL
Big Data: Big SQL and HBase
Big Data: SQL on Hadoop from IBM
Big Data: Explore Hadoop and BigInsights self-study lab
Big SQL Competitive Summary - Vendor Landscape
Taming Big Data with Big SQL 3.0
Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Getting started with Hadoop on the Cloud with Bluemix
Big SQL 3.0 - Fast and easy SQL on Hadoop
Hadoop Innovation Summit 2014
SQL on Hadoop
Running Cognos on Hadoop
Advanced Security In Hadoop Cluster
Ad

Similar to Using your DB2 SQL Skills with Hadoop and Spark (20)

PDF
ESGYN Overview
PDF
Rajeev kumar apache_spark &amp; scala developer
PDF
Big SQL NYC Event December by Virender
PPTX
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
PPTX
Uotm workshop
PDF
Ibm db2 big sql
PDF
Agile data lake? An oxymoron?
PPTX
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
DOC
SAMADMohammad
PPTX
Run Oracle Apps in the Cloud with dashDB
PDF
Big Data Journey
PDF
Power BI with Essbase in the Oracle Cloud
DOCX
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
DOC
Chris Asano.dba.20160512a
PDF
NRB - BE MAINFRAME DAY 2017 - Data spark and the data federation
 
PDF
NRB - LUXEMBOURG MAINFRAME DAY 2017 - Data Spark and the Data Federation
 
DOCX
Rama prasad owk etl hadoop_developer
PPTX
Big Data with SQL Server
PPTX
Demystifying Data Warehouse as a Service
PPTX
Professional Portfolio
ESGYN Overview
Rajeev kumar apache_spark &amp; scala developer
Big SQL NYC Event December by Virender
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Uotm workshop
Ibm db2 big sql
Agile data lake? An oxymoron?
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
SAMADMohammad
Run Oracle Apps in the Cloud with dashDB
Big Data Journey
Power BI with Essbase in the Oracle Cloud
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Chris Asano.dba.20160512a
NRB - BE MAINFRAME DAY 2017 - Data spark and the data federation
 
NRB - LUXEMBOURG MAINFRAME DAY 2017 - Data Spark and the Data Federation
 
Rama prasad owk etl hadoop_developer
Big Data with SQL Server
Demystifying Data Warehouse as a Service
Professional Portfolio
Ad

Recently uploaded (20)

PDF
WOOl fibre morphology and structure.pdf for textiles
PDF
CloudStack 4.21: First Look Webinar slides
PDF
sustainability-14-14877-v2.pddhzftheheeeee
DOCX
search engine optimization ppt fir known well about this
PDF
Zenith AI: Advanced Artificial Intelligence
PDF
A review of recent deep learning applications in wood surface defect identifi...
PDF
A novel scalable deep ensemble learning framework for big data classification...
PDF
NewMind AI Weekly Chronicles – August ’25 Week III
PDF
Hindi spoken digit analysis for native and non-native speakers
PDF
1 - Historical Antecedents, Social Consideration.pdf
PDF
Developing a website for English-speaking practice to English as a foreign la...
PDF
Univ-Connecticut-ChatGPT-Presentaion.pdf
PDF
STKI Israel Market Study 2025 version august
PDF
August Patch Tuesday
PDF
Hybrid horned lizard optimization algorithm-aquila optimizer for DC motor
PPTX
MicrosoftCybserSecurityReferenceArchitecture-April-2025.pptx
PDF
Transform Your ITIL® 4 & ITSM Strategy with AI in 2025.pdf
PDF
Unlock new opportunities with location data.pdf
PDF
TrustArc Webinar - Click, Consent, Trust: Winning the Privacy Game
PDF
Getting Started with Data Integration: FME Form 101
WOOl fibre morphology and structure.pdf for textiles
CloudStack 4.21: First Look Webinar slides
sustainability-14-14877-v2.pddhzftheheeeee
search engine optimization ppt fir known well about this
Zenith AI: Advanced Artificial Intelligence
A review of recent deep learning applications in wood surface defect identifi...
A novel scalable deep ensemble learning framework for big data classification...
NewMind AI Weekly Chronicles – August ’25 Week III
Hindi spoken digit analysis for native and non-native speakers
1 - Historical Antecedents, Social Consideration.pdf
Developing a website for English-speaking practice to English as a foreign la...
Univ-Connecticut-ChatGPT-Presentaion.pdf
STKI Israel Market Study 2025 version august
August Patch Tuesday
Hybrid horned lizard optimization algorithm-aquila optimizer for DC motor
MicrosoftCybserSecurityReferenceArchitecture-April-2025.pptx
Transform Your ITIL® 4 & ITSM Strategy with AI in 2025.pdf
Unlock new opportunities with location data.pdf
TrustArc Webinar - Click, Consent, Trust: Winning the Privacy Game
Getting Started with Data Integration: FME Form 101

Using your DB2 SQL Skills with Hadoop and Spark

  • 1. © 2016 IBM Corporation Using your DB2 skills with Hadoop and Spark Presented to TRIDEX DB2 Users Group, June 2017 C. M. Saracco, IBM Silicon Valley Lab https://www.slideshare.net/CynthiaSaracco/presentations
  • 2. © 2016 IBM Corporation2 Executive summary § About Apache Hadoop and Spark − Popular open source technologies for working with Big Data • Clustered computing > scalability • Varied data > no pre-set structure or schema requirements − Hadoop: distributed file system (storage), MapReduce API, . . . − Spark: in-memory data processing (speed), built-in libraries, . . . § About Big SQL − DB2-compatible query engine for Hadoop data (IBM or Hortonworks distributions) − Based on decades of IBM R&D investment in RDBMS technology, including database parallelism and query optimization. Strong runtime performance for analytical workloads. § Some ways to leverage DB2 SQL skills − Create / manage / query “local” or distributed tables in Hadoop − Query / join Hadoop data with DB2, Oracle, Teradata, etc. data via query federation − Leverage Spark to query and manipulate Big SQL or DB2 data − Leverage Big SQL to initiate Spark jobs and analyze result
  • 3. © 2016 IBM Corporation3 Agenda § Big Data background − Market drivers − Open source technologies: Hadoop, Spark − Big SQL architecture / capabilities § Using Hadoop and Big SQL − Create tables / populate with data − Query tables − Explore query federation § Using Spark and Big SQL − Query data using Spark SQL − Launch Spark jobs from Big SQL § Performance: 100TB benchmark summary § Summary
  • 4. © 2016 IBM Corporation4 Agenda § Big Data background − Market drivers − Open source technologies: Hadoop, Spark − Big SQL architecture / capabilities § Using Hadoop and Big SQL − Create tables / populate with data − Query tables − Explore query federation § Using Spark and Big SQL − Query data using Spark SQL − Launch Spark jobs from Big SQL § Performance: 100TB benchmark summary § Summary
  • 5. © 2016 IBM Corporation5 Business leaders frequently make decisions based on information they don’ttrust, or don’t have1in3 83% of CIOs cited “Business intelligence and analytics” as part of their visionary plans to enhance competitiveness Business leaders say they don’t have access to the information they need to do their jobs 1in2 of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions 60% … and organizations need deeper insights Information is at the center of a new wave of opportunity… 4 million “likes” per minute 300,000 tweets per minute 150 million emails per minute 2.78 million video views per minute 2.5 TB per day per A350 plane > 1 PB per day gas turbines 1 ZB = 1 billion TB
  • 6. © 2016 IBM Corporation6 Big Data adoption (study results) 2012 to 2014 2015 22%-27% 25% 0% change 2012 to 2014 2015 24%-26% 10% 250% decrease Educate: Learning about big data capabilities 2012 to 2014 2015 43%-47% 53% 125% increase Explore: Exploring internal use cases and developing a strategy Engage: Implementing infrastructure and running pilot activities 2012 to 2014 2015 5%-6% 13% 210% increase Execute: Using big data and analytics pervasively across the enterprise 2015 IBV study “Analytics: The Upside of Disruption” (ibm.biz/w3_2015analytics)
  • 7. © 2016 IBM Corporation7 Return on investment period for big data and analytics projects as reported by respondents Big Data ROI often < 18 months 2015 IBV study “Analytics: The Upside of Disruption” (ibm.biz/w3_2015analytics)
  • 8. © 2016 IBM Corporation8 § Both open source Apache projects − Exploit distributed computing environments − Enable processing of large volumes of varied data § Hadoop − Inspired by Google technologies (MapReduce, GFS) − Originally designed for batch-oriented, read-intensive applications − “Core” consists of distributed file system, MapReduce, job scheduler, utilities − Complementary projects span data warehousing, workflow management, columnar data storage, activity monitoring, . . . § Spark − Began as a UC Berkeley project − Fast, general-purpose engine for working with Big Data in memory − Popular built-in libraries for machine learning, streaming data, query (SQL), . . . − No built-in storage. Interfaces to Hadoop, other stores About Hadoop and Spark
  • 9. © 2016 IBM Corporation9 IBM contributions: Hadoop and Spark Snapshots taken Jan. 2017. Latest content available online via Apache dashboards. IOP relates to Hadoop; STC relates to Spark.
  • 10. © 2016 IBM Corporation10 What is Big SQL? SQL-based Application Big SQL Engine Data Storage IBM data server client SQL MPP Run-time HDFS § Comprehensive, standard SQL for Hadoop – SELECT: joins, unions, aggregates, subqueries . . . – UPDATE/DELETE (HBase-managed tables) – GRANT/REVOKE, INSERT … INTO – SQL procedural logic (SQL PL) – Stored procedures, user-defined functions – IBM data server JDBC and ODBC drivers § Optimization and performance – IBM MPP engine (C++) replaces Java MapReduce layer – Continuously running daemons (no start-up latency) – Message passing allows data to flow between nodes without persisting intermediate results – In-memory operations with ability to spill to disk (useful for aggregations, sorts that exceed available RAM) – Cost-based query optimization with 140+ rewrite rules § Various storage formats supported – Text (delimited), Sequence, RCFile, ORC, Avro, Parquet – Data persisted in DFS, Hive, HBase – No IBM proprietary format required § Integration with RDBMSs via LOAD, query federation IBM Open Platform or Hortonworks Data Platform
  • 11. © 2016 IBM Corporation11 Agenda § Big Data background − Market drivers − Open source technologies: Hadoop, Spark − Big SQL architecture / capabilities § Using Hadoop and Big SQL − Create tables / populate with data − Query tables − Explore query federation § Using Spark and Big SQL − Query data using Spark SQL − Launch Spark jobs from Big SQL § Performance: 100TB benchmark summary § Summary
  • 12. © 2016 IBM Corporation12 § Big SQL − Easy on-ramp to Hadoop for DB2 SQL professionals − Create query-ready data lake − Offload “cold” RDBMS warehouse data to Hadoop − . . . . § Some ways to use Big SQL . . . − Create tables − Load / insert data − Execute complex queries − Exploit various DB2 features: UDFs, EXPLAIN, workload management, Oracle / Netezza SQL compatibility. . . . − Exploit various Hadoop features: Hive, HBase, SerDes, . . . About Hadoop and Big SQL
  • 13. © 2016 IBM Corporation13 Invocation options § Command-line interface: Java SQL Shell (JSqsh) § Web tooling (Data Server Manager) § Tools that support IBM JDBC/ODBC driver
  • 14. © 2016 IBM Corporation14 Creating a Big SQL table § Standard CREATE TABLE DDL with extensions create hadoop table users ( id int not null primary key, office_id int null, fname varchar(30) not null, lname varchar(30) not null) row format delimited fields terminated by '|' stored as textfile; Worth noting: • The “hadoop” keyword creates the table in HDFS • Delimited row format and textfile storage are the defaults • Constraints are not enforced (but useful for query optimization) • Examples in these charts focus on HDFS storage, whether within or external to the Hive warehouse. HBase examples provided separately
  • 15. © 2016 IBM Corporation15 CREATE VIEW § Standard SQL syntax create view my_users as select fname, lname from biadmin.users where id > 100;
  • 16. © 2016 IBM Corporation16 Populating tables via LOAD § Typically best runtime performance § Load data from local or remote file system load hadoop using file url 'sftp://myID:[email protected]:22/install-dir/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE gosalesdw.GO_REGION_DIM overwrite; § Load data from an RDBMS (DB2, Netezza, Teradata, Oracle, MS-SQL, Informix) via JDBC connection load hadoop using jdbc connection url 'jdbc:db2://some.host.com:portNum/sampledb' with parameters (user='myID', password='myPassword') from table MEDIA columns (ID, NAME) where 'CONTACTDATE < ''2012-02-01''' into table media_db2table_jan overwrite with load properties ('num.map.tasks' = 10);
  • 17. © 2016 IBM Corporation17 Populating tables via INSERT § INSERT INTO . . . SELECT FROM . . . − Parallel read and write operations CREATE HADOOP TABLE IF NOT EXISTS big_sales_parquet ( product_key INT NOT NULL, product_name VARCHAR(150), quantity INT, order_method_en VARCHAR(90) ) STORED AS parquetfile; -- source tables do not need to be in Parquet format insert into big_sales_parquet SELECT sales.product_key, pnumb.product_name, sales.quantity, meth.order_method_en FROM sls_sales_fact sales, sls_product_dim prod, sls_product_lookup pnumb, sls_order_method_dim meth WHERE pnumb.product_language='EN' AND sales.product_key=prod.product_key AND prod.product_number=pnumb.product_number AND meth.order_method_key=sales.order_method_key and sales.quantity > 5500; § INSERT INTO . . . VALUES(. . . ) − Not parallelized. 1 file per INSERT. Not recommended except for quick tests CREATE HADOOP TABLE foo (col1 int, col2 varchar(10)); INSERT INTO foo VALUES (1, 'hello');
  • 18. © 2016 IBM Corporation18 CREATE . . . TABLE . . . AS SELECT . . . § Create a Big SQL table based on contents of other table(s) § Source tables can be in different file formats or use different underlying storage mechanisms -- source tables in this example are external (just DFS files) CREATE HADOOP TABLE IF NOT EXISTS sls_product_flat ( product_key INT NOT NULL , product_line_code INT NOT NULL , product_type_key INT NOT NULL , product_type_code INT NOT NULL , product_line_en VARCHAR(90) , product_line_de VARCHAR(90) ) as select product_key, d.product_line_code, product_type_key, product_type_code, product_line_en, product_line_de from extern.sls_product_dim d, extern.sls_product_line_lookup l where d.product_line_code = l.product_line_code;
  • 19. © 2016 IBM Corporation19 SQL capability highlights § Query operations − Projections, restrictions − UNION, INTERSECT, EXCEPT − Wide range of built-in functions (e.g. OLAP) − Various Oracle, Netezza compatibility items § Full support for subqueries − In SELECT, FROM, WHERE and HAVING clauses − Correlated and uncorrelated − Equality, non-equality subqueries − EXISTS, NOT EXISTS, IN, ANY, SOME, etc. § All standard join operations − Standard and ANSI join syntax − Inner, outer, and full outer joins − Equality, non-equality, cross join support − Multi-value join § Stored procedures, user-defined functions, user-defined aggregates SELECT s_name, count(*) AS numwait FROM supplier, lineitem l1, orders, nation WHERE s_suppkey = l1.l_suppkey AND o_orderkey = l1.l_orderkey AND o_orderstatus = 'F' AND l1.l_receiptdate > l1.l_commitdate AND EXISTS ( SELECT * FROM lineitem l2 WHERE l2.l_orderkey = l1.l_orderkey AND l2.l_suppkey <> l1.l_suppkey ) AND NOT EXISTS ( SELECT * FROM lineitem l3 WHERE l3.l_orderkey = l1.l_orderkey AND l3.l_suppkey <> l1.l_suppkey AND l3.l_receiptdate > l3.l_commitdate ) AND s_nationkey = n_nationkey AND n_name = ':1' GROUP BY s_name ORDER BY numwait desc, s_name;
  • 20. © 2016 IBM Corporation20 Power of standard SQL § Big SQL executes all 22 TPC-H queries without modification § Big SQL executes all 99 TPC-DS queries without modification § Big SQL leverages DB2 query rewrite technology for efficient optimization Original query: SELECT s_name, count(*) AS numwait FROM supplier, lineitem l1, orders, nation WHERE s_suppkey = l1.l_suppkey AND o_orderkey = l1.l_orderkey AND o_orderstatus = 'F' AND l1.l_receiptdate > l1.l_commitdate AND EXISTS ( SELECT * FROM lineitem l2 WHERE l2.l_orderkey = l1.l_orderkey AND l2.l_suppkey <> l1.l_suppkey) AND NOT EXISTS ( SELECT * FROM lineitem l3 WHERE l3.l_orderkey = l1.l_orderkey AND l3.l_suppkey <> l1.l_suppkey AND l3.l_receiptdate > l3.l_commitdate) AND s_nationkey = n_nationkey AND n_name = ':1' GROUP BY s_name ORDER BY numwait desc, s_name [The slide contrasts this original query with the manually re-written equivalent required by engines lacking full subquery support: a considerably longer query that replaces the EXISTS / NOT EXISTS predicates with nested derived tables computing count(distinct l_suppkey) and max(l_suppkey) per l_orderkey, joined back through RIGHT OUTER JOINs.]
  • 21. © 2016 IBM Corporation21 Query federation = virtualized data access Transparent § Appears to be one source § Programmers don’t need to know how / where data is stored Heterogeneous § Accesses data from diverse sources High Function § Full query support against all data § Capabilities of sources as well Autonomous § Non-disruptive to data sources, existing applications, systems. High Performance § Optimization of distributed queries SQL tools, applications Data sources Virtualized data
  • 22. © 2016 IBM Corporation22 Federation in practice § Admin enables federation § Apps connect to Big SQL database § Nicknames look like tables to the app § Big SQL optimizer creates global data access plan with cost analysis, query push down § Query fragments executed remotely Nickname Nickname Table Cost-based optimizer Wrapper Client library Wrapper Client library Local + Remote Execution Plans Remote sources Federation server (Big SQL) Native dialect Connect to bigsql
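The admin steps on this slide map onto DB2-style federation DDL, which Big SQL inherits. The sketch below is illustrative only, not taken from the deck: the wrapper and server options differ by data source and release, and every name in it (the drda wrapper choice, DB2SRV, db2node, SALESDB, remote_schema, the credentials) is hypothetical.

```sql
-- Sketch: federating a remote DB2 source into Big SQL (all names hypothetical)

-- 1. Register a wrapper for the source's client library
CREATE WRAPPER drda;

-- 2. Define the remote server; connectivity options (cataloged node,
--    database name) vary by source type and release
CREATE SERVER DB2SRV TYPE DB2/UDB VERSION '10.5' WRAPPER drda
   OPTIONS (NODE 'db2node', DBNAME 'SALESDB');

-- 3. Map the local authorization ID to remote credentials
CREATE USER MAPPING FOR USER SERVER DB2SRV
   OPTIONS (REMOTE_AUTHID 'myID', REMOTE_PASSWORD 'myPassword');

-- 4. Create a nickname: to applications it looks like a local table
CREATE NICKNAME bigsql.db2_customers FOR DB2SRV.remote_schema.CUSTOMERS;
```

Once the nickname exists, applications simply connect to the Big SQL database and query it; the optimizer decides which fragments to push down to the remote source.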
  • 23. © 2016 IBM Corporation23 Joining data across sources
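The cross-source join pictured on this slide reduces to a single SELECT once a nickname is in place. A hypothetical sketch, assuming hadoop_sales is a local Big SQL Hadoop table and db2_customers is a nickname over a remote DB2 table (both names invented for illustration):

```sql
-- Join Hadoop-resident data with a federated DB2 table via its nickname
SELECT c.cust_name,
       SUM(s.amount) AS total_sales
FROM   bigsql.hadoop_sales   s            -- local Big SQL Hadoop table
JOIN   bigsql.db2_customers  c            -- nickname: resolved remotely
       ON s.cust_id = c.cust_id
WHERE  s.sale_date >= '2017-01-01'
GROUP  BY c.cust_name;
```

Because the optimizer costs both sources, eligible predicates can be pushed down to the remote database so that only qualifying rows are shipped back for the join.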
  • 24. © 2016 IBM Corporation24 Agenda § Big Data background − Market drivers − Open source technologies: Hadoop, Spark − Big SQL architecture / capabilities § Using Hadoop and Big SQL − Create tables / populate with data − Query tables − Explore query federation § Using Spark and Big SQL − Query data using Spark SQL − Launch Spark jobs from Big SQL § Performance: 100TB benchmark summary § Summary
  • 25. © 2016 IBM Corporation25 About Spark and Big SQL § Easy to query Big SQL (or DB2 LUW) tables through Spark SQL − See link to self-study lab in “Resources” section § Follow typical Spark SQL JDBC data source pattern − Identify JDBC driver and connection properties − Load table contents into DataFrame, Spark SQL temporary view − Execute Spark SQL queries − Applies to Big SQL tables in Hive warehouse, HBase, or arbitrary HDFS directory − Query results can be manipulated via other Spark libraries § Technical preview: Launch Spark jobs from Big SQL via UDF
  • 26. © 2016 IBM Corporation26 Accessing Big SQL data from Spark shell // based on BigInsights tech preview release that includes Spark 2.1 // Launch shell with --driver-class-path pointing to JDBC driver .jar // read data from Big SQL table "t1" and load into a DataFrame val sampleDF = spark.read.format("jdbc") .option("url", "jdbc:db2://yourHost.com:32051/BIGSQL") .option("dbtable", "yourSchema.t1") .option("user", "yourID").option("password", "yourPassword") .load() // display full contents sampleDF.show() // create a Spark SQL temporary view to query sampleDF.createOrReplaceTempView("v1") // query the view and display the results sql("select col1, col3 from v1 where col2 > 100 limit 15").show()
  • 27. © 2016 IBM Corporation27 Technical preview: launch Spark jobs from Big SQL § Spark jobs can be invoked from Big SQL using a table UDF abstraction § Example: Call the SYSHADOOP.EXECSPARK built-in UDF to kick off a Spark job that reads a JSON file stored on HDFS SELECT * FROM TABLE(SYSHADOOP.EXECSPARK( language => 'scala', class => 'com.ibm.biginsights.bigsql.examples.ReadJsonFile', uri => 'hdfs://host.port.com:8020/user/bigsql/demo.json', card => 100000)) AS doc WHERE doc.country IS NOT NULL
  • 28. © 2016 IBM Corporation28 Agenda § Big Data background − Market drivers − Open source technologies: Hadoop, Spark − Big SQL architecture / capabilities § Using Hadoop and Big SQL − Create tables / populate with data − Query tables − Explore query federation § Using Spark and Big SQL − Query data using Spark SQL − Launch Spark jobs from Big SQL § Performance: 100TB benchmark summary https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/ § Summary
  • 29. © 2016 IBM Corporation29 What is TPC-DS? § TPC = Transaction Processing Performance Council − Non-profit corporation (vendor independent) − Defines various industry-driven database benchmarks. DS = Decision Support − Models a multi-domain data warehouse environment for a hypothetical retailer: retail sales, web sales, inventory, demographics, promotions § Multiple scale factors: 100GB, 300GB, 1TB, 3TB, 10TB, 30TB and 100TB § 99 pre-defined queries § Query classes: Reporting, Ad Hoc, Iterative OLAP, Data Mining
  • 30. © 2016 IBM Corporation30 100TB TPC-DS is BIG data
  • 31. © 2016 IBM Corporation31 Benchmark Environment: IBM “F1” Spark SQL Cluster § 28 Nodes Total (Lenovo x3650 M5) § Each configured as: • 2 sockets (18 cores/socket) • 1.5 TB RAM • 8x 2TB SSD § 2 Racks − 20x 2U servers per rack (42U racks) § 1 Switch, 100GbE, 32 ports Mellanox SN2700
  • 32. © 2016 IBM Corporation32 Spark SQL 2.1, Hadoop-DS @ 100TB at a glance: performance, working queries, and compression (60% space saved with Parquet). Spark SQL completes more TPC-DS queries than any other open source SQL engine for Hadoop @ 100TB scale
  • 33. © 2016 IBM Corporation33 Query Compliance Through the Scale Factors § SQL compliance is important because Business Intelligence tools generate standard SQL − Rewriting queries is painful and impacts productivity § Spark SQL: Spark SQL 2.1 can run all 99 TPC-DS queries, but only at lower scale factors − Spark SQL failures @ 100 TB: 12 runtime errors, 4 timeouts (> 10 hours) § Big SQL: Big SQL has been successfully executing all 99 queries since Oct 2014 − IBM is the only vendor that has proven SQL compatibility at scale factors up to 100TB
  • 34. © 2016 IBM Corporation34 Big SQL is 3.2X faster than Spark 2.1 (4 Concurrent Streams) Big SQL @ 99 queries still outperforms Spark SQL @ 83 queries
  • 35. © 2016 IBM Corporation35 Hadoop-DS @ 100TB at a glance − Performance: Big SQL 3.2x faster − CPU (vs Spark): Big SQL uses 3.7x less CPU; average CPU usage 76.4% − I/O (vs Spark): Big SQL reads 12x less data and writes 30x less data; max I/O throughput 4.4 GB/sec read, 2.8 GB/sec write − Compression: 60% space saved with Parquet
  • 36. © 2016 IBM Corporation36 Recommendation: Right Tool for the Right Job § Spark SQL: ideal tool for Data Scientists and discovery − Machine learning, simpler SQL, good performance § Big SQL: ideal tool for BI Data Analysts and production workloads − Migrating existing workloads to Hadoop, security, many concurrent users, best performance § Not mutually exclusive: Big SQL & Spark SQL can co-exist in the cluster
  • 37. © 2016 IBM Corporation37 Agenda § Big Data background − Market drivers − Open source technologies: Hadoop, Spark − Big SQL architecture / capabilities § Using Hadoop and Big SQL − Create tables / populate with data − Query tables − Explore query federation § Using Spark and Big SQL − Query data using Spark SQL − Launch Spark jobs from Big SQL § Performance: 100TB benchmark summary § Summary
  • 38. © 2016 IBM Corporation38 Summary § Big SQL = easy path for DB2 professionals to work with Big Data § Runs on popular Hadoop platforms from IBM, Hortonworks § Integrates with Spark § Compatible with DB2 and ISO SQL § Brings high-performance, enterprise-grade query engine to popular open source Big Data platforms
  • 39. © 2016 IBM Corporation39 Want to learn more? § Hadoop Dev https://developer.ibm.com/hadoop/ § Labs: Big SQL intro, Spark / Big SQL, . . . https://developer.ibm.com/hadoop/docs/getting- started/tutorials/big-sql-hadoop-tutorial/ § 100TB benchmark https://developer.ibm.com/hadoop/2017/02/07/experiences- comparing-big-sql-and-spark-sql-at-100tb/ § This presentation https://www.slideshare.net/CynthiaSaracco/presentations
  • 40. © 2016 IBM Corporation40 Supplemental
  • 41. © 2016 IBM Corporation41 Big SQL architecture § Head (coordinator / management) node − Listens for JDBC/ODBC connections − Compiles and optimizes queries − Optionally stores user data in a DB2-compatible table (single node only). Useful for some reference data. § Big SQL worker processes reside on compute nodes (some or all) § Worker nodes stream data between each other as needed § Workers can spill large data sets to local disk if needed − Allows Big SQL to work with data sets larger than available memory
  • 42. © 2016 IBM Corporation42 CPU Profile for Big SQL vs. Spark SQL Hadoop-DS @ 100TB, 4 Concurrent Streams Spark SQL uses almost 3x more system CPU. These are wasted CPU cycles. Average CPU utilization: Big SQL 76.4%, Spark SQL 88.2%
  • 43. © 2016 IBM Corporation43 I/O Profile for Big SQL vs. Spark SQL Hadoop-DS @ 100TB, 4 Concurrent Streams Spark SQL required 3.6X more reads and 9.5X more writes. Big SQL can drive peak I/O nearly 2X more