Using your DB2 SQL Skills with Hadoop and Spark

© 2016 IBM Corporation
Using your DB2 skills with Hadoop and Spark
Presented to TRIDEX DB2 Users Group, June 2017
C. M. Saracco, IBM Silicon Valley Lab
https://www.slideshare.net/CynthiaSaracco/presentations

© 2016 IBM Corporation2
Executive summary
§ About Apache Hadoop and Spark
− Popular open source technologies for working with Big Data
• Clustered computing > scalability
• Varied data > no pre-set structure or schema requirements
− Hadoop: distributed file system (storage), MapReduce API, . . .
− Spark: in-memory data processing (speed), built-in libraries, . . .
§ About Big SQL
− DB2-compatible query engine for Hadoop data (IBM or Hortonworks distributions)
− Based on decades of IBM R&D investment in RDBMS technology, including database
parallelism and query optimization. Strong runtime performance for analytical workloads.
§ Some ways to leverage DB2 SQL skills
− Create / manage / query “local” or distributed tables in Hadoop
− Query / join Hadoop data with DB2, Oracle, Teradata, etc. data via query federation
− Leverage Spark to query and manipulate Big SQL or DB2 data
− Leverage Big SQL to initiate Spark jobs and analyze result

Agenda
§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities
§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation
§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL
§ Performance: 100TB benchmark summary
§ Summary

Agenda
− Market drivers
− Query tables
§ Summary

Business leaders frequently make
decisions based on information they
don’ttrust, or don’t have1in3
83%
of CIOs cited “Business
intelligence and analytics” as part
of their visionary plans
to enhance competitiveness
Business leaders say they don’t
have access to the information they
need to do their jobs
1in2
of CEOs need to do a better job
capturing and understanding
information rapidly in order to
make swift business decisions
60%
… and organizations
need deeper insights
Information is at the center
of a new wave of opportunity…
4 million “likes” per
minute
300,000 tweets
per minute
150 million emails
per minute 2.78 million video
views per minute
2.5 TB per day
per A350 plane
> 1 PB per day
gas turbines
1 ZB = 1 billion TB

Big Data adoption (study results)
2012 to 2014 2015
22%-27% 25% 0%
change
2012 to 2014 2015
24%-26% 10% 250%
decrease
Educate:
Learning about
big data capabilities
2012 to 2014 2015
43%-47% 53% 125%
increase
Explore:
Exploring internal use cases and
developing a strategy
Engage:
Implementing infrastructure and
running pilot activities
2012 to 2014 2015
5%-6% 13% 210%
increase
Execute:
Using big data and analytics
pervasively across the enterprise
2015 IBV study “Analytics: The Upside of Disruption” (ibm.biz/w3_2015analytics)

Return on investment period for big data and analytics projects
as reported by respondents
Big Data ROI often < 18 months
2015 IBV study “Analytics: The Upside of Disruption” (ibm.biz/w3_2015analytics)

§ Both open source Apache projects
− Exploit distributed computing environments
− Enable processing of large volumes of varied data
§ Hadoop
− Inspired by Google technologies (MapReduce, GFS)
− Originally designed for batch-oriented, read-intensive applications
− “Core” consists of distributed file system, MapReduce, job scheduler, utilities
− Complementary projects span data warehousing, workflow management,
columnar data storage, activity monitoring, . . .
§ Spark
− Began as a UC Berkeley project
− Fast, general-purpose engine for working with Big Data in memory
− Popular built-in libraries for machine learning, streaming data, query (SQL), . . .
− No built-in storage. Interfaces to Hadoop, other stores
About Hadoop and Spark

IBM contributions: Hadoop and Spark
Snapshots taken Jan. 2017.
Latest content available online
via Apache dashboards.
IOP relates to Hadoop; STC
relates to Spark.

What is Big SQL?
SQL-based
Application
Big SQL Engine
Data Storage
IBM data server
client
SQL MPP Run-time
HDFS
§ Comprehensive, standard SQL for Hadoop
– SELECT: joins, unions, aggregates, subqueries . . .
– UPDATE/DELETE (HBase-managed tables)
– GRANT/REVOKE, INSERT … INTO
– SQL procedural logic (SQL PL)
– Stored procs, user-defined functions
– IBM data server JDBC and ODBC drivers
§ Optimization and performance
– IBM MPP engine (C++) replaces Java MapReduce layer
– Continuous running daemons (no start up latency)
– Message passing allow data to flow between nodes
without persisting intermediate results
– In-memory operations with ability to spill to disk (useful
for aggregations, sorts that exceed available RAM)
– Cost-based query optimization with 140+ rewrite rules
§ Various storage formats supported
– Text (delimited), Sequence, RCFile, ORC, Avro, Parquet
– Data persisted in DFS, Hive, HBase
– No IBM proprietary format required
§ Integration with RDBMSs via LOAD, query
federation
IBM Open Platform or
Hortonworks Data Platform

Agenda
− Market drivers
− Query tables
§ Summary

§ Big SQL
− Easy on-ramp to Hadoop for DB2 SQL professionals
− Create query-ready data lake
− Offload “cold” RDBMS warehouse data to Hadoop
− . . . .
§ Some ways to use Big SQL . . .
− Create tables
− Load / insert data
− Execute complex queries
− Exploit various DB2 features: UDFs, EXPLAIN, workload management, Oracle /
Netezza SQL compatibility. . . .
− Exploit various Hadoop features: Hive, HBase, SerDes, . . .
About Hadoop and Big SQL

Invocation options
§ Command-line interface:
Java SQL Shell (JSqsh)
§ Web tooling (Data Server
Manager)
§ Tools that support IBM
JDBC/ODBC driver

Creating a Big SQL table
§ Standard CREATE TABLE DDL with extensions
create hadoop table users
(
id int not null primary key,
office_id int null,
fname varchar(30) not null,
lname varchar(30) not null)
row format delimited
fields terminated by '|'
stored as textfile;
Worth noting:
• “Hadoop” keyword creates table in HDFS
• Row format delimited and textfile formats are default
• Constraints not enforced (but useful for query optimization)
• Examples in these charts focus on HDFS storage, both within or external to Hive
warehouse. HBase examples provided separately

CREATE VIEW
§ Standard SQL syntax
create view my_users as
select fname, lname from biadmin.users where id > 100;

Populating tables via LOAD
§ Typically best runtime performance
§ Load data from local or remote file system
load hadoop using file url
'sftp://myID:myPassword@myServer.ibm.com:22/install-
dir/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt’ with SOURCE PROPERTIES
('field.delimiter'='t') INTO TABLE gosalesdw.GO_REGION_DIM overwrite;
§ Loads data from RDBMS (DB2, Netezza, Teradata, Oracle, MS-SQL,
Informix) via JDBC connection
load hadoop
using jdbc connection url 'jdbc:db2://some.host.com:portNum/sampledb'
with parameters (user='myID', password='myPassword')
from table MEDIA columns (ID, NAME)
where 'CONTACTDATE < ''2012-02-01'''
into table media_db2table_jan overwrite
with load properties ('num.map.tasks' = 10);

Populating tables via INSERT
§ INSERT INTO . . . SELECT FROM . . .
− Parallel read and write operations
CREATE HADOOP TABLE IF NOT EXISTS big_sales_parquet
( product_key INT NOT NULL, product_name VARCHAR(150),
Quantity INT, order_method_en VARCHAR(90) )
STORED AS parquetfile;
-- source tables do not need to be in Parquet format
insert into big_sales_parquet
SELECT sales.product_key, pnumb.product_name, sales.quantity, meth.order_method_en
FROM sls_sales_fact sales, sls_product_dim prod,sls_product_lookup pnumb,
sls_order_method_dim meth
WHERE
pnumb.product_language='EN'
AND sales.product_key=prod.product_key
AND prod.product_number=pnumb.product_number
AND meth.order_method_key=sales.order_method_key
and sales.quantity > 5500;
§ INSERT INTO . . . VALUES(. . . )
− Not parallelized. 1 file per INSERT. Not recommended except for quick tests
CREATE HADOOP TABLE foo col1 int, col2 varchar(10));
INSERT INTO foo VALUES (1, ‘hello’);

CREATE . . . TABLE . . . AS SELECT . . .
§ Create a Big SQL table based on contents of other table(s)
§ Source tables can be in different file formats or use different
underlying storage mechanisms
-- source tables in this example are external (just DFS files)
CREATE HADOOP TABLE IF NOT EXISTS sls_product_flat
( product_key INT NOT NULL
, product_line_code INT NOT NULL
, product_type_key INT NOT NULL
, product_type_code INT NOT NULL
, product_line_en VARCHAR(90)
, product_line_de VARCHAR(90)
)
as select product_key, d.product_line_code, product_type_key,
product_type_code, product_line_en, product_line_de
from extern.sls_product_dim d, extern.sls_product_line_lookup l
where d.product_line_code = l.product_line_code;

SQL capability highlights
§ Query operations
− Projections, restrictions
− UNION, INTERSECT, EXCEPT
− Wide range of built-in functions (e.g. OLAP)
− Various Oracle, Netezza compatibility items
§ Full support for subqueries
− In SELECT, FROM, WHERE and
HAVING clauses
− Correlated and uncorrelated
− Equality, non-equality subqueries
− EXISTS, NOT EXISTS, IN, ANY,
SOME, etc.
§ All standard join operations
− Standard and ANSI join syntax
− Inner, outer, and full outer joins
− Equality, non-equality, cross join support
− Multi-value join
§ Stored procedures, user-defined
functions, user-defined aggregates
SELECT
s_name,
count(*) AS numwait
FROM
supplier,
lineitem l1,
orders,
nation
WHERE
s_suppkey = l1.l_suppkey
AND o_orderkey = l1.l_orderkey
AND o_orderstatus = 'F'
AND l1.l_receiptdate > l1.l_commitdate
AND EXISTS (
SELECT
*
FROM
lineitem l2
WHERE
l2.l_orderkey = l1.l_orderkey
AND l2.l_suppkey <> l1.l_suppkey
)
AND NOT EXISTS (
SELECT
*
FROM
lineitem l3
WHERE
l3.l_orderkey = l1.l_orderkey
AND l3.l_receiptdate >
l3.l_commitdate
)
AND s_nationkey = n_nationkey
AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name;

Power of standard SQL
§ Big SQL executes all 22 TPC-H queries without modification
§ Big SQL executes all 99 TPC-DS queries without modification
§ Big SQL leverages DB2 query rewrite technology for efficient optimization
SELECT s_name, count(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
AND o_orderkey = l1.l_orderkey
AND o_orderstatus = 'F'
AND l1.l_receiptdate > l1.l_commitdate
AND EXISTS (
SELECT *
FROM lineitem l2
WHERE l2.l_orderkey = l1.l_orderkey
AND l2.l_suppkey <> l1.l_suppkey)
AND NOT EXISTS (
SELECT *
FROM lineitem l3
WHERE l3.l_orderkey = l1.l_orderkey
AND l3.l_receiptdate > l3.l_commitdate)
AND s_nationkey = n_nationkey
AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name
JOIN
(SELECT s_name, l_orderkey, l_suppkey
FROM orders o
JOIN
FROM nation n
JOIN supplier s
ON s.s_nationkey = n.n_nationkey
AND n.n_name = 'INDONESIA'
JOIN lineitem l
ON s.s_suppkey = l.l_suppkey
WHERE l.l_receiptdate > l.l_commitdate) l1
ON o.o_orderkey = l1.l_orderkey
AND o.o_orderstatus = 'F') l2
ON l2.l_orderkey = t1.l_orderkey) a
WHERE (count_suppkey > 1) or ((count_suppkey=1)
AND (l_suppkey <> max_suppkey))) l3
ON l3.l_orderkey = t2.l_orderkey) b
WHERE (count_suppkey is null)
OR ((count_suppkey=1) AND (l_suppkey = max_suppkey))) c
GROUP BY s_name
ORDER BY numwait DESC, s_name
SELECT s_name, count(1) AS numwait
FROM
(SELECT s_name FROM
(SELECT s_name, t2.l_orderkey, l_suppkey,
count_suppkey, max_suppkey
FROM
(SELECT l_orderkey,
count(distinct l_suppkey) as count_suppkey,
max(l_suppkey) as max_suppkey
FROM lineitem
WHERE l_receiptdate > l_commitdate
GROUP BY l_orderkey) t2
RIGHT OUTER JOIN
FROM
(SELECT s_name, t1.l_orderkey, l_suppkey,
count_suppkey, max_suppkey
FROM
(SELECT l_orderkey,
count(distinct l_suppkey) as count_suppkey,
max(l_suppkey) as max_suppkey
FROM lineitem
GROUP BY l_orderkey) t1
Original Query
Re-written query

Query federation = virtualized data access
Transparent
§ Appears to be one source
§ Programmers don’t need to know how /
where data is stored
Heterogeneous
§ Accesses data from diverse sources
High Function
§ Full query support against all data
§ Capabilities of sources as well
Autonomous
§ Non-disruptive to data sources, existing
applications, systems.
High Performance
§ Optimization of distributed queries
SQL tools,
applications Data sources
Virtualized
data

Federation in practice
§ Admin enables
federation
§ Apps connect to Big
SQL database
§ Nicknames look like
tables to the app
§ Big SQL optimizer
creates global data
access plan with cost
analysis, query push
down
§ Query fragments
executed remotely
Nickname
Nickname
Table
Cost-based optimizer
Wrapper
Client library
Wrapper
Client library
Local + Remote
Execution Plans
Remote sources
Federation server (Big SQL)
Native dialect
Connect to bigsql

Joining data across sources

Agenda
− Market drivers
− Query tables
§ Summary

About Spark and Big SQL
§ Easy to query Big SQL (or DB2 LUW) tables through Spark SQL
− See link to self-study lab in “Resources” section
§ Follow typical Spark SQL JDBC data source pattern
− Identify JDBC driver and connection properties
− Load table contents into DataFrame, Spark SQL temporary view
− Execute Spark SQL queries
− Applies to Big SQL tables in Hive warehouse, HBase, or arbitrary HDFS
directory
− Query results can be manipulated via other Spark libraries
§ Technical preview: Launch Spark jobs from Big SQL via UDF

Accessing Big SQL data from Spark shell
// based on BigInsights tech preview release that includes Spark 2.1
// Launch shell with --driver-class-path pointing to JDBC driver .jar
// read data from Big SQL table “t1” and load into a DataFrame
val sampleDF = spark.read.format("jdbc")
.option("url”,"jdbc:db2://yourHost.com:32051/BIGSQL")
.option("dbtable",”yourSchema.t1")
.option("user", "yourID").option("password", "yourPassword")
.load()
// display full contents
sampleDF.show()
// create a Spark SQL temporary view to query
sampleDF.createOrReplaceTempView("v1")
// query the view and display the results
sql("select col1, col3 from v1 where col2 > 100 limit 15”).show()

Technical preview: launch Spark jobs from Big SQL
§ Spark jobs can be invoked from Big SQL using a table UDF
abstraction
§ Example: Call the SYSHADOOP.EXECSPARK built-in UDF to kick
off a Spark job that reads a JSON file stored on HDFS
SELECT *
FROM TABLE(SYSHADOOP.EXECSPARK(
language => 'scala',
class =>
'com.ibm.biginsights.bigsql.examples.ReadJsonFile',
uri =>
'hdfs://host.port.com:8020/user/bigsql/demo.json',
card => 100000)) AS doc
WHERE doc.country IS NOT NULL

Agenda
− Market drivers
− Query tables
https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/
§ Summary

What is TPC-DS?
§ TPC = Transaction Processing Council
− Non-profit corporation (vendor independent)
− Defines various industry driven database benchmarks…. DS = Decision Support
− Models a multi-domain data warehouse environment for a hypothetical retailer
Retail Sales Web Sales Inventory Demographics Promotions
Multiple scale factors:
100GB, 300GB, 1TB, 3TB, 10TB, 30TB and 100TB
99 Pre-Defined
Queries
Query Classes:
Reporting Ad HocIterative
OLAP
Data
Mining

100TB TPC-DS is BIGdata

Benchmark Environment: IBM “F1” Spark SQL Cluster
§ 28 Nodes Total (Lenovo x3640 M5)
§ Each configured as:
• 2 sockets (18 cores/socket)
• 1.5 TB RAM
• 8x 2TB SSD
§ 2 Racks
− 20x 2U servers per rack (42U racks)
§ 1 Switch, 100GbE, 32 ports Mellanox
SN2700

PERFORMANCE
SPARK SQL 2.1 HADOOP-DS @ 100TB: AT A GLANCE
WORKING QUERIES
COMPRESSION
60%SPACE SAVED
WITH
PARQUET
Spark SQL completes more
TPC-DS queries than any other
open source SQL engine for Hadoop
@ 100TB Scale

Query Compliance Through the Scale Factors
§ SQL compliance is important because Business Intelligence tools generate standard SQL
− Rewriting queries is painful and impacts productivity
§ Spark SQL 2.1 can run all 99 TPC-DS
queries but only at lower scale factors
§ Spark SQL Failures @ 100 TB:
− 12 runtime errors
− 4 timeout (> 10 hours)
Spark SQL
§ Big SQL has been successfully executing all
99 queries since Oct 2014
§ IBM is the only vendor that has proven SQL
compatibility at scale factors up to 100TB
Big SQL

Big SQL is 3.2X faster than Spark 2.1
(4 Concurrent Streams)
Big SQL @ 99 queries still
outperforms Spark SQL @ 83 queries

PERFORMANCE
Big SQL 3.2x faster
HADOOP-DS @ 100TB: AT A GLANCE
WORKING QUERIES
CPU (vs Spark)
Big SQL uses 3.7x less CPU
I/O (vs Spark)
Big SQL reads 12x less data
Big SQL writes 30x less data
COMPRESSION
60%SPACE SAVED
WITH PARQUET
AVERAGE CPU
USAGE
76.4%
MAX I/O
THROUGHPUT:
READ 4.4 GB/SEC
WRITE 2.8 GB/SEC

Recommendation: Right Tool for the Right Job
Machine Learning
Simpler SQL
Good Performance
Ideal tool for BI Data Analysts
and production workloads
Ideal tool for Data Scientists
and discovery
Big SQL Spark SQL
Migrating existing
workloads to Hadoop
Security
Many Concurrent Users
Best Performance
Not Mutually Exclusive. Big SQL & Spark SQL can co-exist in the cluster

Agenda
− Market drivers
− Query tables
§ Summary

Summary
§ Big SQL = easy path for DB2 professionals to work with Big Data
§ Runs on popular Hadoop platforms from IBM, Hortonworks
§ Integrates with Spark
§ Compatible with DB2 and ISO SQL
§ Brings high-performance, enterprise-grade query engine to popular
open source Big Data platforms

Want to learn more?
§ Hadoop Dev
https://developer.ibm.com/hadoop/
§ Labs: Big SQL intro, Spark / Big SQL, . . .
https://developer.ibm.com/hadoop/docs/getting-
started/tutorials/big-sql-hadoop-tutorial/
§ 100TB benchmark
https://developer.ibm.com/hadoop/2017/02/07/experiences-
comparing-big-sql-and-spark-sql-at-100tb/
§ This presentation
https://www.slideshare.net/CynthiaSaracco/presentations

Supplemental

Big SQL architecture
§ Head (coordinator / management) node
− Listens to the JDBC/ODBC connections
− Compiles and optimizes the query
− Optionally store user data in DB2-compatible table (single node only). Useful for some reference data.
§ Big SQL worker processes reside on compute nodes (some or all)
§ Worker nodes stream data between each other as needed
§ Workers can spill large data sets to local disk if needed
− Allows Big SQL to work with data sets larger than available memory

CPU Profile for Big SQL vs. Spark SQL
Hadoop-DS @ 100TB, 4 Concurrent Streams
Spark SQL uses almost 3x more systemCPU.
These are wasted CPU cycles.
Average CPU
Utilization: 76.4%
Average CPU
Utilization: 88.2%

I/O Profile for Big SQL vs. Spark SQL
Hadoop-DS @ 100TB, 4 Concurrent Streams
Spark SQL required
3.6X more reads
9.5X more writes
Big SQL can
drive peak I/O nearly
2X more

Using your DB2 SQL Skills with Hadoop and Spark

More Related Content

What's hot (17)

Similar to Using your DB2 SQL Skills with Hadoop and Spark (20)

Recently uploaded (20)

Using your DB2 SQL Skills with Hadoop and Spark