info@rittmanmead.com www.rittmanmead.com @rittmanmead
Using Oracle Big Data SQL to add Hadoop + NoSQL to your Oracle Data Warehouse
Mark Rittman, CTO, Rittman Mead
SQL Celebration Day, Netherlands June 2016
info@rittmanmead.com www.rittmanmead.com @rittmanmead 2
•Many customers and organisations are now running initiatives around “big data”

•Some are IT-led and are looking for cost-savings around data warehouse storage + ETL

•Others are “skunkworks” projects in the marketing department that are now scaling-up

•Projects now emerging from pilot exercises

•And design patterns starting to emerge
Many Organisations are Running Big Data Initiatives
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Gives us an ability to store more data, at more detail, for longer

•Provides a cost-effective way to analyse vast amounts of data

•Hadoop & NoSQL technologies can give us “schema-on-read” capabilities

•There’s vast amounts of innovation in this area we can harness

•And it’s very complementary to Oracle BI & DW
Why is Hadoop of Interest to Us?
info@rittmanmead.com www.rittmanmead.com @rittmanmead 4
•Mark Rittman, Co-Founder of Rittman Mead

‣Oracle ACE Director, specialising in Oracle BI&DW

‣14 Years Experience with Oracle Technology

‣Regular columnist for Oracle Magazine

•Author of two Oracle Press Oracle BI books

‣Oracle Business Intelligence Developers Guide

‣Oracle Exalytics Revealed

‣Writer for Rittman Mead Blog :

http://www.rittmanmead.com/blog

•Email : mark.rittman@rittmanmead.com

•Twitter : @markrittman
About the Speaker
info@rittmanmead.com www.rittmanmead.com @rittmanmead
Flexible Cheap Storage for Logs, Feeds + Social Data
[Diagram: ~$50k Hadoop nodes provide flexible, cheap raw-data storage for call center logs, chat logs, iBeacon logs, website logs, CRM data, transactions, social feeds, demographics and voice + chat transcripts, loaded via real-time feeds, batch and API, and feeding SQL-on-Hadoop business analytics, predictive models and Customer 360 apps]
info@rittmanmead.com www.rittmanmead.com @rittmanmead
Incorporate Hadoop Data Reservoirs into DW Design
info@rittmanmead.com www.rittmanmead.com @rittmanmead
Oracle Big Data Product Architecture
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Oracle Big Data Appliance - Engineered System for running Hadoop alongside Exadata

•Oracle Big Data Connectors - Utility from Oracle for feeding Hadoop data into Oracle

•Oracle Data Integrator EE Big Data Option - Add Spark, Pig data transforms to Oracle ODI

•Oracle BI Enterprise Edition - can connect to Hive, Impala for federated queries

•Oracle Big Data Discovery - data wrangling + visualization tool for Hadoop data reservoirs

•Oracle Big Data SQL - extend Oracle SQL language + processing to Hadoop
Oracle Software Initiatives around Big Data
Hang on though… this is Hadoop, where everything is non-relational.
Isn't SQL on Hadoop somewhat missing the point?
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Warehouse
info@rittmanmead.com www.rittmanmead.com @rittmanmead
Where Can SQL Processing Be Useful with Hadoop?
•Hadoop is not a cheap substitute for enterprise DW platforms - don’t use it like this

•But adding SQL processing and abstraction can help in many scenarios:

• Query access to data stored in Hadoop as an archive

• Aggregating, sorting, filtering and transforming data

• Set-based transformation capabilities for other frameworks (e.g. Spark)

• Ad-hoc analysis and data discovery in real time

• Providing tabular abstractions over complex datatypes
SQL! Though SQL isn't actually relational, according to Chris Date; SQL is just mappings.
Ted Codd used predicate calculus, and there's never been a mainstream relational DBMS.
But SQL is the standard language for RDBMSs, and it's great for set-based transforms & queries.
So… Yes SQL!
info@rittmanmead.com www.rittmanmead.com @rittmanmead 15
•Originally developed at Facebook, now foundational within the Hadoop project

•Allows users to query Hadoop data using SQL-like language

•Tabular metadata layer that overlays files, can interpret semi-structured data (e.g. JSON)

•Generates MapReduce code to return required data

•Extensible through SerDes and Storage Handlers

•JDBC and ODBC drivers for most platforms/tools

•Perfect for set-based access + batch ETL work
Apache Hive : SQL Metadata + Engine over Hadoop
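A minimal HiveQL sketch of the ideas above, assuming a hypothetical HDFS directory of tab-separated web log files; the table name, columns and path are illustrative only:

-- Define a tabular view over raw HDFS files (schema-on-read: the files themselves are not changed)
CREATE EXTERNAL TABLE access_log (
  host       STRING,
  request_ts STRING,
  url        STRING,
  status     INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/flume/weblogs';

-- Query it with familiar SQL; Hive compiles this into MapReduce jobs
SELECT status, COUNT(*) AS hits
FROM   access_log
GROUP  BY status;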
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Hive uses an RDBMS metastore to hold table and column definitions in schemas

•Hive tables then map onto HDFS-stored files (managed vs. external; see the sketch below)

‣Managed tables

‣External tables

•Oracle-like query optimizer, compiler, executor

•JDBC and ODBC drivers, plus CLI etc
16
How Does Hive Translate SQL into MapReduce?
[Diagram: JDBC/ODBC clients, Hue and the CLI connect via the Hive Thrift Server; the parser, planner and execution engine use the metastore and submit MapReduce jobs against HDFS]
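A rough illustration of the managed vs. external table distinction above, with hypothetical names and paths:

-- Managed table: Hive owns the data; DROP TABLE also deletes the underlying HDFS files
CREATE TABLE dim_customer (cust_id INT, name STRING)
STORED AS ORC;

-- External table: Hive only holds the metadata; DROP TABLE leaves the HDFS files in place
CREATE EXTERNAL TABLE src_customer (cust_id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/landing/customers';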
info@rittmanmead.com www.rittmanmead.com @rittmanmead
17
How Does Hive Translate SQL into MapReduce?
hive> select count(*) from src_customer;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_201303171815_0003, Tracking URL = http://localhost.localdomain:50030/jobdetails.jsp…
Kill Command = /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=localhost.localdomain:8021 -kill job_201303171815_0003
2013-04-17 04:06:59,867 Stage-1 map = 0%, reduce = 0%
2013-04-17 04:07:03,926 Stage-1 map = 100%, reduce = 0%
2013-04-17 04:07:14,040 Stage-1 map = 100%, reduce = 33%
2013-04-17 04:07:15,049 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201303171815_0003
OK
25
Time taken: 22.21 seconds
HiveQL query → MapReduce job submitted → Results returned
But Hive is too slow for ad-hoc queries
info@rittmanmead.com www.rittmanmead.com @rittmanmead 19
•Cloudera’s answer to Hive query response time issues

•MPP SQL query engine running on Hadoop, bypasses MapReduce for direct data access

•Mostly in-memory, but spills to disk if required

•Uses Hive metastore to access Hive table metadata

•Similar SQL dialect to Hive - not as rich though, and no support for Hive SerDes, storage handlers etc
Cloudera Impala - Fast, MPP-style Access to Hadoop Data
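Because Impala shares the Hive metastore, a table defined in Hive can be queried from impala-shell without being redefined. A hedged sketch, reusing the hypothetical src_customer table from earlier:

-- impala-shell
INVALIDATE METADATA src_customer;    -- pick up metadata created or changed via Hive
SELECT   country, COUNT(*) AS customers
FROM     src_customer
GROUP BY country
ORDER BY customers DESC
LIMIT 10;                            -- executed by Impala daemons, no MapReduce job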
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Apache Drill is another SQL-on-Hadoop project, one that focuses on schema-free data discovery

•Inspired by Google Dremel; the innovation is querying raw data with schema optional

•Automatically infers and detects schema from semi-structured datasets and NoSQL DBs

•Join across different silos of data e.g. JSON records, Hive tables and HBase databases

•Aimed at different use-cases than Hive - low-latency queries, discovery (think Endeca vs OBIEE)
Apache Drill - SQL for Schema-Free Data Discovery
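For comparison, a small Drill sketch: the query below runs directly over a directory of JSON files with no table definition at all (the dfs workspace and path are illustrative):

-- No CREATE TABLE needed; Drill infers the schema from the JSON documents
SELECT   event_type, COUNT(*) AS events
FROM     dfs.`/data/raw/events/`
GROUP BY event_type;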
info@rittmanmead.com www.rittmanmead.com @rittmanmead 21
•A replacement for Hive, but uses Hive concepts and data dictionary (metastore)

•MPP (Massively Parallel Processing) query engine that runs within Hadoop

‣Uses same file formats, security, resource management as Hadoop

•Processes queries in-memory

•Accesses standard HDFS file data

•Option to use Apache AVRO, RCFile, LZO or Parquet (column-store)

•Designed for interactive, real-time SQL-like access to Hadoop
How Impala Works
[Diagram: OBIEE BI Server and Presentation Server connect through the Cloudera Impala ODBC driver to Impala daemons running alongside HDFS on every node of a multi-node Hadoop cluster]
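To use the Parquet column-store option mentioned above, a table can simply be declared (or copied) with STORED AS PARQUET; a sketch with hypothetical names:

-- Create a Parquet-format copy of an existing table registered in the Hive metastore
CREATE TABLE web_logs_parquet STORED AS PARQUET
AS SELECT * FROM web_logs;

-- Analytic queries now read only the columns they reference
SELECT   source_country, COUNT(DISTINCT cust_id) AS customers
FROM     web_logs_parquet
GROUP BY source_country;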
but sometimes, you need the real thing
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Originally part of Oracle Big Data 4.0 (BDA-only)

‣Also required Oracle Database 12c, Oracle Exadata Database Machine

•Extends Oracle Data Dictionary to cover Hive

•Extends Oracle SQL and SmartScan to Hadoop

•Extends Oracle Security Model over Hadoop

‣Fine-grained access control

‣Data redaction, data masking

‣Uses fast C-based readers where possible (vs. Hive MapReduce generation)

‣Map Hadoop parallelism to Oracle PQ

‣Big Data SQL engine works on top of YARN

‣Like Spark, Tez, MR2
Oracle Big Data SQL
[Diagram: SQL queries run on the Exadata Database Server; SmartScan executes both on the Exadata Storage Servers and, via Oracle Big Data SQL, on the Hadoop cluster]
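One way to picture "extends Oracle Security Model over Hadoop": because the Hive data surfaces as an ordinary Oracle external table, standard database features such as Data Redaction can be applied to it. A minimal sketch, assuming a hypothetical Hive-backed external table MOVIE_CUSTOMERS with an EMAIL column in a hypothetical BDSQL schema:

BEGIN
  -- Redact the e-mail column for all queries against the Hive-backed external table
  DBMS_REDACT.ADD_POLICY(
    object_schema => 'BDSQL',
    object_name   => 'MOVIE_CUSTOMERS',       -- ORACLE_HIVE external table (hypothetical)
    policy_name   => 'redact_customer_email',
    column_name   => 'EMAIL',
    function_type => DBMS_REDACT.FULL,
    expression    => '1=1');
END;
/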
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•As with other next-gen SQL access layers, uses common Hive metastore table metadata

•Leverages Hadoop standard APIs for HDFS file access, metadata integration etc
Leverages Hive Metastore and Hadoop file access APIs
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Brings query-offloading features of Exadata to Oracle Big Data Appliance

•Query across both Oracle and Hadoop sources

•Intelligent query optimisation applies SmartScan close to ALL data

•Use same SQL dialect across both sources

•Apply same security rules, policies, user access rights across both sources
Extending SmartScan, and Oracle SQL, Across All Data
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Read data from HDFS Data Node

‣Direct-path reads

‣C-based readers when possible

‣Use native Hadoop classes otherwise

•Translate bytes to Oracle

•Apply SmartScan to Oracle bytes

‣Apply filters

‣Project columns

‣Parse JSON/XML

‣Score models
How Big Data SQL Accesses Hadoop (HDFS) Data
[Diagram: on each Hadoop Data Node, the Big Data SQL Server's External Table Services (RecordReader, SerDe) read blocks from disk, then Smart Scan filters and projects the bytes before results are returned]
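As an illustration of the "parse JSON/XML" step: in a query like the hedged sketch below, the JSON extraction and filtering can be evaluated by Smart Scan on the data nodes, so only matching, projected values travel back to the database (table and column names are hypothetical):

SELECT JSON_VALUE(doc, '$.custId')  AS cust_id,
       JSON_VALUE(doc, '$.movieId') AS movie_id
FROM   movie_log_json                           -- ORACLE_HIVE external table, one JSON document per row
WHERE  JSON_VALUE(doc, '$.activity') = 'purchase';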
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•“Query Franchising – dispatch of query processing to self-similar compute agents on disparate systems without loss of operational fidelity”
•Contrast with OBIEE which provides a query federation capability over Hadoop

•Sends sub-queries to each data source

•Relies on each data source’s native query engine, and resource management

•Query franchising using Big Data SQL ensures consistent resource management 

•And contrast with SQL translation tools (i.e. Oracle SQL to Impala)

•Either limits Oracle SQL to the subset that Hive, Impala supports

•Or translation engine has to transform each Oracle feature into Hive, Impala SQL
Query Franchising vs. SQL Translation / Federation
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Oracle Database 12c 12.1.0.2.0 with Big Data SQL option can view Hive table metadata

‣Linked by Exadata configuration steps to one or more BDA clusters

•DBA_HIVE_TABLES and USER_HIVE_TABLES expose Hive metadata

•Oracle SQL*Developer 4.0.3, with Cloudera Hive drivers, can connect to Hive metastore
View Hive Table Metadata in the Oracle Data Dictionary
SQL> col database_name for a30
SQL> col table_name for a30
SQL> select database_name, table_name
2 from dba_hive_tables;
DATABASE_NAME TABLE_NAME
------------------------------ ------------------------------
default access_per_post
default access_per_post_categories
default access_per_post_full
default apachelog
default categories
default countries
default cust
default hive_raw_apache_access_log
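The companion DBA_HIVE_COLUMNS dictionary view can be queried in the same way to see how Hive column types map onto Oracle datatypes; a sketch assuming the column names documented for Big Data SQL:

SQL> select table_name, column_name, hive_column_type, oracle_column_type
  2  from   dba_hive_columns
  3  where  table_name = 'access_per_post';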
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Big Data SQL accesses Hive tables through external table mechanism

‣ORACLE_HIVE external table type imports Hive metastore metadata

‣ORACLE_HDFS requires metadata to be specified

•Access parameters cluster and tablename specify Hive table source and BDA cluster
Hive Access through Oracle External Tables + Hive Driver
CREATE TABLE access_per_post_categories(
hostname varchar2(100),
request_date varchar2(100),
post_id varchar2(10),
title varchar2(200),
author varchar2(100),
category varchar2(100),
ip_integer number)
organization external
(type oracle_hive
default directory default_dir
access parameters(com.oracle.bigdata.tablename=default.access_per_post_categories));
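For comparison, an ORACLE_HDFS external table points directly at HDFS files, so the metadata has to be spelled out rather than inherited from the Hive metastore. A hedged sketch; the column definition and HDFS path are illustrative:

CREATE TABLE movie_log_raw (
  click varchar2(4000))
organization external
(type oracle_hdfs
 default directory default_dir
 location ('/user/oracle/movielog/')   -- HDFS directory, not a local filesystem path
)
reject limit unlimited;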
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Run normal Oracle SQL from the Oracle Database server

•Big Data SQL query franchising then uses agents on Hadoop nodes to query and return
data independent of YARN scheduling; Oracle Database combines and returns full results
Running Oracle SQL on Hadoop Data Nodes
SELECT w.sess_id,w.cust_id,c.name
FROM web_logs w, customers c
WHERE w.source_country = 'Brazil'
AND c.customer_id = w.cust_id
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•OBIEE can access Hadoop data via Hive, but it’s slow 

•(Impala only has a subset of Oracle SQL capabilities)

•Big Data SQL presents all data to OBIEE as Oracle data, with full advanced analytic
capabilities across both platforms
Example : Combining Hadoop + Oracle Data for BI
[Screenshots: a Hive weblog activity table joined to Oracle dimension lookup tables, with the combined output shown in report form]
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Not all functions can be offloaded to Hadoop tier

•Even for non-offloadable operations Big Data SQL will perform column pruning and
datatype conversion (which saves a lot of resources)

•Other operations (non-offloadable) will be done on the database side

•Requires Oracle Database 12.1.0.2 + patchset, and per-disk licensing for Big Data SQL

•You need an Oracle Big Data Appliance, and Oracle Exadata, to use Big Data SQL
Restrictions when using Oracle Big Data SQL
SELECT NAME FROM v$sqlfn_metadata WHERE offloadable = 'YES'
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•From Big Data SQL 3.0, commodity hardware can be used instead of BDA and Exadata

•Oracle Database 12.1.0.2 on x86_64 with Jan/Apr Proactive Bundle Patches

•Cloudera CDH 5.5 or Hortonworks HDP 2.3 on RHEL/OEL6

•See MOS Doc ID 2119369.1 - note cannot mix Engineered/Non-Engineered platforms
Running Big Data SQL on Commodity Hardware
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•No functional differences when running Big Data SQL on commodity hardware

•External table capability lives with the database, and the performance functionality with
the BDS cell software.

•All BDS features (SmartScan, offloading, storage indexes etc) still available

•But hardware can be a factor now, as we're pushing processing down and data up the wire

•1Gb Ethernet can be too slow, 10Gb is a minimum (i.e. no InfiniBand)

•If you run on an undersized system you may see bottlenecks on the DB side. 
Big Data SQL on Commodity Hardware Considerations
But Hadoop is more than simple HDFS files
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Warehouse
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Subsequent releases of Big Data SQL have extended its Hadoop capabilities

•Support for Hive storage handlers (HBase, MongoDB etc)

•Hive partition elimination

•Better, more efficient access to Hadoop data

•Storage Indexes

•Predicate Push-Down for Parquet, ORC, HBase, Oracle NoSQL

•Bloom Filters

•Coming with Oracle Database 12.2

•Big Data-aware optimizer

•Dense Bloom Filters

•Oracle managed Big Data partitions
Going beyond Fast Unified Query Access to HDFS Data
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Hive Storage handlers give Hive the ability
to access data from non-HDFS sources

•MongoDB

•HBase

•Oracle NoSQL database

•Run HiveQL queries against NoSQL DBs

•From BDS1.1, Hive storage handlers can
be used with Big Data SQL

•Only MongoDB, HBase and NoSQL
currently “supported”

•Others should work but not tested
Big Data SQL and Hive Storage Handlers
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Create Hive table over HBase database as normal

•Typically done to add INSERT and DELETE capabilities to Hive, for DW dimension ETL

•Create Oracle external table as normal, using ORACLE_HIVE driver
Use of Hive Storage Handlers Transparent to BDS
CREATE EXTERNAL TABLE tablename (colname coltype[, colname coltype,...])
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'serialization.format'='1',
  'hbase.columns.mapping'=':key,value:key,value:…');   -- column mapping truncated on the original slide
CREATE TABLE tablename(colname colType[, colname colType...])
ORGANIZATION EXTERNAL
(TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS
(access parameters)
)
REJECT LIMIT UNLIMITED;
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•From Big Data SQL 2.0, Storage Indexes
are automatically created in Big Data SQL
agents

•Check index before reading blocks – Skip
unnecessary I/Os

•An average of 65% faster than BDS 1.x

•Up to 100x faster for highly selective
queries

•Columns in SQL are mapped to fields in
the HDFS file via External Table Definitions

•Min / max value is recorded for each
HDFS Block in a storage index
Big Data SQL Storage Indexes
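Nothing has to be declared to benefit; a selective filter such as the hedged sketch below lets the agents skip any HDFS block whose recorded min/max range for the column cannot contain the value (names are hypothetical):

SELECT COUNT(*)
FROM   web_logs_ext                 -- ORACLE_HIVE external table over HDFS data
WHERE  cust_id = 1234567;           -- highly selective predicate
-- blocks whose CUST_ID min/max range excludes 1234567 are never read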
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Hadoop supports predicate push-down through several mechanisms (filetypes, Hive
partition pruning etc)

•Original BDS 1.0 supported Hive predicate push-down as part of SmartScan

•BDS 3.0 extends this by pushing SARGable (Search ARGument ABLE) predicates

•Into Parquet and ORCFile to reduce I/O when reading files from disk

•Into HBase and Oracle NoSQL Database to drive subscans of data from the remote DB

•Oracle Database 12.2 will add more optimisations

•Columnar caching

•Big Data-aware query optimizer

•Managed Hadoop partitions

•Dense Bloom Filters
Extending Predicate Push-Down Beyond Hive
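A sketch of what that means in practice, assuming a hypothetical Hive table stored as Parquet and exposed through an ORACLE_HIVE external table: the SARGable predicates below can be pushed into the Parquet reader, so row groups that cannot match are not read at all.

SELECT sold_date, SUM(amount) AS revenue
FROM   web_sales_ext                      -- ORACLE_HIVE external table over a Parquet-format Hive table
WHERE  sold_date >= DATE '2016-01-01'     -- SARGable: evaluated inside the Parquet scan
AND    country   = 'NL'
GROUP  BY sold_date;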
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Typically a one-way street - queries run in Hadoop but results delivered through Oracle

•What if you want to load data into Hadoop, update data, do Hadoop>Hadoop transforms?

•Still requires formal Hive metadata, whereas direction is towards Drill & schema-free queries

•What if you have other RDBMSs as well as Oracle RDBMS? 

•Trend is towards moving all high-end analytic workloads into Hadoop - BDS is Oracle-only

•Requires Oracle 12c database, no 11g support

•And cost … BDS is $3k/Hadoop disk drive

•Can cost more than an Oracle BDA

•High-end, high-cost Oracle-centric solution

•of course!
… So What’s the Catch?
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Rich, comprehensive SQL access to all enterprise data

•Extend Oracle security, advanced analytic features and metadata across Hadoop & NoSQL
Oracle Big Data SQL Vision : Unified Query
http://www.rittmanmead.com
info@rittmanmead.com www.rittmanmead.com @rittmanmead
Using Oracle Big Data SQL to add Hadoop + NoSQL to your Oracle Data Warehouse
Mark Rittman, CTO, Rittman Mead
SQL Celebration Day, Netherlands June 2016
