Running Cognos on Hadoop

Cost Effective, Highly Scalable, High Speed

Agenda
• Introduction
• Running Cognos on Hadoop
– Hadoop Overview
– Hive Overview
– Performance
– BigSheets Demo
• Additional Resources
Copyright 2016 Senturus, Inc. All Rights Reserved

Introduction: Today’s Presenters
Paul Yip
BigInsights Product Manager
IBM Analytics
John Peterson
CEO and Co-Founder
Senturus, Inc.

Presentation Slide Deck on www.senturus.com

Resource Library
The purpose of Senturus is to make you
successful with business analytics.
We host dozens of live webinars every
year and offer a comprehensive library of
recorded webinars, demos, white papers,
presentations, case studies, and reviews
of new software releases on our website.
Our content is constantly updated, so
visit us often to see what’s new in the
industry.
www.senturus.com/resources/

Hear the Recording
This slide deck is from the webinar: Running Cognos on
Hadoop: Cost Effective, Highly Scalable, High Speed
To view the FREE recording of the presentation, and
download this deck, go to:
www.senturus.com/resources/running-cognos-on-hadoop/

THE CHALLENGES OF DATA TODAY
… AND THE TOOLS TO CONFRONT THEM

THE BIG MATCH UP
The 4 V’s of Data
• Volume
• Variety
• Veracity
• Velocity
MEETS
Business Analytics*
• Standard (highly formatted) Reports
• Dashboards
• Ad Hoc Analysis
• Alerts, etc.
• Predictive Analytics
* Existing & new systems

SO WHAT YOU NEED IS…
• Virtually unlimited, low-cost Staging Area that accepts all data formats
• Easy way to Explore raw data
• Low-cost Archive for past or less used data
• Repository for transformed data (subset) that fully supports queries from standard BI tools (typically SQL)

DID YOU KNOW?
… that IBM is a major Hadoop stack vendor
* Other Hadoop vendors include: Amazon, Microsoft, Intel, Pivotal/EMC, Teradata…

POLL
How is your organization combining SQL, standard BI tools and Hadoop?
• Via HIVE
• Via a “value-add” tool like Impala or BigSQL
• Using Hadoop, but not SQL against it
• Not using Hadoop
• Don’t know

IBM BIGINSIGHTS & COGNOS
A POSSIBLE SOLUTION

© 2015 IBM Corporation
IBM BigInsights and Cognos
Hadoop Patterns
Paul Yip
BigInsights Product Manager
IBM Toronto Software Lab
ypaul@ca.ibm.com

What is Hadoop?
• General Formula
1. Bunch of commodity servers (nodes) with internal disk
– Example: 12 x 6TB disk = 72TB per node
2. Network them and install Hadoop
– Example: 20 nodes x 72TB = 1440 TB cluster
3. Result: A big file system that also runs analytics
• Features
– Significantly lower cost than SAN
– End users / applications just see “files”
– Cluster and data are resilient
– Add performance / capacity by adding more nodes
[Diagram: File1 is split into blocks a, b, c and d; three copies of each block are spread across the DataNodes, alongside a NameNode]
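The sizing arithmetic and the block layout in the diagram can be sketched in a few lines of Java. This is a rough illustration only: the replication factor of 3 is the HDFS default and matches the three copies per block shown in the diagram, while the round-robin placement and the names `ClusterSizing` and `placeBlocks` are assumptions made for this sketch, not how HDFS actually chooses nodes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ClusterSizing {
    // Raw capacity of one node in TB: number of disks times TB per disk.
    static int rawPerNodeTb(int disksPerNode, int tbPerDisk) {
        return disksPerNode * tbPerDisk;
    }

    // Raw capacity of the whole cluster in TB.
    static int rawClusterTb(int nodes, int perNodeTb) {
        return nodes * perNodeTb;
    }

    // Space left for unique data when every block is stored `replication` times.
    static int usableTb(int rawTb, int replication) {
        return rawTb / replication;
    }

    // Toy placement (an assumption, not the real HDFS policy):
    // copy i of block number b goes to node (b + i) mod nodes.
    // Copies land on distinct nodes as long as replication <= nodes.
    static Map<String, List<Integer>> placeBlocks(List<String> blocks, int nodes, int replication) {
        Map<String, List<Integer>> placement = new LinkedHashMap<>();
        for (int b = 0; b < blocks.size(); b++) {
            List<Integer> holders = new ArrayList<>();
            for (int i = 0; i < replication; i++) {
                holders.add((b + i) % nodes);
            }
            placement.put(blocks.get(b), holders);
        }
        return placement;
    }

    public static void main(String[] args) {
        int perNode = rawPerNodeTb(12, 6);    // 72 TB per node, as on the slide
        int raw = rawClusterTb(20, perNode);  // 1440 TB raw, as on the slide
        System.out.println("Raw: " + raw + " TB, usable at 3 copies: " + usableTb(raw, 3) + " TB");
        System.out.println(placeBlocks(List.of("a", "b", "c", "d"), 4, 3));
    }
}
```

Under these assumptions, the 1440 TB of raw disk holds roughly 480 TB of unique data, which is the price of the resilience listed under Features.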

Distributed Analytics Example: MapReduce
• MapReduce computation model
– Data stored in a distributed file system spanning many inexpensive computers
– Bring function to the data
– Distribute application to the compute resources where the data is stored
– Scalable to thousands of nodes and petabytes of data
MapReduce Application
1. Map Phase (break job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce Phase (boil all output down to a single result set)
Return a single result set
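import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Both classes are nested in an enclosing job class that the slide does not show.
// Mapper: emits (word, 1) for every token in the input value.
public static class TokenizerMapper
    extends Mapper<Object,Text,Text,IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(Object key, Text val, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(val.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
// Reducer: sums the counts emitted for each word.
public static class IntSumReducer
    extends Reducer<Text,IntWritable,Text,IntWritable> {
  private IntWritable result = new IntWritable();
  public void reduce(Text key,
      Iterable<IntWritable> val, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : val) {
      sum += v.get();
      . . .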
Distribute map tasks to cluster (Hadoop Data Nodes)
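The three phases can also be walked through in plain Java, without a cluster. This is a single-process sketch under the assumption that each input string stands in for one block on a data node; `LocalWordCount` and its method names are made up for this illustration and are not part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class LocalWordCount {
    // Map phase: emit a (word, 1) pair for every token in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(split);
        while (itr.hasMoreTokens()) {
            pairs.add(Map.entry(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle: group the interim pairs by key so all values for a word end up together.
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    // Run map on every split, then shuffle, then reduce to a single result set.
    static Map<String, Integer> run(List<String> splits) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String split : splits) {
            mapOutputs.add(map(split));
        }
        return reduce(shuffle(mapOutputs));
    }

    public static void main(String[] args) {
        // Prints {big=3, cluster=1, data=1, sql=1}
        System.out.println(run(List.of("big data big cluster", "big sql")));
    }
}
```

On a real cluster the map calls run in parallel on the nodes that hold the data, and the shuffle moves the interim pairs over the network; here everything happens in one loop purely to show the data flow.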

Hive provides a SQL interface to MapReduce
[Diagram: SQL → Hive → Map → Reduce → Output]
• The first SQL interface for Hadoop data
• De-facto standard for SQL on Hadoop
• Ships with all major Hadoop distributions

SQL on Hadoop Matters for Big Data Analytics
For BI Tools like Cognos
Visualizations from Cognos 10.2.2

Sounds great! But is there a…
DOWN SIDE?

Hive – Joins in MapReduce
• For joins, MR is used to group data together at the same reducer based upon the join key
– Mappers read blocks from each “table” in the join
– The <key> is the value of the join key, the <value> is the record to be joined
– Reducer receives a mix of records from each table with the same join key
– Reducers produce the results of the join
[Diagram: map tasks read blocks of the employees and depts tables; one reducer per join key (dept 1, dept 2, dept 3)]
select e.fname, e.lname, d.dept_name
from employees e, depts d
where e.salary > 30000
and d.dept_id = e.dept_id
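A reduce-side join like the one described for this query can be sketched in plain Java. The sample rows are invented, the class and method names are hypothetical, and applying the salary filter on the map side is an assumption of this sketch rather than a statement about the plan Hive generates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceSideJoin {
    // Hypothetical row types for the two tables in the query.
    static final class Employee {
        final String fname, lname;
        final int salary, deptId;
        Employee(String fname, String lname, int salary, int deptId) {
            this.fname = fname; this.lname = lname; this.salary = salary; this.deptId = deptId;
        }
    }

    static final class Dept {
        final int deptId;
        final String deptName;
        Dept(int deptId, String deptName) {
            this.deptId = deptId; this.deptName = deptName;
        }
    }

    // Everything one reducer receives for a single join key, split by source table.
    static final class Group {
        final List<Employee> employees = new ArrayList<>();
        final List<Dept> depts = new ArrayList<>();
    }

    // Map + shuffle: key every record by dept_id so matching rows meet at one reducer.
    static Map<Integer, Group> mapAndShuffle(List<Employee> employees, List<Dept> depts) {
        Map<Integer, Group> byDeptId = new TreeMap<>();
        for (Employee e : employees) {
            if (e.salary > 30000) { // filter applied map-side (assumption of this sketch)
                byDeptId.computeIfAbsent(e.deptId, k -> new Group()).employees.add(e);
            }
        }
        for (Dept d : depts) {
            byDeptId.computeIfAbsent(d.deptId, k -> new Group()).depts.add(d);
        }
        return byDeptId;
    }

    // Reduce: pair every employee with every dept that shares the join key.
    static List<String> reduce(Group g) {
        List<String> rows = new ArrayList<>();
        for (Employee e : g.employees) {
            for (Dept d : g.depts) {
                rows.add(e.fname + " " + e.lname + " | " + d.deptName);
            }
        }
        return rows;
    }

    static List<String> join(List<Employee> employees, List<Dept> depts) {
        List<String> results = new ArrayList<>();
        for (Group g : mapAndShuffle(employees, depts).values()) {
            results.addAll(reduce(g));
        }
        return results;
    }

    public static void main(String[] args) {
        List<Employee> employees = List.of(
            new Employee("Ann", "Lee", 50000, 2),
            new Employee("Bob", "Ray", 20000, 1),
            new Employee("Cy", "Moss", 45000, 1));
        List<Dept> depts = List.of(new Dept(1, "Sales"), new Dept(2, "Finance"), new Dept(3, "Legal"));
        // Prints [Cy Moss | Sales, Ann Lee | Finance]
        System.out.println(join(employees, depts));
    }
}
```

The point to notice is that every record from both tables has to travel through the shuffle before a single joined row is produced, which is where much of the cost of a MapReduce join comes from.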

N-way Joins in MapReduce
• For N-way joins involving different join keys, multiple jobs are used
select e.fname, e.lname, d.dept_name, p.phone_type, p.phone_number
from employees e, depts d, emp_phones p
where e.salary > 30000
and d.dept_id = e.dept_id
and p.emp_id = e.emp_id
[Diagram: a first job maps blocks of employees and depts and reduces by dept_id (dept 1, dept 2, dept 3), writing temp files; a second job maps the temp files and emp_phones and reduces by emp_id (emp_id 1, emp_id 2, … emp_id N) to produce the results]
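The chaining of jobs for the three-table query can be sketched as two calls to the same simulated join job, with the first job's output playing the role of the temp files. Rows are modeled as string maps for brevity; the class `ChainedJoins`, its method names and the sample data in the usage note are all hypothetical, and every row is assumed to contain the join column.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ChainedJoins {
    // Shuffle step: group rows by the value of the join column.
    static Map<String, List<Map<String, String>>> groupBy(List<Map<String, String>> rows, String key) {
        Map<String, List<Map<String, String>>> groups = new TreeMap<>();
        for (Map<String, String> row : rows) {
            groups.computeIfAbsent(row.get(key), k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    // One simulated MapReduce job: group both inputs by joinKey, then merge rows per key.
    static List<Map<String, String>> joinJob(List<Map<String, String>> left,
                                             List<Map<String, String>> right,
                                             String joinKey) {
        Map<String, List<Map<String, String>>> leftGroups = groupBy(left, joinKey);
        Map<String, List<Map<String, String>>> rightGroups = groupBy(right, joinKey);
        List<Map<String, String>> out = new ArrayList<>();
        for (Map.Entry<String, List<Map<String, String>>> group : leftGroups.entrySet()) {
            for (Map<String, String> l : group.getValue()) {
                for (Map<String, String> r : rightGroups.getOrDefault(group.getKey(), List.of())) {
                    Map<String, String> merged = new LinkedHashMap<>(l);
                    merged.putAll(r);
                    out.add(merged);
                }
            }
        }
        return out;
    }

    // Two jobs are needed because the query joins on two different keys.
    static List<Map<String, String>> run(List<Map<String, String>> employees,
                                         List<Map<String, String>> depts,
                                         List<Map<String, String>> empPhones) {
        List<Map<String, String>> wellPaid = new ArrayList<>();
        for (Map<String, String> e : employees) {
            if (Integer.parseInt(e.get("salary")) > 30000) {
                wellPaid.add(e);
            }
        }
        // Job 1: join on dept_id; the output stands in for the temp files.
        List<Map<String, String>> temp = joinJob(wellPaid, depts, "dept_id");
        // Job 2: re-read the temp output and join it with emp_phones on emp_id.
        return joinJob(temp, empPhones, "emp_id");
    }
}
```

Calling `ChainedJoins.run` with one qualifying employee, one matching department and two phone rows for that employee yields two result rows. On a cluster, the intermediate list would be written to and re-read from the file system between the jobs, so each extra join key adds a full job's worth of I/O and startup cost.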

IBM BigInsights

Hive is Really 3 Things…
Storage Format, Metastore, and Execution Engine
Applications
SQL Execution Engine: Hive (open source), on MapReduce
Hive Storage Model (open source): CSV, Tab Delim., Parquet, RC, Others…
Hive Metastore (open source)

Big SQL Preserves Open Source Foundation
Leverages Hive metastore and storage formats.
No lock-in. Data is part of Hadoop, not Big SQL. Fall back to the open source Hive engine at any time.
Applications
SQL Execution Engines: IBM Big SQL (IBM), Hive (open source)
Hive Storage Model (open source): CSV, Tab Delim., Parquet, RC, Others…
Hive Metastore (open source)

IBM First/Only to Produce Audited Benchmark
Hadoop-DS (based on TPC-DS) / Oct 2014
• Letters of attestation are available for both Hadoop-DS benchmarks at 10TB and 30TB scale
• InfoSizing, Transaction Processing Performance Council Certified Auditors, verified both IBM results as well as results on Cloudera Impala and Hortonworks Hive
• These results are for a non-TPC benchmark. A subset of the TPC-DS Benchmark standard requirements was implemented

Performance Test – Hadoop-DS (based on TPC-DS)
20 (Physical Node) Cluster
• TPC-DS stands for Transaction Processing Performance Council – Decision Support (workload), which is an industry standard benchmark for SQL
Hive 1.2.1: IBM Open Platform V4.1, 20 Nodes
Big SQL V4.1: IBM Open Platform V4.1, 20 Nodes
Updated Oct 2015 Results

But first … is performance everything?

Big SQL runs more SQL out-of-box
Porting Effort:
Big SQL 4.1: 1 hour
Hive / Spark SQL 1.5.0: 3-4 weeks
Big SQL is the only engine that can execute all 99 queries with minimal porting effort

Cognos & Hadoop Lessons Learned
Notes from Cognos Development
HIVE
• Very restrictive with respect to join predicates
• Support of SQL has a history of many limitations, such as:
– Limited set operation (union), which had problems
– Lack of usable SQL-OLAP
• Resulting in more local processing
IMPALA
• Join restrictions are partially lifted
– Try to re-write using CROSS JOIN (but not for outer joins)
– Cannot push full outer joins (FOJ)
• Other gaps include
– Set operations (union, intersect and except)
– Sub-queries
– ORDER BY (because they used to require LIMIT N)
– Cannot have multiple distinct aggregates
Queries on Big SQL work out-of-box

Big SQL Security – Best In Class
• Role Based Access Control
• Row Level Security
• Column Level Security
• Separation of Duties / Audit
[Diagram labels: BRANCH_A, BRANCH_B, FINANCE]
See it in action on YouTube:
https://www.youtube.com/watch?v=N2FN5h25-_s

Announced at Strata + Hadoop World Sept 2015:
Big SQL V4.1 vs Hive 1.2.1 Performance Test Update
See it in action on YouTube:
www.youtube.com/watch?v=SYQgzRGhqVU

Performance Test Summary
Big SQL V4 vs. Hive 1.2.1 @ 1TB
• In 99 / 99 Queries, Big SQL was Faster
• On Average, Big SQL was 21X faster
• Excluding the Top 5 and Bottom 5 Results, Big SQL was 19X faster
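An "excluding the top 5 and bottom 5" figure is a trimmed average. The sketch below shows one way such numbers could be derived from per-query timings. The slide does not say exactly how the averages were computed, so averaging per-query speedup ratios is an assumption, and the timings in `main` are invented for illustration.

```java
import java.util.Arrays;

class SpeedupSummary {
    // Per-query speedup: baseline elapsed time divided by candidate elapsed time.
    static double[] speedups(double[] baselineSecs, double[] candidateSecs) {
        if (baselineSecs.length != candidateSecs.length) {
            throw new IllegalArgumentException("both engines must have one timing per query");
        }
        double[] ratios = new double[baselineSecs.length];
        for (int i = 0; i < ratios.length; i++) {
            ratios[i] = baselineSecs[i] / candidateSecs[i];
        }
        return ratios;
    }

    // Plain arithmetic mean; NaN for an empty input.
    static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    // Mean after dropping the `trim` smallest and `trim` largest values.
    static double trimmedMean(double[] values, int trim) {
        if (trim < 0 || 2 * trim >= values.length) {
            throw new IllegalArgumentException("trim leaves no values to average");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        return mean(Arrays.copyOfRange(sorted, trim, sorted.length - trim));
    }

    public static void main(String[] args) {
        // Hypothetical timings in seconds for four queries (not benchmark data).
        double[] hive = {120, 300, 45, 600};
        double[] bigSql = {6, 10, 9, 20};
        double[] ratios = speedups(hive, bigSql);
        System.out.println("Average speedup: " + mean(ratios));
        System.out.println("Trimmed (1 from each end): " + trimmedMean(ratios, 1));
    }
}
```

For the 99-query run on the slide, the equivalent call would be `trimmedMean(ratios, 5)`. Trimming keeps a handful of extreme queries from dominating the headline number.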

IBM BigInsights

BigSheets
Browser based analytics tool for BigData
• Explore, visualize, transform unstructured and structured data
• Visual data cleansing and analysis
• Filter and enrich content
• Visualize Data
• Export data into common formats
No programming knowledge needed!

QUICK DEMO
BigSheets….

Major Canadian Insurance Company
BigSheets was the primary reason for choosing BigInsights. The client had huge
tables on their mainframe system, and their business analysts wanted a way to
see the big picture as opposed to sub-setting the data (the traditional Excel way).
BigSheets allowed them to do what they needed to do, and they described it as a
"game changer". They were then able to analyze multiple tables/files that were
100GB in size.
Before: Mainframe db subsets (small datasets, incomplete data, spreadsheet proliferation)
After: BigInsights (complete data, centralized data)

BigSheets Empowers Business Users on Hadoop
Tara Paider @ Nationwide
AVP, IT Architecture, Data Management & Analytics
"Nationwide runs Monte Carlo simulations. The data preparation step
normally takes approximately 3 days.
Frustrated with this, as part of a POC, business analysts re-implemented the
transformations on IBM BigInsights and it now runs in 10 minutes.*
Technologies within BigInsights enabled business users to leverage the
power of Hadoop and Map Reduce without advanced programming skills."
*The performance improvement is attributed to pushing computation and transformations close to the data
and the scale out capability of Hadoop.

Putting it all together…
Big Data Technology Patterns

Traditional Enterprise Analytic Environment
[Diagram components: Data Sources (Structured, Operational); Staging Area; EDW; Marts; Archive; Information and Insight (BI, Performance Management, Predictive Analytics and Modeling); Information Movement & Transformation]
Task: Extract Operational Data to Staging Area
Task: Normalize Data for Enterprise Consumability
Task: Provide Guided and Interactive Access
Task: Deliver Data for Deeper Analysis and Modeling
Task: Archive “Cold” Data to Reduce Costs
Task: Tooling to Facilitate Data Movement & Transformation

Traditional Approach to Improve Analytic Architectures
[Diagram components: Data Sources (Structured, Operational); Expanded EDW with Staging Area; Marts; Archive; Information and Insight (BI, Performance Management, Predictive Analytics and Modeling)]
• Put Staging Area in the EDW
• Still only structured (models don’t get more aperture)
• Doesn’t accelerate the speed of model lifecycles
• $$ Expensive $$

Faster, Deeper Insights While Reducing Costs
[Diagram components: data sources (Structured, Operational, Sensor, Geospatial, Time Series, Unstructured, External, Social, Streaming); Landing; Exploration; Discovery; Archive; Day 0 Archive; Staging Area; Enterprise Warehouse; Marts; BI, Performance Management, Predictive Analytics and Modeling]
• Faster Performance – up to 4-digit improvements
• Reduces storage 10x – 25x
• Load and Go (no tuning, +++)

Current State: Analytics Development Cycle
Typical Phases…. High cost of failure
• Bright Idea!
• Request IT for data extraction
• Gather Requirements
• Data Integration Effort Estimates
• Solution Design
• Infrastructure Cost Analysis
• Business Case Development
• GO or NO-GO Management Approvals
• Procurement: DB, storage, SW licenses
• Data Quality & ETL Development
• Report Development
• Review Results… 6 Months later
Actors Involved
• Business Analysts
• Solution Architects
• Infrastructure Architects
• Database Administrators
• SAN Administrators
• Management
• ETL Developers
• Report Developers

Target State: Rapid Prototyping
Develop a culture of fail fast
Many Ideas…
• Land data in Hadoop (if not already)
• Explore Data with BigSheets
• Prototype Reports with Cognos/BigSQL
• GO or No-Go Management Approvals
Elapsed Time: Days/Weeks
Still Required for “Go To Production”
• Infrastructure Cost Analysis
• Data Quality & ETL Development
• Operationalization
Benefits
• More Precise Requirements Gathering
• More Information about Data Quality
• More Accurate Project Estimates
• More Reliable Business Cases
• Actionable Insights before Production Ready Solution

Right Tool, Right Job
Text Analytics
Unstructured Semi Structured Structured
Big Sheets, Big SQL
Cognos w/Big SQL
Cognos + RDBMS

Summary
• Hadoop is a Big File System that can run analytics
• Unlike a database, it can store anything as files (structured, semi-structured, unstructured)
• Hadoop lowers the cost to pennies per GB – making it possible to have a copy of all data from source systems
• SQL on Hadoop enables analytics for the masses – no need to learn MapReduce
• IBM BigInsights: BigSheets and Big SQL enable discovery and rapid BI prototyping
“Big SQL makes access to Hive data faster and more secure”

Other Resources….
• Watch Big SQL take on Spark SQL
– https://www.youtube.com/watch?v=bAs74frPUq8
• Improve security for Hive data with Big SQL
– https://www.youtube.com/watch?v=N2FN5h25-_s

BUSINESS ANALYTICS:
ARCHITECTED TO SCALE
SENTURUS INTRODUCTION

Laser Focused on Business Analytics
• Dashboards, Reporting & Visualizations
• Data Preparation
• Big Data & Advanced Analytics
• Enterprise Planning

Senturus Offerings
• Comprehensive Consulting
• Dedicated Resources
• Training
• Accelerated Development Tools
• Migrations/Upgrades/Installations
• Performance & Optimization
• Jumpstarts
• Project Roadmaps & Assessments

Demo Cloud: Big SQL Technology Sandbox
• Big SQL Technology Sandbox is a large, shared environment for data science
• You can use it to run R, SQL, Spark, and Hadoop jobs
• It is a high performance cluster demonstrating the advantages of parallelized processing of big data sets
For more information, see link on Senturus website

Related Videos
You may also be interested in the following YouTube videos authored by Paul Yip.
• Access Apache Hive Data Faster and More Securely with Big SQL
• Spark vs IBM Big SQL Performance
• Hadoop HDFS vs Spectrum Scale (GPFS)

Free Resources on www.senturus.com

Upcoming Events
www.senturus.com/events

Thank You!
www.senturus.com
info@senturus.com
888 601 6010
Copyright 2016 by Senturus, Inc.
This entire presentation is copyrighted and may not be
reused or distributed without the written consent of
Senturus, Inc.
