Scaling up with Cisco Big Data: Data + Science = Data Science

Data + Science = Data Science
Presented by:
eRic Choo
Scaling up with Cisco Big Data

Big Data Products-Solutions Stack
Infrastructure - Servers, Storage, Data Protection & Retention Solutions
Business Intelligence
Data Mining & Business Analytics
Big Data Virtualization & Systems Integration

Apache Spark
spark.apache.org
github.com/apache/spark
user@spark.apache.org
• Originally developed in 2009 in
UC Berkeley’s AMPLab
• Fully open sourced in 2010 – now
a Top Level Project at the Apache
Software Foundation
• Fast Growing Community

What is Spark?
Spark Streaming: IoT live data stream -> batches of X seconds -> processed results
Data Science cycle: Understand, Explore, Model, Assess
Hadoop
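The "batches of X seconds" idea in the streaming diagram can be sketched in plain Python. This is only an illustration of micro-batching, not Spark Streaming code; the sensor readings and the `micro_batches` helper are hypothetical.

```python
from collections import defaultdict

def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into windows of batch_seconds.

    Windows that received no events are simply skipped.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Integer division assigns each event to its window index.
        batches[int(timestamp // batch_seconds)].append(value)
    # Return the non-empty windows in time order.
    return [batches[index] for index in sorted(batches)]

# Hypothetical IoT readings: (seconds since start, temperature).
readings = [(0.5, 20.0), (1.2, 21.0), (2.1, 19.0), (3.9, 23.0), (4.0, 22.0)]

# "Processed results": one average per 2-second batch.
averages = [sum(batch) / len(batch) for batch in micro_batches(readings, 2)]
print(averages)
```

Here X is 2 seconds; a real streaming job would receive events continuously rather than from a finished list.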

• MapReduce is powerful, but hard
• Spark aims to be both powerful and easy for processing
• How does it do it?
– A more generalized form of MapReduce
– Elements transformed in parallel
– In-memory caching
– Supports Python & Scala, along with Java
What is Spark? An Execution Engine on Top of Hadoop
Diagram: Input -> Map -> Reduce -> Output
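The map and reduce steps named on this slide can be illustrated with a small word count. This is a plain-Python sketch, not Spark code; the input lines are made up. In PySpark the comparable RDD operations would be `flatMap`, `map` and `reduceByKey`, with `cache()` keeping a data set in memory between passes.

```python
from collections import Counter
from functools import reduce

# Hypothetical input lines.
lines = ["spark is fast", "hadoop is big", "spark is easy"]

# Map step: each line is turned into its own word counts independently,
# so this work could be spread across machines.
mapped = [Counter(line.split()) for line in lines]

# Reduce step: merge the partial counts into one result.
word_counts = reduce(lambda left, right: left + right, mapped, Counter())

print(word_counts.most_common(2))
```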

Spark advantages for the end user
Faster Development & Data Pipelining
• Simple, easy-to-understand programming
abstraction with an interactive shell
• APIs for Java, Python and Scala
• Enables reuse of code across batch,
interactive and streaming applications
e.g. calling machine learning library
routines in Spark SQL
In-Memory Performance
• General-purpose execution graphs
• In-memory pipelining to achieve maximum
performance without persisting
intermediate results to disk
Popular use cases include ETL, Machine Learning and Real-time Analytics

Easy to Develop Applications – Example
2-5x less code

Hadoop with Speed Advantages - Example
Logistic regression in Hadoop
MapReduce and Hadoop with Spark
Hadoop MR
Hadoop w/ Spark
Up to 10x faster on disk,
100x faster in memory
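Logistic regression is an iterative algorithm: every training pass reads the same data again, which is the kind of workload where keeping data in memory is expected to help. The sketch below is a minimal single-machine version in plain Python, with hypothetical data and hyperparameters; it is not the benchmark code behind the figures above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(points, iterations=200, learning_rate=0.5):
    """Fit a weight and bias for 1-D points [(x, label)] by batch gradient descent."""
    weight, bias = 0.0, 0.0
    for _ in range(iterations):
        grad_w = grad_b = 0.0
        # Each iteration scans the full data set again.
        for x, label in points:
            error = sigmoid(weight * x + bias) - label
            grad_w += error * x
            grad_b += error
        weight -= learning_rate * grad_w / len(points)
        bias -= learning_rate * grad_b / len(points)
    return weight, bias

def predict(x, weight, bias):
    """Return True when the model assigns probability >= 0.5 to label 1."""
    return sigmoid(weight * x + bias) >= 0.5

# Hypothetical training data: negative x values are class 0, positive are class 1.
points = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]
weight, bias = train_logistic(points)
print(predict(1.5, weight, bias), predict(-1.5, weight, bias))
```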


MapR – Integration and Support of Apache Spark Stack
APACHE HADOOP AND OSS ECOSYSTEM
EXECUTION ENGINES
Batch: Spark, Cascading, Pig, MapReduce v1 & v2, Tez
Streaming: Spark Streaming, Storm
NoSQL & Search: HBase, Solr
SQL: Hive, Impala, Spark SQL, Drill
ML, Graph: Mahout, MLlib, GraphX
YARN
DATA GOVERNANCE AND OPERATIONS
Security: Sentry
Workflow & Data Governance: Oozie
Provisioning & Coordination: Juju, Sahara, ZooKeeper
Data Integration & Access: Sqoop, Flume, HttpFS, Hue
Data Platform: MapR-FS, MapR-DB
Management

Spark Stack Offers Variety of Functionality…
Spark SQL
(SQL)
Spark Streaming
(Streaming)
MLlib (Machine learning)
GraphX (Graph computation)
Spark (General execution engine)
Mesos
Distributed File System (HDFS, MapR-FS, S3, …)
Hadoop YARN

Spark on MapR Advantages
High Performance: World-record performance on disk coupled with in-memory processing advantages
Enterprise-grade Applications: Industry-leading enterprise-grade features for the Spark stack
24/7 Best-in-class Global Support: Strategic partnership with Databricks to ensure enterprise support for the entire stack
Operational Data Store + Spark: MapR-DB + Spark on one Hadoop cluster allows for real-time, as-it-happens analytics

Cisco: Security Intelligence Operations
Sensor data lands in MapR
Spark Streaming on MapR for
first check on known threats
Data next processed on GraphX
and Mahout
Additional SQL querying done
via Spark SQL and Impala
Complex
Data Pipelining
without MapReduce

Industry Leading Ad-Targeting Platform:
Real-time Decisions
High performance analytics
over MapR-DB
Load from MapR-DB table into
RDD to augment scoring
Results stored back in MapR-DB
for other applications
Real-time Analytics
over NoSQL

Addressing Health
Care Regulations
Patient information in MapR-DB
combined with clinical records to
compute re-admittance
probability
Process uses Spark with
transactional data in MapR-DB
Deploy home health services to
prevent re-admittance
Real-time Analytics
over NoSQL

Streaming Use Cases
• Manufacturing & Internet of Things: Real-time, adaptive analysis of machine data (e.g.,
sensors, control parameters, alarms, notifications, maintenance logs, and imaging results)
from industrial systems (e.g., equipment, plant, fleet) for visibility into asset health, proactive
maintenance planning, and optimized operations.
• Fraud Management: Real-time analysis of business communication and accounting
transactions to detect unusual activities.
• Marketing & Sales: Analysis of customer engagement and conversion, powering real-time
recommendations while customers are still on the site or in the store.
• Customer Service & Billing: Analysis of contact center interactions, enabling accurate
remote troubleshooting before expensive field technicians are dispatched.
• Information Technology: Log processing to detect unusual events occurring in stream(s) of
data, so that IT can take remedial action before service quality degrades.
Real-time Analytics
over Streaming

Data Science
• What is Data Science?
– Extraction of knowledge from data
employing math, statistics and information
theory (probability models, machine learning,
etc.)

Source: Wikipedia
Data Analytics/Science Development Cycle
Challenges
• Data Science knowledge
required
• Multiple models for testing
• Multiple ways of tuning testing
data
• Multiple iterations of testing
• Stabilizing results
Benefits of Automation
• Data Science knowledge built into
platform
• Automated testing of multiple
models
• Selection of most accurate models
• Reduced iterative testing time
• Effective use of Data Science
Resources
• Higher productivity and lower
cost

Basic Data Science Categories
• Supervised Learning • Unsupervised Learning

Supervised Learning
• Labelling of data according to a labelled training set
• Example
– I know that it will rain when
• Sky is dark
• More moisture in the air
• It is near the rainy season
– Question:
• Given the current weather, will it rain?
• Type of algorithms
– Naive Bayes
– Linear Regression
– Decision Trees

Unsupervised Learning
• Example:
– I have a set of data collected regarding weather
– I have multiple other sets of data that are not
related to the weather, e.g. forest fire data from a
nearby region.
– Is there any relationship between the data sets?
• Type of algorithms
– K-means
– Fuzzy Clustering
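A minimal K-means sketch in plain Python shows how unlabelled values end up grouped. The rainfall figures and starting centroids are hypothetical, and the data is one-dimensional to keep the example short; real data sets would usually have several features.

```python
def k_means(values, centroids, iterations=10):
    """Cluster 1-D values around the given starting centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for value in values:
            # Assign each value to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Move each centroid to the mean of its cluster (unchanged if the cluster is empty).
        centroids = [
            sum(cluster) / len(cluster) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabelled measurements, e.g. daily rainfall in mm.
rainfall_mm = [1.0, 2.0, 3.0, 20.0, 21.0, 22.0]
centroids, clusters = k_means(rainfall_mm, centroids=[0.0, 10.0])
print(centroids, clusters)
```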

Text Analytics
CLUSTER DOCUMENTS
Hadoop
Text Documents
MAHOUT (Data Science Tool)
MapReduce

Text Analytics
CLUSTER 1 CLUSTER 2
CLUSTER 3 CLUSTER 4
Hadoop
Text Documents
MAHOUT (Data Science Tool)
MapReduce

Categorizing into Topics/Stories
CLUSTER 1 CLUSTER 2
CLUSTER 3 CLUSTER 4
CONSTRUCT STORIES
TOP TERMS CL 1
Technology
3D Printing
Steve Jobs
Sports Wear
…
TOP TERMS CL 2
United Nations
Dogs
Camera
Internet of Things
…
CATEGORY : INNOVATION CATEGORY : SECURITY

Term Document Matrix
Word1
Word2
Word3
Word4
Word5
Word6
Word7
Word8
Word9
Word10
Word11
FILE 1
FILE 2
FILE 3
FILE 4
FILE 5
FILE 6
FILE 7
Cluster 1:
FILE 1
FILE 6
Cluster 2:
FILE 2
FILE 5
Cluster 3:
FILE 3
FILE 4
Cluster 4:
FILE 7
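A term-document matrix like the one on this slide can be built in a few lines. The sketch below uses three made-up documents and cosine similarity as one possible way to judge which files belong in the same cluster; it is plain Python, not the Mahout workflow mentioned earlier.

```python
import math
from collections import Counter

# Hypothetical file contents.
documents = {
    "FILE 1": "spark streaming data",
    "FILE 2": "football match score",
    "FILE 3": "spark data pipeline",
}

counts = {name: Counter(text.split()) for name, text in documents.items()}
vocabulary = sorted({word for counter in counts.values() for word in counter})

# One row per file, one column per vocabulary word; each cell is a term count.
matrix = {name: [counter[word] for word in vocabulary] for name, counter in counts.items()}

def cosine(a, b):
    """Cosine similarity of two count vectors (0.0 when either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

print(cosine(matrix["FILE 1"], matrix["FILE 3"]))
print(cosine(matrix["FILE 1"], matrix["FILE 2"]))
```

Files that share many terms score close to 1 and would be grouped together; files with no terms in common score 0.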

Sentiment Analysis
TWITTER (DATA IN JSON FORMAT)
Field Value
For Country United States
By individual State
Analyze Tweets
Objective: to find out the level of happiness of each state in the USA

Sentiment Dictionary
AFINN-111

Sentiment Score Computation
San Francisco
Los Angeles
New York
Chicago
Boston
San Diego
Score at tweet level for CA
Summing
up the
tweet level
scores for
each state
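The scoring described above, a score per tweet followed by a sum per state, can be sketched as follows. The word scores are illustrative AFINN-style values rather than an excerpt checked against AFINN-111, and the JSON field names `state` and `text` are assumptions about the tweet format.

```python
import json
from collections import defaultdict

# Illustrative AFINN-style word scores; a real run would load the full
# AFINN-111 word list instead of this hand-picked subset.
sentiment_dictionary = {"love": 3, "good": 3, "bad": -3, "sad": -2}

# Hypothetical tweets in JSON form.
raw_tweets = """[
  {"state": "CA", "text": "I love this good weather"},
  {"state": "CA", "text": "Traffic is bad"},
  {"state": "NY", "text": "So sad and bad today"}
]"""

def tweet_score(text):
    """Sum the dictionary scores of the words in one tweet (unknown words score 0)."""
    return sum(sentiment_dictionary.get(word, 0) for word in text.lower().split())

# Sum the tweet-level scores for each state.
state_scores = defaultdict(int)
for tweet in json.loads(raw_tweets):
    state_scores[tweet["state"]] += tweet_score(tweet["text"])

print(dict(state_scores))
```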

Results in Sentiment Analysis
Happy States
Unhappy States

Results in an example of Simple Visualization

Results in an example of Complex Visualization

Decision with Analytics Support
DECISION
Hadoop
Social Media
Data
Text Analytics
Data Science
SQL (Structured Query Language)

Data Science Automation
DataRobot is a platform that lets data scientists automate the entire model life
cycle, a process that is highly serialized and time-consuming. This life cycle
includes:
1. Pre-processing and feature engineering
2. Algorithm identification to build predictive model(s)
3. Training, testing, and validating of models
4. Building of deployment scripts for model deployment to provide business
insight
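The idea of testing several candidate models and keeping the most accurate one can be sketched generically. This is not DataRobot's API; the threshold "models", the validation rows and the helper names are all hypothetical stand-ins.

```python
def threshold_model(threshold):
    """Return a toy classifier that predicts 1 when the input exceeds threshold."""
    def model(x):
        return 1 if x > threshold else 0
    return model

def accuracy(model, rows):
    """Fraction of (x, label) rows the model classifies correctly."""
    return sum(model(x) == label for x, label in rows) / len(rows)

# Hypothetical validation data: (feature value, true label).
validation_rows = [(1, 0), (2, 0), (3, 0), (6, 1), (7, 1), (8, 1)]

# Candidate models to evaluate automatically.
candidates = {f"threshold={t}": threshold_model(t) for t in (0, 4, 7)}

# Score every candidate and select the most accurate one.
scores = {name: accuracy(model, validation_rows) for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
print(best_name, scores[best_name])
```

A real pipeline would also cover the pre-processing and deployment steps in the list above; this sketch covers only the testing and selection step.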

CISCO – MAPR DATA
ANALYTICS USE CASE

Quantium captures new niche in data analytics market
MapR Distribution for Apache Hadoop and
Cisco UCS cut query time by 92 percent,
improve accuracy of results
“With the Cisco-MapR platform, Quantium has positioned itself to stay well ahead of our
competitors for the foreseeable future.”
- Alex Shaw, Head of Technology Operations, Quantium
https://marketplace.cisco.com/catalog/products/3344

Hosted on Cisco infrastructure, MapR
Distribution for Hadoop meets Quantium’s strict
requirements
To meet its challenges, Quantium assembled a team of data scientists from across the business. The team created a set of requirements
and evaluated the available software and hardware solutions on the market.
“Decisions about the new platform would affect Quantium’s business for years to come, so we invested a significant amount of time
and money in the selection process,”
- Alex Shaw, Quantium’s Head of Technology Operations

“The POC demonstrated that MapR performs better than the competition. The
MapR file system gives us maximum control over how we store information within
the data volumes and has good security features.”
• Quantium realized that a big data solution was needed, not only because
of the data volume but also the heavy analytical requirements.
• While the team chose Hadoop as the big
data software solution, they still needed
to choose the best distribution from
among the top-tier Hadoop vendors (see
figure 1).
• The first stage of the process, a thorough
analysis of features and benefits,
narrowed the field to MapR and one
other competitor.

• Performance of new platform exceeds targets
• Unique business model outpaces competitors
• Greater innovation, shorter time to market
“Having access to external data sets to combine with
our clients’ data distances us from everybody else in
this space,”
“We have a lot of smart people who have been
hamstrung by technology and its ability to implement
their ideas. Now they have improved ways of executing
analytics which opens up the ability to create new and
innovative solutions for our clients”

• Scaling to accommodate business growth
• Multi-tenancy model safeguards client information
“MapR incorporates data partitioning
via the Volumes feature, which allows us
to logically segregate individual data
sets while optimizing data storage for
optimum performance,”
- Alex Shaw, Quantium’s Head of Technology
Operations

Extending the Quantium approach to new
markets
“We’ve expanded the range of problems that we can
solve, enabling our clients to grow their business by
interacting with each of their customers as individuals
with specific wants and needs,”
“With the Cisco-MapR platform, Quantium has
positioned itself to stay well ahead of our competitors
for the foreseeable future.”

WORLD’S LARGEST BIOMETRIC IDENTITY
SYSTEM: AADHAAR EXPERIENCE

World's Largest Biometric Identity System: Aadhaar Experience
• 1.2 billion residents
– 640,000 villages, ~60% under $2/day, ~75% literacy
– <3% pay income tax, <20% banking
– ~1 billion mobile connections
– ~300-400m migrant workers
• $50 billion direct subsidies every year!
– Residents have no standard verifiable identity
– Most programs plagued with ghost and multiple
identities causing leakage of 20-40%

Demographic Data
• Compulsory data:
– Name, Age/Date of Birth,
Gender and
– Address of the resident
• Optional data:
– Mobile number
– Email address
Biometric Data
Photograph
All 10 fingerprints
Both irises
12-digit Aadhaar Number
Unique, lifetime, biometric based identity

Concluding Spark?
Spark Streaming: IoT live data stream -> batches of X seconds -> processed results
Hadoop
Data Science cycle: Understand, Explore, Model, Assess

Big Data Implementation Road Map
PLAN BUILD MANAGE
Understand
Explore
Model
Assess
Discovery
Workshop
Proof of
Concept
Validation
Plan, Design,
Implement
Support /
Managed
Services

Please take some time to fill out the
feedback form and the Question Sheet
