Data + Science = DataScience
Presented by:
eRic Choo
Scaling up with Cisco Big Data
Big Data Products-Solutions Stack
Infrastructure - Servers, Storage, Data Protection & Retention Solutions
Business Intelligence
Data Mining & Business Analytics
Big Data Virtualization & Systems Integration
What you will be hearing
WHAT
AND WHY?
Apache Spark
spark.apache.org
github.com/apache/spark
user@spark.apache.org
• Originally developed in 2009 in
UC Berkeley’s AMPLab
• Fully open sourced in 2010 – now
a Top Level Project at the Apache
Software Foundation
• Fast Growing Community
What is Spark?
Spark
Streaming
batches of X seconds
IoT live data stream
processed results
Understand
Explore
Model
Assess
Data
Science
Hadoop
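The "batches of X seconds" idea can be illustrated with a minimal plain-Python sketch. It does not use the Spark Streaming API; the function name `micro_batches` and the sensor readings are hypothetical, chosen only to show how a live stream is cut into fixed time windows that are then processed like small batch jobs.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive windows of batch_seconds."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Events whose timestamps fall in the same window share a batch index.
        batches[int(timestamp // batch_seconds)].append(value)
    # Return the non-empty batches in time order.
    return [batches[index] for index in sorted(batches)]


# Hypothetical IoT readings: (seconds since start, temperature)
readings = [(0.5, 20.0), (1.2, 21.0), (2.1, 19.0), (3.9, 22.0), (4.0, 23.0)]

batches = micro_batches(readings, batch_seconds=2)
# Each batch is processed independently, here by averaging it.
averages = [sum(batch) / len(batch) for batch in batches]

print(batches)
print(averages)
```

In this sketch, windows that receive no events are simply skipped; a real streaming engine would also have to deal with late or out-of-order data.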
• MapReduce is powerful, but hard to program
• Spark aims to be both powerful and easy to use for data processing
• How does it do it?
– A more generalized form of MapReduce
– Elements transformed in parallel
– In-memory caching
– Supports Python & Scala, along with Java
What is Spark? An Execution Engine on Top of Hadoop
[Diagram: data flows from Input through Map and Reduce stages to Output]
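The Input, Map, Reduce, Output flow can be sketched in plain Python as a word count. This is only an illustration of the programming model, not Spark or Hadoop code; the helpers `map_phase` and `reduce_by_key` and the sample lines are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records, mapper):
    """Apply mapper to every record and flatten the resulting (key, value) pairs."""
    return [pair for record in records for pair in mapper(record)]


def reduce_by_key(pairs, reducer):
    """Group values by key, then fold each group with reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(reducer, values) for key, values in grouped.items()}


# Hypothetical input lines
lines = ["spark is fast", "hadoop is reliable", "spark is easy"]

# Map: emit (word, 1) for every word. Reduce: add up the ones per word.
pairs = map_phase(lines, lambda line: [(word, 1) for word in line.split()])
counts = reduce_by_key(pairs, lambda a, b: a + b)

print(counts)
```

Each record is mapped independently of the others, which is why the map step can run in parallel across a cluster.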
Spark advantages for the end user
Faster Development & Data Pipelining
• Simple, easy-to-understand programming
abstraction with an interactive shell
• APIs for Java, Python and Scala
• Enables reuse of code across batch,
interactive and streaming applications
e.g. calling machine learning library
routines in Spark SQL
In-Memory Performance
• General-purpose execution graphs
• In-memory pipelining to achieve maximum
performance without persisting
intermediate results to disk
Popular use cases include ETL, Machine Learning and Real-time Analytics
Easy to Develop Applications – Example
2-5x less code
Hadoop with Speed Advantages - Example
Logistic regression in Hadoop MapReduce and Hadoop with Spark
[Chart: Hadoop MR vs. Hadoop w/ Spark]
Up to 10x faster on disk, 100x faster in memory
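A small plain-Python sketch of logistic regression by batch gradient descent suggests why an iterative job like this benefits from keeping data in memory: every iteration scans the same dataset again. The data points, learning rate and iteration count below are assumptions for illustration only and are not taken from the benchmark on the slide.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(points, iterations=200, learning_rate=0.5):
    """Fit y ~ sigmoid(w * x + b) with batch gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(iterations):
        grad_w = grad_b = 0.0
        # Full pass over the data on every iteration: this repeated scan is
        # the step that is cheap when the dataset is cached in memory.
        for x, y in points:
            error = sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / len(points)
        b -= learning_rate * grad_b / len(points)
    return w, b


# Hypothetical one-dimensional training data: (feature, label)
points = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b = train_logistic(points)

print(w, b)
print(sigmoid(w * 2.0 + b), sigmoid(w * -2.0 + b))
```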
Scaling up with Cisco Big Data: Data + Science = Data Science
MAPR SUPPORT FOR SPARK
MapR – Integration and Support of the Apache Spark Stack
APACHE HADOOP AND OSS ECOSYSTEM
[Diagram: MapR distribution stack]
Layer headings: EXECUTION ENGINES; DATA GOVERNANCE AND OPERATIONS
Category labels: Batch; Streaming; NoSQL & Search; SQL; ML, Graph; Security; Provisioning & Coordination; Workflow & Data Governance; Data Integration & Access
Components shown: YARN, Spark, Spark Streaming, Storm, Pig, Cascading, MapReduce v1 & v2, Tez, HBase, Solr, Hive, Impala, Spark SQL, Drill, Mahout, MLlib, GraphX, Juju, Sahara, Sentry, Oozie, ZooKeeper, Sqoop, Flume, HttpFS, Hue
Data Platform: MapR-FS, MapR-DB
Management
Spark Stack Offers Variety of Functionality…
Spark SQL (SQL)
Spark Streaming (Streaming)
MLlib (Machine learning)
GraphX (Graph computation)
Spark (General execution engine)
Mesos
Hadoop YARN
Distributed File System (HDFS, MapR-FS, S3, …)
Spark on MapR Advantages
• High Performance: World-record performance on disk coupled with in-memory processing advantages
• Enterprise-grade Applications: Industry-leading enterprise-grade features for the Spark stack
• 24/7 Best-in-class Global Support: Strategic partnership with Databricks to ensure enterprise support for the entire stack
• Operational DataStore + Spark: MapR-DB + Spark on one Hadoop cluster allows for real-time as-it-happens analytics
SPARK USE CASES
Cisco: Security Intelligence Operations
• Sensor data lands in MapR
• Spark Streaming on MapR for first check on known threats
• Data next processed on GraphX and Mahout
• Additional SQL querying done via Spark SQL and Impala
Complex Data Pipelining without MapReduce
Industry Leading Ad-Targeting Platform: Real-time Decisions
• High performance analytics over MapR-DB
• Load from MapR-DB table into RDD to augment scoring
• Results stored back in MapR-DB for other applications
Real-time Analytics over NoSQL
Addressing Health Care Regulations
• Patient information in MapR-DB combined with clinical records to compute re-admittance probability
• Process uses Spark with transactional data in MapR-DB
• Deploy home health services to prevent re-admittance
Real-time Analytics over NoSQL
Streaming Use Cases
• Manufacturing & Internet of Things: Real-time, adaptive analysis of machine data (e.g.,
sensors, control parameters, alarms, notifications, maintenance logs, and imaging results)
from industrial systems (e.g., equipment, plant, fleet) for visibility into asset health, proactive
maintenance planning, and optimized operations.
• Fraud Management: Real-time analysis of business communication and accounting
transactions to detect unusual activities.
• Marketing & Sales: Analysis of customer engagement and conversion, powering real-time
recommendations while customers are still on the site or in the store.
• Customer Service & Billing: Analysis of contact center interactions, enabling accurate
remote troubleshooting before expensive field technicians are dispatched.
• Information Technology: Log processing to detect unusual events occurring in stream(s) of
data, so that IT can take remedial action before service quality degrades.
Real-time Analytics
over Streaming
SPARK DEMO
DATA SCIENCE
Data Science
• What is Data Science?
– Extraction of knowledge from data using mathematics, statistics and information theory (probability models, machine learning, etc.)
Source: Wikipedia
Data Analytics/Science Development Cycle
Challenges
• Data Science knowledge required
• Multiple models for testing
• Multiple ways of tuning testing data
• Multiple iterations of testing
• Stabilizing results
Benefits of Automation
• Data Science knowledge built into platform
• Automated testing of multiple models
• Selection of most accurate models
• Reduced iterative testing time
• Effective use of Data Science resources
• Higher productivity and lower cost
Basic Data Science Categories
• Supervised Learning
• Unsupervised Learning
Supervised Learning
• Labelling of new data using a model learned from a labelled training set
• Example
– I know that it will rain when
• Sky is dark
• More moisture in the air
• It is near the rainy season
– Question:
• In the current weather, will it rain?
• Type of algorithms
– Naive Bayes
– Linear Regression
– Decision Trees
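The rain example can be turned into a tiny Naive Bayes classifier in plain Python. This is a minimal sketch under stated assumptions: the training rows are made up for illustration, the three boolean features (dark sky, humid air, rainy season) mirror the bullets above, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows):
    """Count labels and per-label feature values from (features, label) rows."""
    label_counts = Counter(label for _, label in rows)
    feature_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    for features, label in rows:
        for index, value in enumerate(features):
            feature_counts[(index, label)][value] += 1
    return label_counts, feature_counts


def predict(model, features):
    """Return the label with the highest prior times feature likelihoods."""
    label_counts, feature_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, -1.0
    for label, count in label_counts.items():
        score = count / total
        for index, value in enumerate(features):
            # Laplace smoothing; each feature has two possible values (True/False).
            score *= (feature_counts[(index, label)][value] + 1) / (count + 2)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up labelled training set: (dark_sky, humid, rainy_season) -> outcome
training = [
    ((True, True, True), "rain"),
    ((True, True, False), "rain"),
    ((True, False, True), "rain"),
    ((False, False, False), "dry"),
    ((False, True, False), "dry"),
    ((False, False, True), "dry"),
]

model = train_naive_bayes(training)
print(predict(model, (True, True, True)))
print(predict(model, (False, False, False)))
```

The labelled rows play the role of "what I know"; the call to `predict` answers the question "in the current weather, will it rain?".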
Unsupervised Learning
• Example:
– I have a set of data collected regarding weather
– I have multiple other sets of data that are unrelated to the weather, e.g. forest fire data from a nearby region
– Is there any relationship between the data sets?
• Type of algorithms
– K-means
– Fuzzy Clustering
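A minimal plain-Python sketch of K-means shows how groups are found without any labels. The two-dimensional points are hypothetical, and the initial centroids are simply taken to be the first k points, which is an assumption made to keep the example deterministic.

```python
def kmeans(points, k, iterations=10):
    """Cluster tuples of numbers into k groups; returns (centroids, clusters)."""
    centroids = list(points[:k])  # assumption: first k points as starting centroids
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[index]
            for index, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabelled measurements
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centroids, clusters = kmeans(points, k=2)

print(clusters)
print(centroids)
```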
DATA SCIENCE USE CASES
TRAFFIC ANALYTICS
Traffic Analytics
TEXT ANALYTICS
Text Analytics
CLUSTER DOCUMENTS
Hadoop
Text Documents
MAHOUT (Data Science Tool)
MapReduce
Text Analytics
CLUSTER 1 CLUSTER 2
CLUSTER 3 CLUSTER 4
Hadoop
Text Documents
MAHOUT (Data Science Tool)
MapReduce
Categorizing into Topics/Stories
CLUSTER 1 CLUSTER 2
CLUSTER 3 CLUSTER 4
CONSTRUCT STORIES
TOP TERMS CL 1
Technology
3D Printing
Steve Jobs
Sports Wear
…
TOP TERMS CL 2
United Nations
Dogs
Camera
Internet of Things
…
CATEGORY : INNOVATION CATEGORY : SECURITY
Term Document Matrix
Columns (terms): Word1, Word2, Word3, Word4, Word5, Word6, Word7, Word8, Word9, Word10, Word11
Rows (documents): FILE 1, FILE 2, FILE 3, FILE 4, FILE 5, FILE 6, FILE 7
Cluster 1: FILE 1, FILE 6
Cluster 2: FILE 2, FILE 5
Cluster 3: FILE 3, FILE 4
Cluster 4: FILE 7
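A term-document matrix can be sketched in plain Python as follows. This is not Mahout code; the document texts are invented, and cosine similarity is used here only as one common way to judge which rows belong in the same cluster.

```python
import math
from collections import Counter


def term_document_matrix(documents):
    """Return (vocabulary, matrix) where matrix maps each file to its term counts."""
    vocabulary = sorted(
        {word for text in documents.values() for word in text.lower().split()}
    )
    matrix = {}
    for name, text in documents.items():
        counts = Counter(text.lower().split())
        matrix[name] = [counts[word] for word in vocabulary]
    return vocabulary, matrix


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented document contents, for illustration only
documents = {
    "FILE 1": "printing technology printing",
    "FILE 2": "security camera network",
    "FILE 6": "technology printing",
}

vocabulary, matrix = term_document_matrix(documents)
print(vocabulary)
print(matrix)
print(cosine_similarity(matrix["FILE 1"], matrix["FILE 6"]))
print(cosine_similarity(matrix["FILE 1"], matrix["FILE 2"]))
```

Files whose rows point in a similar direction share vocabulary and are candidates for the same cluster; files with no terms in common have similarity zero.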
SENTIMENT ANALYTICS
Sentiment Analysis
TWITTER (DATA IN JSON FORMAT)
Field Value
For Country United States
By individual State
Analyze Tweets
Objective: To find the level of happiness of each state in the USA
Sentiment Dictionary
AFINN-111
Sentiment Score Computation
San Francisco
Los Angeles
New York
Chicago
Boston
San Diego
Score at tweet level for CA
Summing up the tweet-level scores for each state
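The scoring step can be sketched in plain Python: each tweet is scored by adding up the dictionary values of its words, and tweet scores are then summed per state. The mini-lexicon below is a hypothetical stand-in in the style of AFINN-111 (integer scores between -5 and +5), not the real dictionary file, and the tweets are invented.

```python
from collections import defaultdict

# Hypothetical mini-lexicon in the AFINN style; a real run would load AFINN-111.
LEXICON = {"love": 3, "great": 3, "happy": 3, "bad": -3, "hate": -3, "terrible": -3}


def tweet_score(text):
    """Sum the lexicon scores of the words in one tweet (unknown words count 0)."""
    return sum(LEXICON.get(word.strip(".,!?").lower(), 0) for word in text.split())


def state_scores(tweets):
    """Sum tweet-level scores for each state from (state, text) pairs."""
    totals = defaultdict(int)
    for state, text in tweets:
        totals[state] += tweet_score(text)
    return dict(totals)


# Invented tweets tagged with a state code
tweets = [
    ("CA", "I love this great weather!"),
    ("CA", "Traffic is terrible"),
    ("NY", "I hate the bad commute."),
]

scores = state_scores(tweets)
print(scores)
```

States with a high total would land on the "happy" side of the results and states with a low total on the "unhappy" side.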
Results in Sentiment Analysis
Happy States
Unhappy States
Results in an example of Simple Visualization
Results in an example of Complex Visualization
Decision with Analytics Support
DECISION
Hadoop
Social Media
Data
Text Analytics
Data Science
SQL (Structured Query Language)
DATA SCIENCE AUTOMATION
DEMO
Data Science Automation
DataRobot is a platform that lets data scientists automate the entire model life cycle, a process that is otherwise highly serial and time consuming. This life cycle includes:
1. Pre-processing and feature engineering
2. Algorithm identification to build predictive model(s)
3. Training, testing, and validating of models
4. Building of deployment scripts for model deployment to provide business insight
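The idea of testing several candidate models automatically and keeping the most accurate one can be sketched in plain Python. This is not DataRobot's API; the two candidate models, the helper names and the data are all hypothetical and deliberately simple.

```python
def fit_majority(train):
    """Baseline model: always predict the most common training label."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def fit_threshold(train):
    """Predict 1 when x is above the midpoint between the two class means."""
    zeros = [x for x, label in train if label == 0]
    ones = [x for x, label in train if label == 1]
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > threshold else 0


def accuracy(model, rows):
    return sum(model(x) == label for x, label in rows) / len(rows)


def select_best(candidates, train, validation):
    """Train every candidate, score it on validation data, return the winner."""
    scores = {
        name: accuracy(fit(train), validation) for name, fit in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical (feature, label) data split into training and validation sets
train = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1)]
validation = [(1.5, 0), (2.5, 0), (7.5, 1), (9.0, 1)]

candidates = {"majority": fit_majority, "threshold": fit_threshold}
best, scores = select_best(candidates, train, validation)
print(best, scores)
```

Holding back validation data that no candidate saw during training is what makes the comparison between models fair.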
CISCO – MAPR DATA
ANALYTICS USE CASE
Quantium captures new niche in data analytics market
MapR Distribution for Apache Hadoop and
Cisco UCS cut query time by 92 percent,
improve accuracy of results
“With the Cisco-MapR platform, Quantium has positioned itself to stay well ahead of our
competitors for the foreseeable future.”
- Alex Shaw, Head of Technology Operations, Quantium
Source: https://marketplace.cisco.com/catalog/products/3344
Hosted on Cisco infrastructure, MapR
Distribution for Hadoop meets Quantium’s strict
requirements
To meet its challenges, Quantium assembled a team of data scientists from across the business. The team created a set of requirements
and evaluated the available software and hardware solutions on the market.
“Decisions about the new platform would affect Quantium’s business for years to come, so we invested a significant amount of time
and money in the selection process,”
- Alex Shaw, Quantium’s Head of Technology Operations
“The POC demonstrated that MapR performs better than the competition. The
MapR file system gives us maximum control over how we store information within
the data volumes and has good security features.”
- Alex Shaw, Quantium’s Head of Technology Operations
• Quantium realized that a big data solution was needed, not only because
of the data volume but also the heavy analytical requirements.
• While the team chose Hadoop as the big
data software solution, they still needed
to choose the best distribution from
among the top-tier Hadoop vendors (see
figure 1).
• The first stage of the process, a thorough
analysis of features and benefits,
narrowed the field to MapR and one
other competitor.
• Performance of new platform exceeds targets
• Unique business model outpaces competitors
• Greater innovation, shorter time to market
“Having access to external data sets to combine with
our clients’ data distances us from everybody else in
this space,”
“We have a lot of smart people who have been
hamstrung by technology and its ability to implement
their ideas. Now they have improved ways of executing
analytics which opens up the ability to create new and
innovative solutions for our clients”
- Alex Shaw, Quantium’s Head of Technology Operations
• Scaling to accommodate business growth
• Multi-tenancy model safeguards client information
“MapR incorporates data partitioning
via the Volumes feature, which allows us
to logically segregate individual data
sets while optimizing data storage for
optimum performance,”
- Alex Shaw, Quantium’s Head of Technology
Operations
Extending the Quantium approach to new
markets
“We’ve expanded the range of problems that we can
solve, enabling our clients to grow their business by
interacting with each of their customers as individuals
with specific wants and needs,”
“With the Cisco-MapR platform, Quantium has
positioned itself to stay well ahead of our competitors
for the foreseeable future.”
- Alex Shaw, Quantium’s Head of Technology Operations
WORLD’S LARGEST BIOMETRIC IDENTITY
SYSTEM: AADHAAR EXPERIENCE
World's Largest Biometric Identity System: Aadhaar Experience
• 1.2 billion residents
– 640,000 villages, ~60% under $2/day, ~75% literacy
– <3% pay income tax, <20% banking
– ~1 billion mobile connections
– ~300-400m migrant workers
• $50 billion direct subsidies every year!
– Residents have no standard verifiable identity
– Most programs plagued with ghost and multiple
identities causing leakage of 20-40%
Demographic Data
• Compulsory data:
– Name, Age/Date of Birth,
Gender and
– Address of the resident
• Optional data:
– Mobile number
– Email address
Biometric Data
• Photograph
• All 10 fingerprints
• Both irises
World's Largest Biometric Identity System: Aadhaar Experience
12-digit Aadhaar Number
Unique, lifetime, biometric based identity
World's Largest Biometric Identity System: Aadhaar Experience
Concluding Spark?
Spark
Streaming
batches of X seconds
IoT live data stream
processed results
Hadoop
Understand
Explore
Model
Assess
Data
Science
Big Data Implementation Road Map
PLAN BUILD MANAGE
Understand
Explore
Model
Assess
Discovery Workshop
Proof of Concept
Validation
Plan, Design, Implement
Support / Managed Services
Please take some time to fill out the feedback form and the question sheet.
Scaling up with Cisco Big Data: Data + Science = Data Science

  • 1. Data + Science = DataScience P r e s e n t e d b y : eRic Choo Scaling up with Cisco Big Data
  • 2. Big Data Products-Solutions Stack Infrastructure - Servers, Storage, Data Protection & Retention Solutions Business Intelligence Data Mining & Business Analytics Big Data Virtualization & Systems Integration
  • 3. What you will be hearing
  • 5. Apache Spark spark.apache.org github.com/apache/spark [email protected] • Originally developed in 2009 in UC Berkeley’s AMPLab • Fully open sourced in 2010 – now a Top Level Project at the Apache Software Foundation • Fast Growing Community
  • 6. What is Spark? Spark Streaming batches of X seconds IoT live data stream processed results Understand ExploreModel Assess Data Science Hadoop
  • 7. • MapReduce is powerful, but hard • Spark aims to be both powerful and easy for processing • How does it do it? – A more generalized form of MapReduce – Elements transformed in parallel – Memory Cache-ing – Supports Python & Scala, along with Java What is Spark? An Execution Engine on Top of Hadoop Map ReduceInput Output Reduce Input Output
  • 8. Spark advantages for the end user Faster Development & Data Pipelining • Simple, easy-to-understand programming abstraction with an interactive shell • APIs for Java, Python and Scala • Enables reuse of code across batch, interactive and streaming applications e.g. calling machine learning library routines in Spark SQL In-Memory Performance • General-purpose execution graphs • In-memory pipelining to achieve maximum performance without persisting intermediate results to disk Popular use cases include ETL, Machine Learning and Real-time Analytics
  • 9. Easy to Develop Applications – Example 2-5x less code
  • 10. Hadoop with Speed Advantages - Example Logistic regression in Hadoop MapReduce and Hadoop with Spark Hadoop MR Hadoop w/ Spark Up to 10x faster on disk, 100x faster in memory
  • 13. MapR –Integration and Support of Apache Spark Stack APACHE HADOOP AND OSS ECOSYSTEM Security YARN Spark Streaming Storm StreamingNoSQL & Search Juju Provisioning & Coordination Sahara ML, Graph Mahout MLLib GraphX EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS Workflow & Data Governance Pig Cascading Spark Batch MapReduce v1 & v2 Tez HBase Solr Hive Impala Spark SQL Drill SQL Sentry Oozie ZooKeeperSqoop Flume Data Integration & Access HttpFS Hue Data PlatformMapR-FS MapR-DB Management
  • 14. Spark Stack Offers Variety of Functionality… Spark SQL (SQL) Spark Streaming (Streaming) MLlib (Machine learning) Spark (General execution engine) GraphX (Graph computation) Mesos Distributed File System (HDFS, MapR-FS, S3, …) Hadoop YARN
  • 15. Spark on MapR Advantages World-record performance on disk coupled with in-memory processing advantages High Performance Industry-leading enterprise-grade features for the Spark stack Enterprise-grade Applications Strategic partnership with Databricks to ensure enterprise support for the entire stack 24/7 Best-in-class Global Support MapR-DB + Spark on one Hadoop cluster allows for real-time as-it-happens analytics Operational DataStore + Spark
  • 18. Cisco: Security Intelligence Operations Sensor data lands in MapR Spark Streaming on MapR for first check on known threats Data next processed on GraphX and Mahout Additional SQL querying done via Spark SQL and Impala Complex Data Pipelining without MapReduce
  • 19. Industry Leading Ad-Targeting Platform: Real-time Decisions High performance analytics over MapR-DB Load from MapR-DB table into RDD to augment scoring Results stored back in MapR-DB for other applications Real-time Analytics over NoSQL
  • 20. Addressing Health Care Regulations Patient information in MapR-DB combined with clinical records to compute re-admittance probability Process uses Spark with transactional data in MapR-DB Deploy home health services to prevent re-admittance Real-time Analytics over NoSQL
  • 21. Streaming Use Cases • Manufacturing & Internet of Things: Real-time, adaptive analysis of machine data (e.g., sensors, control parameters, alarms, notifications, maintenance logs, and imaging results) from industrial systems (e.g., equipment, plant, fleet) for visibility into asset health, proactive maintenance planning, and optimized operations. • Fraud Management: Real-time analysis of business communication and accounting transactions to detect unusual activities. • Marketing & Sales: Analysis of customer engagement and conversion, powering real-time recommendations while customers are still on the site or in the store • Customer Service & Billing: Analysis of contact center interactions, enabling accurate remote trouble shooting before expensive field technicians are dispatched • Information Technology: Log processing to detect unusual events occurring in stream(s) of data, so that IT can take remedial action before service quality degrades Real-time Analytics over Streaming
  • 27. Data Science • What is Data Science? – Extraction of knowledge from data, employing math, statistics and information theory (probability models, machine learning, etc.)
  • 28. Data Analytics/Science Development Cycle (Source: Wikipedia) Challenges: • Data Science knowledge required • Multiple models for testing • Multiple ways of tuning testing data • Multiple iterations of testing • Stabilizing results Benefits of Automation: • Data Science knowledge built into platform • Automated testing of multiple models • Selection of most accurate models • Reduced iterative testing time • Effective use of Data Science resources • Higher productivity and lower cost
  • 29. Basic Data Science Categories • Supervised Learning • Unsupervised Learning
  • 30. Supervised Learning • Labelling of data according to a labelled training set • Example – I know that it will rain when: • The sky is dark • There is more moisture in the air • It is near the rainy season – Question: • In the current weather, will it rain? • Types of algorithms – Naive Bayes – Linear Regression – Decision Trees
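The rain example on this slide can be sketched as a tiny Naive Bayes classifier. This is only an illustration under stated assumptions: the training rows, the three boolean features (dark sky, humid air, rainy season) and the function names are all hypothetical and do not come from the presentation.

```python
from collections import Counter, defaultdict

# Hypothetical labelled training set:
# (dark_sky, humid_air, rainy_season) -> did it rain?
TRAINING = [
    ((True, True, True), True),
    ((True, True, False), True),
    ((True, False, True), True),
    ((False, True, True), False),
    ((False, False, True), False),
    ((False, False, False), False),
    ((True, False, False), False),
    ((False, True, False), False),
]

def train(examples):
    """Count labels and, per label, how often each feature value occurs."""
    label_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    for features, label in examples:
        for i, value in enumerate(features):
            feature_counts[(i, label)][value] += 1
    return label_counts, feature_counts

def predict(model, features):
    """Return the label with the highest Naive Bayes score."""
    label_counts, feature_counts = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = count / total  # prior P(label)
        for i, value in enumerate(features):
            # Laplace smoothing; each feature is boolean, so 2 possible values
            score *= (feature_counts[(i, label)][value] + 1) / (count + 2)
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAINING)
will_rain = predict(model, (True, True, True))    # dark, humid, rainy season
no_rain = predict(model, (False, False, False))   # clear, dry, off season
print(will_rain, no_rain)
```

The labelled rows play the role of "what I already know about rain"; the question "will it rain in the current weather?" is the call to `predict`.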
  • 31. Unsupervised Learning • Example: – I have a set of data collected about the weather – I have multiple other sets of data that are unrelated to the weather, e.g. forest fire data from a nearby region – Is there any relationship between the data sets? • Types of algorithms – K-means – Fuzzy Clustering
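As a rough illustration of K-means, the sketch below groups unlabelled 2-D readings into two clusters. The data points (imagined as temperature and humidity pairs), the choice of the first k points as starting centroids, and the fixed iteration count are all simplifying assumptions, not details from the slides.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means on 2-D points.

    Assumption for simplicity: the first k points seed the centroids.
    """
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabelled readings: (temperature, humidity)
readings = [(30, 20), (10, 85), (32, 25), (31, 22), (12, 80), (11, 90)]
centroids, clusters = kmeans(readings, k=2)
print(centroids)
print(clusters)
```

No labels are supplied; the grouping emerges only from the distances between points, which is what distinguishes this from the supervised case.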
  • 38. Text Analytics (diagram labels: Text Documents, Hadoop, MAHOUT (Data Science Tool), MapReduce, Cluster Documents)
  • 39. Text Analytics (diagram labels: Text Documents, Hadoop, MAHOUT (Data Science Tool), MapReduce, Cluster 1, Cluster 2, Cluster 3, Cluster 4)
  • 40. Categorizing into Topics/Stories • Construct stories from Cluster 1, Cluster 2, Cluster 3, Cluster 4 • Top terms, Cluster 1: Technology, 3D Printing, Steve Jobs, Sports Wear, … • Top terms, Cluster 2: United Nations, Dogs, Camera, Internet of Things, … • Category: Innovation • Category: Security
  • 41. Term Document Matrix • Terms: Word 1 to Word 11 • Documents: FILE 1 to FILE 7 • Cluster 1: FILE 1, FILE 6 • Cluster 2: FILE 2, FILE 5 • Cluster 3: FILE 3, FILE 4 • Cluster 4: FILE 7
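A term-document matrix simply records how often each word occurs in each file. The sketch below builds one in plain Python; the three sample documents are hypothetical stand-ins for the slide's FILE 1 to FILE 7, and whitespace tokenization is a simplifying assumption.

```python
from collections import Counter

# Hypothetical documents standing in for FILE 1, FILE 2, ...
documents = {
    "FILE 1": "spark makes big data fast",
    "FILE 2": "hadoop stores big data",
    "FILE 3": "spark streaming processes data",
}

def term_document_matrix(docs):
    """Return the sorted vocabulary and one row of word counts per document."""
    vocabulary = sorted(
        {word for text in docs.values() for word in text.lower().split()}
    )
    matrix = {}
    for name, text in docs.items():
        counts = Counter(text.lower().split())
        # Counter returns 0 for words that do not occur in this document
        matrix[name] = [counts[word] for word in vocabulary]
    return vocabulary, matrix

vocabulary, matrix = term_document_matrix(documents)
print(vocabulary)
for name, row in matrix.items():
    print(name, row)
```

Rows built this way are the numeric vectors that a clustering tool can then group, so that files with similar word counts end up in the same cluster.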
  • 44. Sentiment Analysis • Objective: to find out the level of happiness of a state in the USA • Source: Twitter (data in JSON format) • Field / Value: for country, United States; by individual state • Analyze tweets
  • 46. Sentiment Score Computation • Cities: San Francisco, Los Angeles, New York, Chicago, Boston, San Diego • Score at tweet level for each state (e.g. CA) • Summing up the tweet-level scores for each state
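The computation described here, scoring each tweet and then summing the scores per state, might look like the sketch below. The word lexicon, the sample tweets, and the JSON field names `state` and `text` are hypothetical; real Twitter JSON has a different and much richer structure.

```python
import json
from collections import defaultdict

# Hypothetical sentiment lexicon: +1 for happy words, -1 for unhappy words
LEXICON = {"happy": 1, "love": 1, "great": 1, "sad": -1, "hate": -1, "awful": -1}

# Hypothetical tweets in JSON; the "state" and "text" fields are assumptions
RAW_TWEETS = [
    '{"state": "CA", "text": "I love this great weather"}',
    '{"state": "CA", "text": "Traffic is awful"}',
    '{"state": "NY", "text": "So sad and I hate the cold"}',
    '{"state": "IL", "text": "Happy day"}',
]

def tweet_score(text):
    """Score one tweet by summing the lexicon values of its words."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())

def state_scores(raw_tweets):
    """Sum the tweet-level scores for each state."""
    totals = defaultdict(int)
    for raw in raw_tweets:
        tweet = json.loads(raw)
        totals[tweet["state"]] += tweet_score(tweet["text"])
    return dict(totals)

scores = state_scores(RAW_TWEETS)
print(scores)
```

States with a positive total would land on the "happy" side of the results and states with a negative total on the "unhappy" side.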
  • 47. Results in Sentiment Analysis Happy States Unhappy States
  • 48. Results in an example of Simple Visualization
  • 49. Results in an example of Complex Visualization
  • 50. Decision with Analytics Support (diagram labels: Decision, Hadoop, Social Media Data, Text Analytics, Data Science, SQL (Structured Query Language))
  • 53. Data Science Automation • DataRobot is a platform that lets data scientists automate the entire model life cycle, a process that is otherwise highly serialized and time-consuming. This life cycle includes: 1. Pre-processing and feature engineering 2. Algorithm identification to build predictive model(s) 3. Training, testing, and validation of models 4. Building of deployment scripts for model deployment to provide business insight
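The idea of training several candidate models and keeping the most accurate one can be shown with a toy loop. This is not how DataRobot works internally; it is a generic sketch in which the two candidate models, the data sets and all names are hypothetical.

```python
# Hypothetical labelled data: (feature value, label)
train_set = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1)]
validation_set = [(2, 0), (4, 0), (5, 1), (9, 1)]

def majority_model(data):
    """Candidate 1: always predict the most common training label."""
    labels = [label for _, label in data]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def threshold_model(data):
    """Candidate 2: predict 1 above the midpoint between the class means."""
    zeros = [x for x, label in data if label == 0]
    ones = [x for x, label in data if label == 1]
    cut = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > cut else 0

def accuracy(model, data):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == label for x, label in data) / len(data)

# Train every candidate, validate it, then select the most accurate one
candidates = {"majority": majority_model, "threshold": threshold_model}
results = {
    name: accuracy(build(train_set), validation_set)
    for name, build in candidates.items()
}
best = max(results, key=results.get)
print(results, best)
```

An automation platform runs the same train, validate and select cycle over many more algorithms and tuning options, which is where the time saving comes from.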
  • 55. CISCO – MAPR DATA ANALYTICS USE CASE
  • 56. Quantium captures new niche in data analytics market: MapR Distribution for Apache Hadoop and Cisco UCS cut query time by 92 percent, improve accuracy of results. “With the Cisco-MapR platform, Quantium has positioned itself to stay well ahead of our competitors for the foreseeable future.” - Alex Shaw, Head of Technology Operations, Quantium. https://marketplace.cisco.com/catalog/products/3344
  • 57. Hosted on Cisco infrastructure, MapR Distribution for Hadoop meets Quantium’s strict requirements. To meet its challenges, Quantium assembled a team of data scientists from across the business. The team created a set of requirements and evaluated the available software and hardware solutions on the market. “Decisions about the new platform would affect Quantium’s business for years to come, so we invested a significant amount of time and money in the selection process.” - Alex Shaw, Quantium’s Head of Technology Operations
  • 58. “The POC demonstrated that MapR performs better than the competition. The MapR file system gives us maximum control over how we store information within the data volumes and has good security features.” - Alex Shaw, Quantium’s Head of Technology Operations • Quantium realized that a big data solution was needed, not only because of the data volume but also the heavy analytical requirements. • While the team chose Hadoop as the big data software solution, they still needed to choose the best distribution from among the top-tier Hadoop vendors (see figure 1). • The first stage of the process, a thorough analysis of features and benefits, narrowed the field to MapR and one other competitor.
  • 59. • Performance of new platform exceeds targets • Unique business model outpaces competitors • Greater innovation, shorter time to market “Having access to external data sets to combine with our clients’ data distances us from everybody else in this space,” “We have a lot of smart people who have been hamstrung by technology and its ability to implement their ideas. Now they have improved ways of executing analytics which opens up the ability to create new and innovative solutions for our clients” - Alex Shaw, Quantium’s Head of Technology Operations
  • 60. • Scaling to accommodate business growth • Multi-tenancy model safeguards client information “MapR incorporates data partitioning via the Volumes feature, which allows us to logically segregate individual data sets while optimizing data storage for optimum performance,” - Alex Shaw, Quantium’s Head of Technology Operations
  • 61. Extending the Quantium approach to new markets “We’ve expanded the range of problems that we can solve, enabling our clients to grow their business by interacting with each of their customers as individuals with specific wants and needs,” “With the Cisco-MapR platform, Quantium has positioned itself to stay well ahead of our competitors for the foreseeable future.” - Alex Shaw, Quantium’s Head of Technology Operations
  • 62. WORLD’S LARGEST BIOMETRIC IDENTITY SYSTEM: AADHAAR EXPERIENCE
  • 63. World's Largest Biometric Identity System: Aadhaar Experience • 1.2 billion residents – 640,000 villages, ~60% under $2/day, ~75% literacy – <3% pay income tax, <20% banking – ~1 billion mobile connections – ~300-400m migrant workers • $50 billion direct subsidies every year! – Residents have no standard verifiable identity – Most programs plagued with ghost and multiple identities, causing leakage of 20-40%
  • 64. World's Largest Biometric Identity System: Aadhaar Experience • Demographic Data – Compulsory data: name, age/date of birth, gender and address of the resident – Optional data: mobile number, email address • Biometric Data – Photograph – All 10 fingerprints – Both irises • 12-digit Aadhaar Number: unique, lifetime, biometric-based identity
  • 65. World's Largest Biometric Identity System: Aadhaar Experience
  • 66. Concluding Spark? (diagram labels: IoT live data stream, Spark Streaming, batches of X seconds, processed results, Hadoop; Data Science cycle: Understand, Explore, Model, Assess)
  • 69. Big Data Implementation Road Map • Phases: Plan, Build, Manage • Data Science cycle: Understand, Explore, Model, Assess • Discovery Workshop • Proof of Concept Validation • Plan, Design, Implement • Support / Managed Services
  • 70. Please take some time to fill out the feedback form and the question sheet.