© 2017 MapR Technologies
Applying Machine Learning to IOT:
End to End Distributed Pipeline for Real-Time Uber Data Using Apache APIs: Kafka, Spark, HBase
Carol McDonald
@caroljmcdonald
Agenda
Using an end-to-end distributed pipeline for real-time Uber data built on Apache APIs (Kafka, Spark, HBase), we will discuss:
•  Why IOT?
•  Why combine Machine Learning with IOT?
•  What is Machine Learning? How do you do it?
•  Why Spark with Machine Learning?
•  What is Streaming?
•  Why a Kafka-like distributed immutable log?
•  Why Spark Streaming?
•  Why Kafka + WebSockets?
•  Why NoSQL HBase?
Note: the code example is my own; only the data is from Uber
Why IOT? Lots of Things are Producing Streaming Data
Data collection devices: smart machinery, phones and tablets, home automation, RFID systems, digital signage, security systems, medical devices
What’s a Stream?
A stream is an unbounded sequence of events carried from a set of producers to a set of consumers.
[Figure: producers -> event stream -> consumers]
Why Stream Processing?
[Figure: a temperature event (6:05 P.M.: 90°) arrives on the Temperature topic of a stream and triggers "Turn on the air conditioning!"]
It’s becoming important to process events as they arrive
Why combine IOT with Machine Learning?
•  Audi and Daimler harness the power of deep learning in order to achieve their goal of building autonomous vehicles
–  Using MapR platform to scale deep learning efforts: https://mapr.com/company/press-releases/norcom-selects-mapr-deep-learning/
•  Audi's new A8 takes us further down the road to self-driving cars than ever before
–  https://www.cnet.com/roadshow/news/audis-new-a8-is-designed-to-let-you-play-candy-crush-in-rush-hour-traffic-safely/
Why combine IOT with Machine Learning?
•  Cheaper sensors and machine learning are making it possible for doctors to rapidly apply smart medicine to their patients’ cases
–  https://www.wsj.com/articles/the-smart-medicine-solution-to-the-health-care-crisis-1499443449
Why combine IOT with Machine Learning?
•  A Stanford team has shown that a machine-learning model can identify heart arrhythmias from an electrocardiogram (ECG) better than an expert
–  https://www.technologyreview.com/s/608234/the-machines-are-getting-ready-to-play-doctor/
Why combine IOT with Machine Learning?
•  Connected care ensuring quicker sepsis treatment:
–  Blood pressures, pulse rates, and oxygen levels from monitoring devices are combined with algorithms to automatically calculate a score and provide alerts
–  http://www.computerweekly.com/news/450422258/Putting-sepsis-algorithms-into-electronic-patient-records
Applying Machine Learning to Live Patient Data
•  https://www.slideshare.net/caroljmcdonald/applying-machine-learning-to-live-patient-data
Why combine IOT with Machine Learning?
•  Smart Cities will be using 1.39 billion connected cars, IoT sensors, and devices by 2020
•  http://www.cisco.com/c/en/us/solutions/industries/smart-connected-communities.html
Why combine IOT with Machine Learning?
•  Uber Near Realtime Price Surging
–  https://www.slideshare.net/ConfluentInc/kafka-uber-the-worlds-realtime-transit-infrastructure-aaron-schildkrout
•  Machine learning and geolocation data are being used in:
–  telecom, travel, marketing, and manufacturing
–  to identify patterns and trends
–  for recommendations, anomaly detection, and fraud detection
Why combine Streaming Events with Machine Learning?
Fraud detection, smart machinery, utility smart meters, home automation, networks, manufacturing, security systems, patient monitoring
What if BP had detected problems before the oil hit the water?
•  1M samples/sec
•  High performance at scale is necessary!
End to End Application Architecture
Part 1: Spark Machine Learning
•  End to End Application for Monitoring Uber Data using Spark ML
•  https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-1/
What is Machine Learning?
[Figure: data (contains patterns) -> train algorithm (finds patterns) -> build model; new data -> use model (prediction function, recognizes patterns) -> predictions]
ML Discovery Model Building
[Figure: Data discovery and model creation: historical data -> feature extraction -> training set -> model training/building; test set -> test model predictions -> evaluate results. Production: new data (Uber trips from a stream topic) -> feature extraction -> deployed model -> predictions]
Examples of ML Algorithms
Supervised
•  Classification
–  Naïve Bayes
–  SVM
–  Random Decision Forests
•  Regression
–  Linear
–  Logistic
Unsupervised
•  Clustering
–  K-means
•  Dimensionality reduction
–  Principal Component Analysis
–  SVD
Supervised Algorithms use labeled data
[Figure: labeled data (features) -> build model; new data (features) -> use model -> predict]
Supervised Machine Learning: Classification & Regression
Classification identifies the category for an item
Classification: Definition
Form of ML that:
•  Identifies which category an item belongs to
•  Uses supervised learning algorithms
–  Data is labeled
[Figure: sentiment classification example]
If It Walks, Swims, and Quacks Like a Duck, Then It Must Be a Duck
Features: walks, swims, quacks
Car Insurance Fraud Example
•  What are we trying to predict?
–  This is the Label or Target outcome:
–  The amount of Fraud
•  What are the “if questions” or properties we can use to predict?
–  These are the Features:
–  The claim Amount
Car Insurance Fraud Regression Example
Label (Y): amount of fraud
Feature (X): claimed amount
Data point: (fraud amount, claimed amount)
AmntFraud = intercept + coeff * claimedAmnt
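The regression line on this slide can be sketched in a few lines of Scala. This is only an illustration, not code from the talk: the `FraudLine` class and its intercept and coefficient values are made up for the example, whereas a real model would learn them from labeled claims.

```scala
// Hypothetical fitted line; the parameter values are illustrative assumptions
final case class FraudLine(intercept: Double, coeff: Double) {
  // AmntFraud = intercept + coeff * claimedAmnt
  def predict(claimedAmnt: Double): Double = intercept + coeff * claimedAmnt
}

val fraudLine = FraudLine(intercept = 100.0, coeff = 0.25)
val predictedFraud = fraudLine.predict(2000.0) // 100.0 + 0.25 * 2000.0
```

The feature (claimed amount) goes in and the label estimate (amount of fraud) comes out, which is all a fitted linear regression model does at prediction time.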
Credit Card Fraud Example
•  What are we trying to predict?
–  This is the Label:
–  The probability of Fraud
•  What are the “if questions” or properties we can use to predict?
–  These are the Features:
–  transaction amount, type of merchant, distance from and time since last transaction
Credit Card Fraud Logistic Regression Example
Label: probability of fraud, from 0 (not fraud) to 1 (fraud), with 0.5 marked in between
Features (X): transaction amount, type of store, time and location difference since the last transaction
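As a rough sketch of how logistic regression turns features into a probability between 0 and 1: the weighted sum of the features is passed through the sigmoid function. The function names, weights, and the 0.5 cut-off as a default threshold are assumptions for illustration, not values from the talk.

```scala
// Sigmoid squashes any real number into the interval (0, 1)
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// Probability of fraud from numeric features; weights and bias are hypothetical
def fraudProbability(features: Seq[Double], weights: Seq[Double], bias: Double): Double = {
  require(features.length == weights.length, "one weight per feature")
  val z = bias + features.zip(weights).map { case (x, w) => x * w }.sum
  sigmoid(z)
}

// Turn the probability into a fraud / not fraud decision
def isFraud(probability: Double, threshold: Double = 0.5): Boolean =
  probability >= threshold
```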
Supervised Learning: Classification & Regression
•  Classification:
–  identifies which category (eg fraud or not fraud)
•  Linear Regression:
–  predicts a value (eg amount of fraud)
•  Logistic Regression:
–  predicts a probability (eg probability of fraud)
Unsupervised Algorithms use Unlabeled data
[Figure: customer purchase data (contains patterns) -> train algorithm (finds patterns) -> build model -> customer groups; new customer purchase data -> use model (prediction function, recognizes patterns) -> predict group]
Unsupervised Machine Learning: Clustering
Clustering example: group news articles into different categories
Clustering: Definition
•  Unsupervised learning task
•  Groups objects into clusters of high similarity
–  Search results grouping
–  Grouping of customers, patients
–  Text categorization
–  recommendations
•  Anomaly detection: find what’s not similar
Clustering: Example
•  Group similar objects
•  Use MLlib K-means algorithm
1.  Initialize coordinates to center of clusters (centroid)
2.  Assign all points to nearest centroid
3.  Update centroids to center of points
4.  Repeat until conditions met
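The four K-means steps can be sketched in plain Scala without Spark. This is a simplified illustration rather than the MLlib implementation: the helper names are made up, the initial centroids are passed in by the caller, and the loop stops when the centroids no longer move or after `maxIter` rounds.

```scala
type Point = (Double, Double)

def sqDist(a: Point, b: Point): Double = {
  val dx = a._1 - b._1
  val dy = a._2 - b._2
  dx * dx + dy * dy
}

// Step 2: index of the nearest centroid for a point
def nearest(p: Point, centroids: Vector[Point]): Int =
  centroids.indices.minBy(i => sqDist(p, centroids(i)))

// Step 3: the center (mean) of a non-empty group of points
def mean(ps: Seq[Point]): Point =
  (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

// Step 1 is the caller's choice of `initial`; step 4 is the loop
def kMeans(points: Seq[Point], initial: Vector[Point], maxIter: Int): Vector[Point] = {
  var centroids = initial
  var iter = 0
  var changed = true
  while (changed && iter < maxIter) {
    val groups = points.groupBy(p => nearest(p, centroids))
    // a centroid with no assigned points keeps its old position
    val updated = centroids.indices
      .map(i => groups.get(i).map(mean).getOrElse(centroids(i)))
      .toVector
    changed = updated != centroids
    centroids = updated
    iter += 1
  }
  centroids
}
```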
Cluster Uber Trip Locations
Uber Data
•  Date/Time: The date and time of the Uber pickup
•  Lat: The latitude of the Uber pickup
•  Lon: The longitude of the Uber pickup
•  Base: The TLC base company affiliated with the Uber pickup
The Data Records are in CSV format. An example line is shown below:
•  2014-08-01 00:00:00,40.729,-73.9422,B02598
Uber Example
•  What are the “if questions” or properties we can use to group?
–  These are the Features:
–  Latitude, longitude, day of the week, time, rush hour …
Spark ML workflow
Zeppelin Notebook with Spark
[Figure: a notebook shared by data engineers and data scientists]
Load the data into a Dataframe: Define the Schema
import org.apache.spark.sql.types._
// typed view of one pickup record
case class Uber(dt: String, lat: Double, lon: Double, base: String)
// explicit schema, so the column types are not inferred from the file
val schema = StructType(Array(
  StructField("dt", TimestampType, true),
  StructField("lat", DoubleType, true),
  StructField("lon", DoubleType, true),
  StructField("base", StringType, true)
))
Input Comma Separated Values:
datetime, latitude, longitude, base
2014-08-01 00:00:00,40.729,-73.9422,B02598
Load the data into a Dataset
import org.apache.spark.sql.Dataset
import spark.implicits._  // provides the encoder used by .as[Uber]
val train: Dataset[Uber] = spark.read.option("inferSchema", "false")
  .schema(schema).csv("uber.csv").as[Uber]
Dataset merged with Dataframe
•  In Spark 2.0, the DataFrame and Dataset APIs were merged
•  A Dataset is a collection of typed objects
•  A DataFrame is a Dataset of generic Row objects
Spark Distributed Datasets
[Figure: a Dataset split into partitions P1 to P4, distributed across the executors of three worker nodes]
•  Read only collection of typed objects
•  Partitioned across a cluster
•  Operated on in parallel
•  Cached in memory
Spark Distributed Datasets
Spark revolves around RDDs
•  Read only collection of elements
•  Partitioned across a cluster
•  Operated on in parallel
•  Cached in memory
Extract the Features
[Figure: training data -> featurization -> feature vectors -> training -> model -> model evaluation -> best model. Image reference: O’Reilly Learning Spark]
Feature Vectors are vectors of numbers representing the value for each feature
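Use VectorAssembler to put features in vector column
import org.apache.spark.ml.feature.VectorAssembler
// combine the lat and lon columns into a single vector column named "features"
val featureCols = Array("lat", "lon")
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
// df2 = the loaded data plus the features column
val df2 = assembler.transform(train)
[Figure: DataFrame -> add column -> DataFrame + features]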
Create Kmeans Estimator, Set Features
import org.apache.spark.ml.clustering.KMeans
// K-means estimator: 8 clusters, at most 5 iterations, reading the "features" column
val kmeans = new KMeans()
  .setK(8)
  .setFeaturesCol("features")
  .setMaxIter(5)
// hold out 30% of the rows for testing (5043 is the random seed)
val Array(trainingData, testData) = df2.randomSplit(Array(0.7, 0.3), 5043)
// fit the estimator on the training data to get the fitted model
val model = kmeans.fit(trainingData)
[Figure: DataFrame + features -> input to estimator -> fit -> fitted model]
// print the cluster centers (latitude, longitude) of the fitted model
model.clusterCenters.foreach(println)
[40.76930621976264,-73.96034885367698]
[40.67562793272868,-73.79810579052476]
[40.68848772848041,-73.9634449047477]
[40.78957777777776,-73.14270740740741]
[40.32418330308531,-74.18665245009073]
[40.732808848486286,-74.00150153727878]
[40.75396549974632,-73.57692359208531]
[40.901700842900674,-73.868760398198]
Evaluate Clusters from K-Means Estimator
// add a prediction column holding the cluster id assigned to each test row
val clusters = model.transform(testData)
[Figure: DataFrame + features -> fitted model transform -> DataFrame + features + prediction]
Kafka API and Streaming Data
Part 2: MapR Event Streams with Kafka API and Spark Streaming
•  End to End Application for Monitoring Uber Data using Spark ML
•  https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-2/
What Do We Need to Do?
Data sources -> collect data -> process data -> store data -> serve data
Collect the Data
•  Data Ingest:
–  Network Based: MapR Streams, Kafka, Kinesis, Twitter, Sockets...
–  File Based: NFS with MapR-FS, HDFS
Organize Data into Topics with MapR Streams
Topics Organize Events into Categories and Decouple Producers from Consumers
[Figure: producers publish through the Kafka API to the topics Pressure, Temperature, and Warnings in a MapR cluster; consumers read through the Kafka API]
Scalable Messaging with MapR Streams
Topics are partitioned for throughput and scalability
Server 1: partition 1 of the topics Pressure, Temperature, and Warning
Server 2: partition 2 of the topics Pressure, Temperature, and Warning
Server 3: partition 3 of the topics Pressure, Temperature, and Warning
Producers are load balanced between partitions
Consumer groups can read in parallel
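To make the partitioning idea concrete, here is a small sketch of how messages might be spread over partitions and how the partitions of a topic might be divided among the consumers of one group. Both functions are illustrative assumptions (hash of the key, round-robin assignment), not the actual partitioner or assignment strategy of MapR Streams or Kafka.

```scala
// Assumption: a keyed message goes to the partition chosen by the hash of its key
def partitionFor(key: String, numPartitions: Int): Int =
  Math.floorMod(key.hashCode, numPartitions)

// Assumption: partitions are dealt out round-robin to the consumers of a group,
// so each partition is read by exactly one consumer of that group
def assignPartitions(numPartitions: Int, consumers: Vector[String]): Map[String, Vector[Int]] =
  (0 until numPartitions).toVector.groupBy(p => consumers(p % consumers.size))
```

With three partitions and two consumers, one consumer reads two partitions and the other reads one; adding a third consumer lets all three partitions be read in parallel.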
Partition is like a Queue
New messages are appended to the end
[Figure: producers append to the three partitions of the Admission topic, one partition per server in the MapR cluster; each partition holds its messages in order from old (1) to new (6); consumers read from the partitions]
Events are delivered in the order they are received, like a queue
[Figure: producers append messages 1 to 6 to a partition in the MapR cluster; each consumer group keeps its own read cursor]
Unlike a queue, events are persisted even after they’re delivered
Messages remain on the partition, available to other consumers
Minimizes non-sequential disk reads and writes
[Figure: a consumer polls through the client library and gets the unread events 3, 2, 1 from partition 1 of the Warning topic]
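The difference from a queue can be sketched as an append-only log with one read cursor per consumer group. This in-memory `PartitionLog` class is a made-up teaching aid, not the MapR Streams or Kafka client API: polling moves only the polling group's cursor and never removes events.

```scala
import scala.collection.mutable

final class PartitionLog[A] {
  private val events = mutable.ArrayBuffer.empty[A]
  // next offset to read, tracked separately for each consumer group
  private val cursors = mutable.Map.empty[String, Int]

  // New messages are appended to the end; returns the offset of the new event
  def append(event: A): Int = {
    events += event
    events.size - 1
  }

  // Return up to `max` unread events for the group and advance only its cursor
  def poll(group: String, max: Int): Vector[A] = {
    val from = cursors.getOrElse(group, 0)
    val batch = events.slice(from, from + max).toVector
    cursors(group) = from + batch.size
    batch
  }
}
```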
How do we do this with High Performance at Scale?
Parallelize operations and minimize disk read/write time
Processing Same Message for Different Purposes
[Figure: producers publish through the Kafka API; several consumers read the same messages through the Kafka API; MapR-FS]
Use the Model with Streaming Data
Process the Data with Spark Streaming and Spark Machine Learning
•  Extension of the core Spark API
•  Enables scalable, high-throughput, fault-tolerant stream processing of live data
Use Case: Real-Time Analysis of Geographically Clustered Vehicles
[Figure: Uber trip data -> stream topic -> Spark Streaming (enrich with K-means cluster location) -> stream topic -> Spark Streaming -> write to MapR-DB -> SQL]
Use Case: Time Series Data
Uber trip data, read from a stream topic by Spark Streaming:
2014-08-01 00:00:00,40.729,-73.9422,B02598
Enriched with the K-means cluster id and written to a stream topic:
{"dt":"2014-08-01 00:00:00.0","lat":40.3495,"lon":-74.0667,"base":"B02682","cluster":5}
Processing Spark DStreams
A DStream is a data stream divided into batches of X milliseconds
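The batching idea can be sketched without Spark by grouping timestamped events into fixed-length intervals. The `TimedEvent` type and `microBatches` function are hypothetical names for this illustration; Spark Streaming does the equivalent continuously and hands each batch to the application as an RDD.

```scala
final case class TimedEvent(timestampMs: Long, payload: String)

// Group events into consecutive batches of `batchMs` milliseconds.
// Each result pair is (batch start time, payloads in arrival order).
def microBatches(events: Seq[TimedEvent], batchMs: Long): Seq[(Long, Seq[String])] =
  events
    .groupBy(e => e.timestampMs / batchMs)
    .toSeq
    .sortBy { case (batchIndex, _) => batchIndex }
    .map { case (batchIndex, es) => (batchIndex * batchMs, es.map(_.payload)) }
```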
Function to Parse the Message Data to Uber Objects
2014-08-01 00:00:00, 40.729,-73.9422,B02598
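A parse function for lines like the one above might look as follows. It is a sketch, not the function from the talk: the `Uber` case class matches the one defined earlier in the deck, while the name `parseUber` and the choice to return `None` for malformed lines are assumptions.

```scala
import scala.util.Try

case class Uber(dt: String, lat: Double, lon: Double, base: String)

// Parse one CSV line into an Uber object; None if the line is malformed
def parseUber(line: String): Option[Uber] = {
  val p = line.split(",").map(_.trim)
  if (p.length != 4) None
  else Try(Uber(p(0), p(1).toDouble, p(2).toDouble, p(3))).toOption
}
```

Returning an `Option` lets a streaming job drop bad records (for example with `flatMap`) instead of failing the whole batch.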
Load the saved model
// load model for getting clusters
val model = KMeansModel.load(modelpath)
Create a DStream
DStream: a sequence of RDDs representing a stream of data
// one consumer record per message read from the topic
val messagesDStream = KafkaUtils.createDirectStream[String, String](
  ssc, LocationStrategies.PreferConsistent, consumerStrategy)
// get message values from key,value and parse to Uber objects
val uDStream = messagesDStream.map(_.value).map(_.split(","))
  .map(p => Uber(p(0), p(1).toDouble, p(2).toDouble, p(3)))
[Figure: the DStream as a sequence of batches (time 0 to 1, 1 to 2, 2 to 3), each stored in memory as an RDD]
Parse Message Text to Uber Objects and Convert to DataFrame
uDStream.foreachRDD{ rdd =>
val df = rdd.toDF()
// get cluster centers and add to df
// send to Topic
}
ssc.start()
ssc.awaitTermination()
Enrich Data with Cluster
Convert to JSON and Send the Enriched Message to the Topic
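The enrich-and-serialize step can be sketched in plain Scala. Everything here is illustrative: in the Spark application the fitted K-means model assigns the cluster, whereas this sketch picks the nearest of a hypothetical list of cluster centers, and the hand-built JSON does no string escaping.

```scala
final case class Trip(dt: String, lat: Double, lon: Double, base: String)

// Index of the closest cluster center (squared distance on lat/lon)
def nearestCluster(trip: Trip, centers: Vector[(Double, Double)]): Int =
  centers.indices.minBy { i =>
    val (cLat, cLon) = centers(i)
    val dLat = trip.lat - cLat
    val dLon = trip.lon - cLon
    dLat * dLat + dLon * dLon
  }

// Enriched message in the same shape as the JSON shown on the use-case slide
def toJson(trip: Trip, cluster: Int): String =
  s"""{"dt":"${trip.dt}","lat":${trip.lat},"lon":${trip.lon},"base":"${trip.base}","cluster":$cluster}"""
```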
Process DStream: Streaming Application Output
[Figure: DStream RDDs for each batch (time 0 to 1, 1 to 2, 2 to 3) -> map -> value DStream of transformed RDDs -> stream topic]
Real Time Dashboard
Part 3: Realtime Dashboard using Vert.x
•  End to End Application for Monitoring Uber Data using Spark ML
•  https://mapr.com/blog/monitoring-uber-with-spark-streaming-kafka-and-vertx/
What Do We Need to Do?
Data sources -> collect data (stream topic, MapR-FS) -> process data -> serve data
The Vert.x toolkit and Web Application Architecture
•  Event-driven
•  Event Bus
•  Verticles single threaded
Use Case Dashboard
Dashboard Architecture
Create a Vert.x Service
•  Create a Router object, which routes HTTP request URLs to handlers
•  Route paths that match /eventbus/* to an event bus bridge SockJSHandler
•  Create an HttpServer object and tell the server to listen on the configured port for incoming requests
Vert.x Service Kafka consumer
•  Create Kafka Consumer
•  Subscribe to Uber topic
•  Publish received messages to the Vert.x event bus address "dashboard"
The Dashboard Vert.x HTML5 Javascript Client
Javascript packages
Initializing the Heatmap
Creating the Vert.x EventBus
•  Create an instance of the vertx.EventBus object
•  Add an onopen listener, which registers an event bus handler for the address "dashboard"
•  The handler will receive all messages published to the "dashboard" address
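The publish/subscribe behaviour of the event bus can be illustrated with a tiny in-memory stand-in. To be clear, `TinyEventBus` is a made-up class and not the Vert.x API; it only shows that every handler registered for an address receives each message published to that address, and nothing else.

```scala
import scala.collection.mutable

// Illustrative stand-in for an event bus; this is NOT the Vert.x API
final class TinyEventBus {
  private val handlers = mutable.Map.empty[String, Vector[String => Unit]]

  // Register a handler for all messages published to `address`
  def registerHandler(address: String)(handler: String => Unit): Unit =
    handlers(address) = handlers.getOrElse(address, Vector.empty) :+ handler

  // Deliver the message to every handler registered for `address`
  def publish(address: String, message: String): Unit =
    handlers.getOrElse(address, Vector.empty).foreach(h => h(message))
}
```

In the dashboard, the Kafka consumer side plays the publisher and the browser client plays the handler registered for the "dashboard" address.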
Add Event Trip location points to Map
•  Parse JSON message
•  Add latitude and longitude points to heatmap
•  If cluster center is new then add marker
Spark and HBase
Part 4: using MapR-DB with HBase API
•  https://mapr.com/blog/monitoring-uber-pt4/
What Do We Need to Do?
Data sources -> collect data (stream topic, MapR-FS) -> process data -> store data (MapR-FS) -> serve data
MapR-DB (HBase API) is Designed to Scale
Fast reads and writes by key! Data is automatically partitioned by key range!
[Figure: a table split into three key ranges, each stored as its own set of rows with the columns Key, colB, colC]
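Partitioning by key range can be sketched with a sorted map from the start key of each range to the server that holds it. The range boundaries and server names below are invented for the example; the point is only that a lookup by row key goes straight to one partition.

```scala
import scala.collection.immutable.TreeMap

// Hypothetical layout: start key of each range -> server holding that range
val keyRanges: TreeMap[String, String] =
  TreeMap("" -> "server1", "h" -> "server2", "p" -> "server3")

// The owning range is the last one whose start key is <= the row key
def serverFor(rowKey: String): String =
  keyRanges.takeWhile { case (start, _) => start.compareTo(rowKey) <= 0 }.last._2
```

Because rows are sorted by key, a scan over a contiguous key range touches only the servers whose ranges overlap it.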
Store Lots of Data with NoSQL MapR-DB
Storage model:
•  RDBMS: normalized schema → joins for queries can cause a bottleneck
•  MapR-DB: de-normalized schema → data that is read together is stored together
Spark Streaming writing to MapR-DB (HBase API)
Spark HBase and MapR-DB Binary Connector
•  An HConnection object in every Spark executor allows for distributed parallel writes, reads, or scans
Spark HBase streamBulkPut
•  HBaseContext streamBulkPut method parameters: the message value DStream, the TableName to write to, and a function to convert the DStream values to HBase put records
Massively Parallel Writes to HBase
The Spark Streaming bulk put enables massively parallel sending of puts to HBase
HBase Schema
To use the Spark HBase Connector, you need to define the Catalog for the schema mapping between HBase and Spark
SparkSQL and DataFrames: Define the Schema
Define the Catalog for the schema mapping between HBase and Spark
Loading data from MapR-DB into a Spark DataFrame
Use Catalog defining schema
Spark Dataframes combine filters and select
The filter keeps rows whose cluster id (the beginning of the row key) is >= 9. The select picks a set of columns: key, lat, and lon.
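The same filter-then-select shape can be shown on an ordinary Scala collection. This is only an analogy to the DataFrame query: the `StoredTrip` type, the row key layout (cluster id, underscore, rest of the key), and the numeric comparison are assumptions made for the sketch.

```scala
// Hypothetical row layout: the key starts with the cluster id, e.g. "9_2014-08-01 00:00:00"
final case class StoredTrip(key: String, lat: Double, lon: Double, base: String)

// Assumption: the cluster id is the run of digits at the start of the row key
def clusterId(key: String): Int = key.takeWhile(_.isDigit).toInt

def filterAndSelect(rows: Seq[StoredTrip], minCluster: Int): Seq[(String, Double, Double)] =
  rows
    .filter(r => clusterId(r.key) >= minCluster) // the filter step
    .map(r => (r.key, r.lat, r.lon))             // the select step: key, lat, lon
```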
Building a Complete Data Architecture
[Figure: sources/apps feed stream processing and bulk processing on the MapR Converged Data Platform: MapR File System (MapR-XD), MapR Database (MapR-DB), and MapR Event Streams]
To Learn More:
•  MapR Free ODT http://learn.mapr.com/
MapR Blog
•  https://www.mapr.com/blog/
…helping you put data technology to work
●  Find answers
●  Ask technical questions
●  Join on-demand training course discussions
●  Follow release announcements
●  Share and vote on product ideas
●  Find Meetup and event listings
Connect with fellow Apache Hadoop and Spark professionals
community.mapr.com
[Figure: MapR Converged Data Platform. Open source engines and tools, commercial engines and applications, and custom apps run on enterprise-grade platform services: web-scale storage (MapR-XD), database (MapR-DB), and event streaming (MapR Event Streams), with unified management and monitoring, high availability, real time, unified security, multi-tenancy, disaster recovery, and a global namespace. Access is through the HDFS API, POSIX, NFS, HBase API, OJAI API, and Kafka API]
Q&A
ENGAGE WITH US
Ad

Recommended

PDF
Live Machine Learning Tutorial: Churn Prediction
MapR Technologies
 
PDF
Demystifying AI, Machine Learning and Deep Learning
Carol McDonald
 
PPTX
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
MapR Technologies
 
PDF
Streaming patterns revolutionary architectures
Carol McDonald
 
PPTX
Data Warehouse Modernization: Accelerating Time-To-Action
MapR Technologies
 
PDF
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Carol McDonald
 
PDF
Applying Machine Learning to Live Patient Data
Carol McDonald
 
PDF
An Introduction to the MapR Converged Data Platform
MapR Technologies
 
PPTX
Geo-Distributed Big Data and Analytics
MapR Technologies
 
PDF
How Big Data is Reducing Costs and Improving Outcomes in Health Care
Carol McDonald
 
PPTX
Best Practices for Data Convergence in Healthcare
MapR Technologies
 
PPTX
Converging your data landscape
MapR Technologies
 
PDF
Advanced Threat Detection on Streaming Data
Carol McDonald
 
PDF
Spark and MapR Streams: A Motivating Example
Ian Downard
 
PPTX
ML Workshop 2: Machine Learning Model Comparison & Evaluation
MapR Technologies
 
PPTX
3 Benefits of Multi-Temperature Data Management for Data Analytics
MapR Technologies
 
PDF
Predicting Flight Delays with Spark Machine Learning
Carol McDonald
 
PPTX
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
Mathieu Dumoulin
 
PPTX
MapR Product Update - Spring 2017
MapR Technologies
 
PDF
Meruvian - Introduction to MapR
The World Bank
 
PPTX
MapR Streams and MapR Converged Data Platform
MapR Technologies
 
PPTX
Machine Learning Success: The Key to Easier Model Management
MapR Technologies
 
PPTX
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
MapR Technologies
 
PPTX
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
MapR Technologies
 
PDF
MapR & Skytree:
MapR Technologies
 
PDF
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Carol McDonald
 
PPTX
Xactly: How to Build a Successful Converged Data Platform with Hadoop, Spark,...
MapR Technologies
 
PPTX
Build Big Data Enterprise Solutions Faster on Azure HDInsight
DataWorks Summit/Hadoop Summit
 
PDF
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Carol McDonald
 
PDF
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Carol McDonald
 

More Related Content

What's hot (20)

PPTX
Geo-Distributed Big Data and Analytics
MapR Technologies
 
PDF
How Big Data is Reducing Costs and Improving Outcomes in Health Care
Carol McDonald
 
PPTX
Best Practices for Data Convergence in Healthcare
MapR Technologies
 
PPTX
Converging your data landscape
MapR Technologies
 
PDF
Advanced Threat Detection on Streaming Data
Carol McDonald
 
PDF
Spark and MapR Streams: A Motivating Example
Ian Downard
 
PPTX
ML Workshop 2: Machine Learning Model Comparison & Evaluation
MapR Technologies
 
PPTX
3 Benefits of Multi-Temperature Data Management for Data Analytics
MapR Technologies
 
PDF
Predicting Flight Delays with Spark Machine Learning
Carol McDonald
 
PPTX
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
Mathieu Dumoulin
 
PPTX
MapR Product Update - Spring 2017
MapR Technologies
 
PDF
Meruvian - Introduction to MapR
The World Bank
 
PPTX
MapR Streams and MapR Converged Data Platform
MapR Technologies
 
PPTX
Machine Learning Success: The Key to Easier Model Management
MapR Technologies
 
PPTX
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
MapR Technologies
 
PPTX
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
MapR Technologies
 
PDF
MapR & Skytree:
MapR Technologies
 
PDF
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Carol McDonald
 
PPTX
Xactly: How to Build a Successful Converged Data Platform with Hadoop, Spark,...
MapR Technologies
 
PPTX
Build Big Data Enterprise Solutions Faster on Azure HDInsight
DataWorks Summit/Hadoop Summit
 
Geo-Distributed Big Data and Analytics
MapR Technologies
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
Carol McDonald
 
Best Practices for Data Convergence in Healthcare
MapR Technologies
 
Converging your data landscape
MapR Technologies
 
Advanced Threat Detection on Streaming Data
Carol McDonald
 
Spark and MapR Streams: A Motivating Example
Ian Downard
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
MapR Technologies
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
MapR Technologies
 
Predicting Flight Delays with Spark Machine Learning
Carol McDonald
 
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
Mathieu Dumoulin
 
MapR Product Update - Spring 2017
MapR Technologies
 
Meruvian - Introduction to MapR
The World Bank
 
MapR Streams and MapR Converged Data Platform
MapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
MapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
MapR Technologies
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
MapR Technologies
 
MapR & Skytree:
MapR Technologies
 
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Carol McDonald
 
Xactly: How to Build a Successful Converged Data Platform with Hadoop, Spark,...
MapR Technologies
 
Build Big Data Enterprise Solutions Faster on Azure HDInsight
DataWorks Summit/Hadoop Summit
 

Live Tutorial – Streaming Real-Time Events Using Apache APIs

  • 1. © 2017 MapR Technologies Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- Time Uber Data Using Apache APIs: Kafka, Spark, HBase Carol McDonald @caroljmcdonald
  • 2. © 2017 MapR Technologies Agenda Using an end-to-end distributed pipeline for real-time Uber data built on Apache APIs (Kafka, Spark, HBase), we will discuss: •  Why IOT? •  Why combine Machine Learning with IOT? •  What is Machine Learning? How do you do it? •  Why Spark with Machine Learning? •  What is Streaming? •  Why a Kafka-like Distributed Immutable Log? •  Why Spark Streaming? •  Why Kafka + WebSockets? •  Why NoSQL HBase? Note: the code example is mine; only the data is from Uber
  • 3. © 2017 MapR Technologies Why IOT? Lots of Things are Producing Streaming Data Data Collection Devices Smart Machinery Phones and Tablets Home Automation RFID Systems Digital Signage Security Systems Medical Devices
  • 4. © 2017 MapR Technologies What’s a Stream? A stream is an unbounded sequence of events carried from a set of producers to a set of consumers. Diagram labels: Producers, Events, Stream, Consumers
  • 5. © 2017 MapR Technologies Why Stream Processing? It’s becoming important to process events as they arrive. Diagram: a temperature event (6:05 P.M.: 90°) arrives on a stream topic, and the response is "Turn on the air conditioning!"
  • 6. © 2017 MapR Technologies Why combine IOT with Machine Learning? •  Audi and Daimler harness the power of deep learning in order to achieve their goal of building autonomous vehicles –  Using MapR platform to scale deep learning efforts https://mapr.com/company/press-releases/norcom-selects-mapr-deep-learning/ •  Audi's new A8 takes us further down the road to self-driving cars than ever before –  https://www.cnet.com/roadshow/news/audis-new-a8-is-designed-to-let-you-play-candy-crush-in-rush-hour-traffic-safely/
  • 7. © 2017 MapR Technologies Why combine IOT with Machine Learning? •  Cheaper sensors and machine learning are making it possible for doctors to rapidly apply smart medicine to their patients’ cases –  https://www.wsj.com/articles/the-smart-medicine-solution-to-the-health-care-crisis-1499443449
  • 8. © 2017 MapR Technologies Why combine IOT with Machine Learning? •  A Stanford team has shown that a machine-learning model can identify heart arrhythmias from an electrocardiogram (ECG) better than an expert –  https://www.technologyreview.com/s/608234/the-machines-are-getting-ready-to-play-doctor/
  • 9. © 2017 MapR Technologies Why combine IOT with Machine Learning? •  Connected care ensuring quicker Sepsis treatment: –  Blood pressures, pulse rates and oxygen levels from monitoring devices combined with algorithms to automatically calculate a score, and provide alerts –  http://www.computerweekly.com/news/450422258/Putting-sepsis-algorithms-into-electronic-patient-records
  • 10. © 2017 MapR Technologies Applying Machine Learning to Live Patient Data •  https://www.slideshare.net/caroljmcdonald/applying-machine-learning-to-live-patient-data
  • 11. © 2017 MapR Technologies Why combine IOT with Machine Learning? •  Smart Cities will be using 1.39 billion connected cars, IoT sensors, and devices by 2020 •  http://www.cisco.com/c/en/us/solutions/industries/smart-connected-communities.html
  • 12. © 2017 MapR Technologies Why combine IOT with Machine Learning? •  Uber Near Realtime Price Surging –  https://www.slideshare.net/ConfluentInc/kafka-uber-the-worlds-realtime-transit-infrastructure-aaron-schildkrout •  machine learning & geolocation data is being used in: –  telecom, travel, marketing, and manufacturing –  identify patterns and trends: –  recommendations, anomaly detection, and fraud.
  • 13. © 2017 MapR Technologies Why combine Streaming Events with Machine Learning? Fraud detection Smart Machinery Utility Smart Meters Home Automation Networks Manufacturing Security Systems Patient Monitoring
  • 14. © 2017 MapR Technologies What if BP had detected problems before the oil hit the water ? •  1M samples/sec •  High performance at scale is necessary!
  • 15. © 2017 MapR Technologies End to End Application Architecture
  • 16. © 2017 MapR Technologies Part 1: Spark Machine Learning •  End to End Application for Monitoring Uber Data using Spark ML •  https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-1/
  • 17. © 2017 MapR Technologies What is Machine Learning? Data Build ModelTrain Algorithm Finds patterns New Data Use Model (prediction function) Predictions Contains patterns Recognizes patterns
  • 18. © 2017 MapR Technologies ML Discovery Model Building Model Training/ Building Training Set Test Model Predictions Test Set Evaluate Results Historical Data Deployed Model Predictions Data Discovery, Model Creation Production Feature Extraction Feature Extraction ●  Churn Modelling Uber trips Stream TopicUber trips New Data
  • 19. © 2017 MapR Technologies Examples of ML Algorithms Supervised •  Classification –  Naïve Bayes –  SVM –  Random Decision Forests •  Regression –  Linear –  Logistic Machine Learning Unsupervised •  Clustering –  K-means •  Dimensionality reduction –  Principal Component Analysis –  SVD
  • 20. © 2017 MapR Technologies Supervised Algorithms use labeled data Data features Build Model New Data features Predict Use Model
  • 21. © 2017 MapR Technologies Supervised Machine Learning: Classification & Regression Classification Identifies category for item
  • 22. © 2017 MapR Technologies Classification: Definition Form of ML that: •  Identifies which category an item belongs to •  Uses supervised learning algorithms –  Data is labeled Sentiment
  • 23. © 2017 MapR Technologies If it Walks/Swims/Quacks Like a Duck …… Then It Must Be a Duck swims walks quacks Features: walks quacks swims Features:
  • 24. © 2017 MapR Technologies Car Insurance Fraud Example •  What are we trying to predict? –  This is the Label or Target outcome: –  The amount of Fraud •  What are the “if questions” or properties we can use to predict? –  These are the Features: –  The claim Amount
  • 25. © 2017 MapR Technologies Car Insurance Fraud Regression Example Label (Y axis): amount of fraud. Feature (X axis): claimed amount. Data point: (fraud amount, claimed amount). AmntFraud = intercept + coeff * claimedAmnt
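The line equation on this slide can be fitted with ordinary least squares. The following is a minimal plain-Scala sketch, not the deck's Spark code; the object name `FraudRegression` and its methods are illustrative assumptions, and any data passed in would be made up.

```scala
// Sketch only: fits AmntFraud = intercept + coeff * claimedAmnt
// with ordinary least squares on a single feature.
object FraudRegression {
  /** Each point is (claimedAmnt, fraudAmnt). Returns (intercept, coeff). */
  def fit(points: Seq[(Double, Double)]): (Double, Double) = {
    val n = points.size.toDouble
    val meanX = points.map(_._1).sum / n
    val meanY = points.map(_._2).sum / n
    // Covariance of feature and label, and variance of the feature
    val covXY = points.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = points.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    val coeff = covXY / varX
    val intercept = meanY - coeff * meanX
    (intercept, coeff)
  }

  /** Applies the fitted line to a new claimed amount. */
  def predict(model: (Double, Double), claimedAmnt: Double): Double =
    model._1 + model._2 * claimedAmnt
}
```

In Spark the same idea would normally go through an ML regression estimator rather than hand-written arithmetic; the sketch only shows what the intercept and coefficient mean.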
  • 26. © 2017 MapR Technologies Credit Card Fraud Example •  What are we trying to predict? –  This is the Label: –  The probability of Fraud •  What are the “if questions” or properties we can use to predict? –  These are the Features: –  transaction amount, type of merchant, distance from and time since last transaction
  • 27. © 2017 MapR Technologies Credit Card Fraud Logistic Regression Example Label (Y axis): probability of fraud, from 0 (not fraud) to 1 (fraud), with .5 as the midpoint. Features (X axis): transaction amount, type of store, time and location difference since the last transaction.
  • 28. © 2017 MapR Technologies Supervised Learning: Classification & Regression •  Classification: –  identifies which category (eg fraud or not fraud) •  Linear Regression: –  predicts a value (eg amount of fraud) •  Logistic Regression: –  predicts a probability (eg probability of fraud)
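To make the difference between predicting a value and predicting a probability concrete, here is a small plain-Scala sketch of the logistic function applied to a weighted sum of features. The object name `FraudLogistic`, the weights, and the 0.5 threshold are assumptions for illustration; they are not taken from the deck.

```scala
// Sketch only: logistic regression scoring with hypothetical weights.
object FraudLogistic {
  /** Squashes any real number into the range (0, 1). */
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  /** Probability of fraud for one transaction's feature values. */
  def probability(features: Seq[Double], weights: Seq[Double], intercept: Double): Double =
    sigmoid(intercept + features.zip(weights).map { case (x, w) => x * w }.sum)

  /** Turns the probability into a category (classification). */
  def isFraud(p: Double, threshold: Double = 0.5): Boolean = p >= threshold
}
```

The thresholding step is what turns the regression output into the fraud / not fraud classification described on the slide.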
  • 29. © 2017 MapR Technologies Examples of ML Algorithms Machine Learning Unsupervised •  Clustering –  K-means •  Dimensionality reduction –  Principal Component Analysis –  SVD Supervised •  Classification –  Naïve Bayes –  SVM –  Random Decision Forests •  Regression –  Linear –  Logistic
  • 30. © 2017 MapR Technologies Unsupervised Algorithms use Unlabeled data Customer GroupsBuild ModelTrain Algorithm Finds patterns New Customer Purchase Data Use Model (prediction function) Predict Group Contains patterns Recognizes patterns Customer purchase data
  • 31. © 2017 MapR Technologies Unsupervised Machine Learning: Clustering Clustering group news articles into different categories
  • 32. © 2017 MapR Technologies Clustering: Definition •  Unsupervised learning task •  Groups objects into clusters of high similarity
  • 33. © 2017 MapR Technologies Clustering: Definition •  Unsupervised learning task •  Groups objects into clusters of high similarity –  Search results grouping –  Grouping of customers, patients –  Text categorization –  recommendations •  Anomaly detection: find what’s not similar
  • 34. © 2017 MapR Technologies Clustering: Example •  Group similar objects
  • 35. © 2017 MapR Technologies Clustering: Example •  Group similar objects •  Use MLlib K-means algorithm 1.  Initialize coordinates to center of clusters (centroid) x x x x x
  • 36. © 2017 MapR Technologies Clustering: Example •  Group similar objects •  Use MLlib K-means algorithm 1.  Initialize coordinates to center of clusters (centroid) 2.  Assign all points to nearest centroid x x x x x
  • 37. © 2017 MapR Technologies Clustering: Example •  Group similar objects •  Use MLlib K-means algorithm 1.  Initialize coordinates to center of clusters (centroid) 2.  Assign all points to nearest centroid 3.  Update centroids to center of points x x x x x
  • 38. © 2017 MapR Technologies Clustering: Example •  Group similar objects •  Use MLlib K-means algorithm 1.  Initialize coordinates to center of clusters (centroid) 2.  Assign all points to nearest centroid 3.  Update centroids to center of points 4.  Repeat until conditions met x x x x x
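The four steps above can be sketched in plain Scala on 2-D points. This is an illustration of the algorithm only, not MLlib's implementation; `KMeansSketch` and its method names are assumptions, and the initial centroids are supplied by the caller rather than chosen automatically.

```scala
// Sketch only: the K-means loop from the slide on (x, y) points.
object KMeansSketch {
  type Point = (Double, Double)

  private def dist2(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  /** Step 2: index of the centroid nearest to p. */
  def nearest(p: Point, centroids: Seq[Point]): Int =
    centroids.indices.minBy(i => dist2(p, centroids(i)))

  /** Step 3: move each centroid to the mean of the points assigned to it. */
  def update(points: Seq[Point], centroids: Seq[Point]): Seq[Point] = {
    val groups = points.groupBy(p => nearest(p, centroids))
    centroids.indices.map { i =>
      groups.get(i) match {
        case Some(ps) => (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)
        case None     => centroids(i) // an empty cluster keeps its centroid
      }
    }
  }

  /** Step 4: repeat until the centroids stop moving or maxIter is reached. */
  def fit(points: Seq[Point], init: Seq[Point], maxIter: Int): Seq[Point] = {
    var centroids = init // step 1: caller-supplied starting centroids
    var iter = 0
    var changed = true
    while (changed && iter < maxIter) {
      val next = update(points, centroids)
      changed = next != centroids
      centroids = next
      iter += 1
    }
    centroids
  }
}
```

The `maxIter` argument plays the same role as `setMaxIter` on the Spark `KMeans` estimator shown later in the deck.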
  • 39. © 2017 MapR Technologies Cluster Uber Trip Locations
  • 40. © 2017 MapR Technologies Uber Data •  Date/Time: The date and time of the Uber pickup •  Lat: The latitude of the Uber pickup •  Lon: The longitude of the Uber pickup •  Base: The TLC base company affiliated with the Uber pickup The Data Records are in CSV format. An example line is shown below: •  2014-08-01 00:00:00,40.729,-73.9422,B02598
  • 41. © 2017 MapR Technologies Uber Example •  What are the “if questions” or properties we can use to group? –  These are the Features: –  Latitude, longitude, day of the week, time, rush hour …
  • 42. © 2017 MapR Technologies Spark ML workflow
  • 43. © 2017 MapR Technologies Zeppelin Notebook with Spark Data Engineer Data Scientist
  • 44. © 2017 MapR Technologies Load the data into a Dataframe: Define the Schema case class Uber(dt: String, lat: Double, lon: Double, base: String) val schema = StructType(Array( StructField("dt", TimestampType, true), StructField("lat", DoubleType, true), StructField("lon", DoubleType, true), StructField("base", StringType, true) )) Input Comma Separated Values: datetime, latitude, longitude, base 2014-08-01 00:00:00,40.729,-73.9422,B02598
  • 45. © 2017 MapR Technologies Data Frame Load data Load the data into a Dataset val train: Dataset[Uber] = spark.read.option("inferSchema", "false") .schema(schema).csv("uber.csv").as[Uber]
  • 46. © 2017 MapR Technologies Dataset merged with Dataframe •  in Spark 2.0, DataFrame APIs merged with Datasets APIs •  A Dataset is a collection of typed objects •  A DataFrame is a Dataset of generic Row objects
  • 47. © 2017 MapR Technologies Spark Distributed Datasets Dataset W Executor P4 W Executor P1 P3 W Executor P2 partitioned Partition 1 8213034705, 95, 2.927373, jake7870, 0…… Partition 2 8213034705, 115, 2.943484, Davidbresler2, 1…. Partition 3 8213034705, 100, 2.951285, gladimacowgirl, 58… Partition 4 8213034705, 117, 2.998947, daysrus, 95…. •  Read only collection of typed objects •  Partitioned across a cluster •  Operated on in parallel •  Cached in memory
  • 48. © 2017 MapR Technologies Spark Distributed Datasets Spark revolves around RDDs •  Read only collection of elements •  Partitioned across a cluster •  Operated on in parallel •  Cached in memory
  • 49. © 2017 MapR Technologies Extract the Features Image reference O’Reilly Learning Spark + + ̶+ ̶ ̶ Feature Vectors Model Featurization Training Model Evaluation Best Model Training Data + + ̶+ ̶ ̶ + + ̶+ ̶ ̶ + + ̶+ ̶ ̶ + + ̶+ ̶ ̶ Feature Vectors are vectors of numbers representing the value for each feature
  • 50. © 2017 MapR Technologies Data Frame Load data Add column DataFrame + Features Use VectorAssembler to put features in vector column val featureCols = Array("lat", "lon") val assembler = new VectorAssembler() .setInputCols(featureCols) .setOutputCol("features")
  • 51. © 2017 MapR Technologies Data Frame Load data transform Estimator val kmeans = new KMeans() .setK(8) .setFeaturesCol("features") .setMaxIter(5) Create Kmeans Estimator, Set Features DataFrame + Features
  • 52. © 2017 MapR Technologies Data Frame Load data transform Estimator val Array(trainingData, testData) = df2.randomSplit(Array(0.7, 0.3), 5043) val model = kmeans.fit(trainingData) Create Kmeans Estimator, Set Features DataFrame + Features fit fitted model input
  • 53. © 2017 MapR Technologies Data Frame Load data transform Estimator model.clusterCenters.foreach(println) [40.76930621976264,-73.96034885367698] [40.67562793272868,-73.79810579052476] [40.68848772848041,-73.9634449047477] [40.78957777777776,-73.14270740740741] [40.32418330308531,-74.18665245009073] [40.732808848486286,-74.00150153727878] [40.75396549974632,-73.57692359208531] [40.901700842900674,-73.868760398198] Create Kmeans Estimator, Set Features DataFrame + Features fit fitted model input
  • 54. © 2017 MapR Technologies Evaluate Clusters from K-Means Estimator: transform the test features with the fitted model val clusters = model.transform(testData) DataFrame + Features → DataFrame + Features + prediction
  • 55. © 2017 MapR Technologies Kafka API and Streaming Data
  • 56. © 2017 MapR Technologies Part 2: MapR Event Streams with Kafka API and Spark Streaming •  End to End Application for Monitoring Uber Data using Spark ML •  https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-2/
  • 57. © 2017 MapR Technologies Serve DataStore DataCollect Data What Do We Need to Do ? Process DataData Sources ? ? ? ?
  • 58. © 2017 MapR Technologies Collect the Data Data Ingest MapR-FS Source Stream Topic •  Data Ingest: –  Network Based: MapR Streams, Kafka, Kinesis, Twitter, Sockets... –  File Based: NFS with MapR-FS, HDFS
  • 59. © 2017 MapR Technologies Organize Data into Topics with MapR Streams Topics Organize Events into Categories and Decouple Producers from Consumers Consumers MapR Cluster Topic: Pressure Topic: Temperature Topic: Warnings Consumers Consumers Kafka API Kafka API
  • 60. © 2017 MapR Technologies Scalable Messaging with MapR Streams Server 1 Partition1: Topic - Pressure Partition1: Topic - Temperature Partition1: Topic - Warning Server 2 Partition2: Topic - Pressure Partition2: Topic - Temperature Partition2: Topic - Warning Server 3 Partition3: Topic - Pressure Partition3: Topic - Temperature Partition3: Topic - Warning Topics are partitioned for throughput and scalability
  • 61. © 2017 MapR Technologies Scalable Messaging with MapR Streams Partition1: Topic - Pressure Partition1: Topic - Temperature Partition1: Topic - Warning Partition2: Topic - Pressure Partition2: Topic - Temperature Partition2: Topic - Warning Partition3: Topic - Pressure Partition3: Topic - Temperature Partition3: Topic - Warning Producers are load balanced between partitions Kafka API
  • 62. © 2017 MapR Technologies Scalable Messaging with MapR Streams Partition1: Topic - Pressure Partition1: Topic - Temperature Partition1: Topic - Warning Partition2: Topic - Pressure Partition2: Topic - Temperature Partition2: Topic - Warning Partition3: Topic - Pressure Partition3: Topic - Temperature Partition3: Topic - Warning Consumers Consumers Consumers Consumer groups can read in parallel Kafka API
  • 63. © 2017 MapR Technologies Partition is like a Queue Consumers MapR Cluster Topic: Admission / Server 1 Topic: Admission / Server 2 Topic: Admission / Server 3 Consumers Consumers Partition 1 New Messages are appended to the end Partition 2 Partition 3 6 5 4 3 2 1 3 2 1 5 4 3 2 1 Producers Producers Producers New Message 6 5 4 3 2 1 Old Message
  • 64. © 2017 MapR Technologies Events are delivered in the order they are received, like a queue. Diagram: producers append messages 1 to 6 to a partition in the MapR cluster; each consumer group has its own read cursor.
  • 65. © 2017 MapR Technologies Unlike a queue, events are persisted even after they’re delivered Messages remain on the partition, available to other consumers Minimizes Non-Sequential disk read-writes MapR Cluster (1 Server) Topic: Warning Partition 1 3 2 1 Unread Events Get Unread 3 2 1 Client Library ConsumerPoll
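The queue-like ordering and the persistence after delivery described on the last three slides can be modelled in a few lines. This is an in-memory stand-in written for illustration, not the Kafka or MapR Streams API; the class name `PartitionLog` and its methods are assumptions.

```scala
import scala.collection.mutable

// Sketch only: one topic partition as an append-only log with
// an independent read cursor per consumer group.
class PartitionLog[A] {
  private val events = mutable.ArrayBuffer.empty[A]
  private val cursors = mutable.Map.empty[String, Int]

  /** New messages are appended to the end; returns the message offset. */
  def append(event: A): Int = {
    events += event
    events.size - 1
  }

  /** Returns the events this group has not read yet, in arrival order,
    * and advances the group's cursor. Nothing is removed from the log. */
  def poll(group: String): List[A] = {
    val from = cursors.getOrElse(group, 0)
    val unread = events.slice(from, events.size).toList
    cursors(group) = events.size
    unread
  }
}
```

Because polling only moves a cursor, a second consumer group can read the same messages later for a different purpose.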
  • 66. © 2017 MapR Technologies How do we do this with High Performance at Scale? Parallel operations and minimize disk read/write time
  • 67. © 2017 MapR Technologies Processing Same Message for Different Purposes Consumers Consumers Consumers Producers Producers Producers MapR-FS Kafka API Kafka API
  • 68. © 2017 MapR Technologies Use the Model with Streaming Data
  • 69. © 2017 MapR Technologies Collect Data Process the Data with Spark Streaming and Spark Machine Learning Process Data Stream Topic •  Extension of the core Spark API •  Enables scalable, high-throughput, fault-tolerant stream processing of live data
  • 70. © 2017 MapR Technologies ML Discovery Model Building Model Training/ Building Training Set Test Model Predictions Test Set Evaluate Results Historical Data Deployed Model Predictions Data Discovery, Model Creation Production Feature Extraction Feature Extraction ●  Churn Modelling Uber trips Stream TopicUber trips New Data
  • 71. © 2017 MapR Technologies Use Case: Real-Time Analysis of Geographically Clustered Vehicles Uber trip data enrich with K-means Cluster location Stream Topic Stream Topic Spark Streaming Spark Streaming Write to MapR-DB SQL
  • 72. © 2017 MapR Technologies Use Case: Time Series Data Uber trip data Stream Topic 2014-08-01 00:00:00, 40.729,-73.9422,B02598 {"dt":"2014-08-01 00:00:00.0", "lat":40.3495,"lon":-74.0667, "base":"B02682","cluster":5} Enrich with K-means cluster id Spark Streaming read Stream Topic
  • 73. © 2017 MapR Technologies Processing Spark DStreams Data stream divided into batches of X milliseconds = DStreams
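As a rough picture of what "batches of X milliseconds" means, the plain-Scala sketch below groups timestamped events by batch interval. It imitates the idea only; Spark Streaming does this internally, and `MicroBatch.batches` is a hypothetical helper, not a Spark API.

```scala
// Sketch only: splitting a stream of (timestampMillis, value) events
// into fixed-width batches, ordered by batch start time.
object MicroBatch {
  def batches[A](events: Seq[(Long, A)], batchMillis: Long): List[(Long, List[A])] =
    events
      .groupBy { case (ts, _) => ts / batchMillis * batchMillis } // batch start time
      .toList
      .sortBy(_._1)
      .map { case (start, evs) => (start, evs.map(_._2).toList) }
}
```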
  • 74. © 2017 MapR Technologies Function to Parse the Message Data to Uber Objects 2014-08-01 00:00:00, 40.729,-73.9422,B02598
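The code for this slide did not survive in the transcript. A parse function along these lines would fit the `Uber` case class defined earlier in the deck; the name `parseUber` and the choice to return `Option` for malformed lines are assumptions, not the author's code.

```scala
// Case class as defined earlier in the deck
case class Uber(dt: String, lat: Double, lon: Double, base: String)

object UberParser {
  /** Parses "datetime,lat,lon,base". Returns None when the line does not
    * have exactly four fields or the coordinates are not numeric. */
  def parseUber(line: String): Option[Uber] =
    line.split(",").map(_.trim) match {
      case Array(dt, lat, lon, base) =>
        scala.util.Try(Uber(dt, lat.toDouble, lon.toDouble, base)).toOption
      case _ => None
    }
}
```

Returning `Option` lets a streaming job drop bad records with `flatMap` instead of failing the batch.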
  • 75. © 2017 MapR Technologies Load the saved model // load model for getting clusters val model = KMeansModel.load(modelpath)
  • 76. © 2017 MapR Technologies Create a DStream DStream: a sequence of RDDs representing a stream of data val messagesDStream = KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent, consumerStrategy) // get message values from the consumer records and parse to Uber objects val uDStream = messagesDStream.map(_.value()).map(_.split(",")) .map(p => Uber(p(0), p(1).toDouble, p(2).toDouble, p(3))) batch time 0 to 1 batch time 1 to 2 batch time 2 to 3 dStream Stored in memory as an RDD
  • 77. © 2017 MapR Technologies Parse message txt to Uber Object and convert to DataFrame uDStream.foreachRDD{ rdd => val df = rdd.toDF() // get cluster centers and add to df // send to Topic } ssc.start() ssc.awaitTermination()
  • 78. © 2017 MapR Technologies Enrich Data with Cluster
  • 79. © 2017 MapR Technologies Convert to JSON send to Topic, Send the Enriched Message
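The code for the enrich and JSON slides is also missing from the transcript. The sketch below shows the two steps in plain Scala under stated assumptions: the cluster centers are the eight values printed by `model.clusterCenters` earlier in the deck, the cluster id is taken to be the index of the nearest center, and the JSON is built by hand with the field names from the Time Series slide. `Enrich`, `clusterOf` and `toJson` are illustrative names; the real application would call the loaded `KMeansModel` instead.

```scala
// Sketch only: assign the nearest cluster center and emit a JSON message.
object Enrich {
  // Centers copied from the model.clusterCenters output shown earlier
  val centers: Vector[(Double, Double)] = Vector(
    (40.76930621976264, -73.96034885367698),
    (40.67562793272868, -73.79810579052476),
    (40.68848772848041, -73.9634449047477),
    (40.78957777777776, -73.14270740740741),
    (40.32418330308531, -74.18665245009073),
    (40.732808848486286, -74.00150153727878),
    (40.75396549974632, -73.57692359208531),
    (40.901700842900674, -73.868760398198)
  )

  /** Index of the center with the smallest squared distance to (lat, lon). */
  def clusterOf(lat: Double, lon: Double): Int =
    centers.indices.minBy { i =>
      val (cLat, cLon) = centers(i)
      val dLat = lat - cLat
      val dLon = lon - cLon
      dLat * dLat + dLon * dLon
    }

  /** Builds the enriched message; assumes dt and base need no JSON escaping. */
  def toJson(dt: String, lat: Double, lon: Double, base: String): String = {
    val cid = clusterOf(lat, lon)
    s"""{"dt":"$dt","lat":$lat,"lon":$lon,"base":"$base","cluster":$cid}"""
  }
}
```

A production job would use a JSON library rather than string interpolation, so that values are escaped correctly.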
  • 80. © 2017 MapR Technologies Process DStream Streaming Application Output DStream RDDs batch time 2 to 3 batch time 1 to 2 batch time 0 to 1 Value DStream RDDs Transformed RDDs map map map Stream Topic
  • 81. © 2017 MapR Technologies Real Time Dashboard
  • 82. © 2017 MapR Technologies Part 3: Realtime Dashboard using Vert.x •  End to End Application for Monitoring Uber Data using Spark ML •  https://mapr.com/blog/monitoring-uber-with-spark-streaming-kafka-and-vertx/
  • 83. © 2017 MapR Technologies Serve DataCollect Data What Do We Need to Do ? MapR-FS Process DataData Sources Stream Topic
  • 84. © 2017 MapR Technologies Use Case: Real-Time Analysis of Geographically Clustered Vehicles Uber trip data enrich with K-means Cluster location Stream Topic Stream Topic Spark Streaming Spark Streaming Write to MapR-DB SQL
  • 85. © 2017 MapR Technologies The Vert.x toolkit and Web Application Architecture •  Event-driven •  Event Bus •  Verticles single threaded
  • 86. © 2017 MapR Technologies Use Case Dashboard
  • 87. © 2017 MapR Technologies Dashboard Architecture
  • 88. © 2017 MapR Technologies Create a Vert.x Service create a Router object, which routes HTTP request URLs to handlers
  • 89. © 2017 MapR Technologies Create a Vert.x Service Route paths that match /eventbus/* to be associated with an event bus bridge SockJSHandler
  • 90. © 2017 MapR Technologies Create a Vert.x Service create an HttpServer object tell the server to listen on the configured port for incoming requests
  • 91. © 2017 MapR Technologies Dashboard Architecture
  • 92. © 2017 MapR Technologies Vert.x Service Kafka consumer
  • 93. © 2017 MapR Technologies Vert.x Service Kafka consumer Create Kafka Consumer Subscribe to Uber topic
  • 94. © 2017 MapR Technologies Vert.x Service Kafka consumer Publish received messages to the Vert.x event bus address “dashboard.”
  • 95. © 2017 MapR Technologies The Dashboard Vert.x HTML5 Javascript Client
  • 96. © 2017 MapR Technologies Javascript packages
  • 97. © 2017 MapR Technologies Initializing the Heatmap
  • 98. © 2017 MapR Technologies Dashboard Architecture
  • 99. © 2017 MapR Technologies Creating the Vertx EventBus •  create an instance of the vertx.EventBus object •  add an onopen listener, which registers an event bus handler for the address “dashboard.” •  handler will receive all messages published to the “dashboard” address
  • 100. © 2017 MapR Technologies Add Event Trip location points to Map
  • 101. © 2017 MapR Technologies Add Event Trip location points to Map Parse JSON message
  • 102. © 2017 MapR Technologies Add Event Trip location points to Map Add latitude and longitude points to heatmap
  • 103. © 2017 MapR Technologies Add Event Trip location points to Map If cluster center is new then add marker
  • 104. © 2017 MapR Technologies Spark and HBase
  • 105. © 2017 MapR Technologies Part 4: using MapR-DB with HBase API •  https://mapr.com/blog/monitoring-uber-pt4/
  • 106. © 2017 MapR Technologies Serve DataStore DataCollect Data What Do We Need to Do ? MapR-FS Process DataData Sources MapR-FS Stream Topic
  • 107. © 2017 MapR Technologies Use Case: Real-Time Analysis of Geographically Clustered Vehicles Uber trip data enrich with K-means Cluster location Stream Topic Stream Topic Spark Streaming Spark Streaming Write to MapR-DB SQL
  • 108. © 2017 MapR Technologies MapR-DB (HBase API) is Designed to Scale Key Range xxxx xxxx Key Range xxxx xxxx Key Range xxxx xxxx Fast Reads and Writes by Key! Data is automatically partitioned by Key Range! Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val
  • 109. © 2017 MapR Technologies Store Lots of Data with NoSQL MapR-DB bottleneck Storage ModelRDBMS MapR-DB Normalized schema à Joins for queries can cause bottleneck De-Normalized schema à Data that is read together is stored together Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val
  • 110. © 2017 MapR Technologies Spark Streaming writing to MapR-DB (HBase API)
  • 111. © 2017 MapR Technologies Spark HBase and MapR-DB Binary Connector •  HConnection object in every Spark Executor: •  allowing for distributed parallel writes, reads, or scans
  • 112. © 2017 MapR Technologies Spark HBase streamBulkPut •  HBaseContext streamBulkPut method parameters: •  the message value DStream, the TableName to write to, and a function to convert the DStream values to HBase put records.
  • 113. © 2017 MapR Technologies Massively Parallel writes to HBase The Spark Streaming bulk put enables massively parallel sending of puts to HBase
  • 114. © 2017 MapR Technologies HBase Schema To use the Spark HBase Connector, you need to define the Catalog for the schema mapping between HBase and Spark
  • 115. © 2017 MapR Technologies SparkSQL and DataFrames: Define the Schema Define the Catalog for the schema mapping between HBase and Spark
  • 116. © 2017 MapR Technologies Loading data from MapR-DB into a Spark DataFrame Use Catalog defining schema
  • 117. © 2017 MapR Technologies Spark Dataframes combine filters and select filters rows for cluster ids (the beginning of the row key) >= 9. The select selects a set of columns: key, lat, and lon.
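The filter on this slide works because the cluster id is at the beginning of the row key and rows are stored sorted by key. The plain-Scala sketch below imitates that with a sorted map; it is not the Spark HBase connector. The key layout (cluster id, underscore, pickup time), `RowKeySketch`, `Trip` and `scanFrom` are assumptions for illustration.

```scala
import scala.collection.immutable.TreeMap

// Sketch only: a sorted map standing in for a table ordered by row key.
object RowKeySketch {
  case class Trip(lat: Double, lon: Double, base: String)

  /** Hypothetical composite key: cluster id first, then pickup time. */
  def rowKey(cluster: Int, dt: String): String = s"${cluster}_$dt"

  /** Rows whose key is >= start, keeping only key, lat and lon. */
  def scanFrom(table: TreeMap[String, Trip], start: String): List[(String, Double, Double)] =
    table.iteratorFrom(start).map { case (k, t) => (k, t.lat, t.lon) }.toList
}
```

Keys compare as strings in this sketch, so a key starting with "10" sorts before "9"; zero-padding the cluster id would be one way to keep numeric and key order aligned.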
  • 118. © 2017 MapR Technologies Stream Processing Building a Complete Data Architecture MapR File System (MapR-XD) MapR Converged Data Platform MapR Database (MapR-DB) MapR Event Streams Sources/Apps Bulk Processing
  • 119. © 2017 MapR Technologies
  • 120. © 2017 MapR Technologies To Learn More: •  MapR Free ODT http://learn.mapr.com/
  • 121. © 2017 MapR Technologies MapR Blog • https://www.mapr.com/blog/
  • 122. © 2017 MapR Technologies …helping you put data technology to work ●  Find answers ●  Ask technical questions ●  Join on-demand training course discussions ●  Follow release announcements ●  Share and vote on product ideas ●  Find Meetup and event listings Connect with fellow Apache Hadoop and Spark professionals community.mapr.com
  • 123. © 2017 MapR Technologies Open Source Engines & Tools Commercial Engines & Applications Enterprise-Grade Platform Services Data Processing Web-Scale Storage MapR-XD MapR-DB Search and Others Real Time Unified Security Multi-tenancy Disaster Recovery Global Namespace High Availability MapR Event Streams Cloud and Managed Services Search and Others Unified Management and Monitoring Search and Others Event Streaming Database Custom Apps MapR Converged Data Platform HDFS API POSIX, NFS Kafka API HBase API OJAI API
  • 124. © 2017 MapR Technologies Q&A ENGAGE WITH US