© 2016 MapR Technologies
Exploring Data Pipelines for Spark Streaming Applications
Carol McDonald, Industry Solutions Architect
2016
What is Streaming Data? Got Some Examples?
Data collection devices:
•  Smart Machinery
•  Phones and Tablets
•  Home Automation
•  RFID Systems
•  Digital Signage
•  Security Systems
•  Medical Devices
Why Stream Processing?
Temperature readings:
6:01 P.M.: 72°
6:02 P.M.: 75°
6:03 P.M.: 77°
6:04 P.M.: 85°
6:05 P.M.: 90°
6:06 P.M.: 85°
6:07 P.M.: 77°
6:08 P.M.: 75°
Analyze (as a batch, after the fact): "It was hot (90°) at 6:05 yesterday!"
Batch processing may be too late for some events
Why Stream Processing?
[Diagram: the reading "6:05 P.M.: 90°" is published to the stream topic "Temperature", and the consumer responds "Turn on the air conditioning!"]
It's becoming important to process events as they arrive
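The difference between analyzing after the fact and reacting on arrival can be sketched in plain Java, using the temperature readings from the slides. This is only an illustration: the class name, the method names, and the 88° threshold are assumptions, not part of the original deck, and no streaming library is involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; the threshold value is an assumption.
final class TemperatureAlerts {
    static final int THRESHOLD_F = 88;

    // Streaming style: decide for each event at the moment it arrives.
    static boolean shouldCool(int degrees) {
        return degrees >= THRESHOLD_F;
    }

    // Replays the readings in arrival order and records when an alert fires.
    static List<String> alertTimes(String[] times, int[] degrees) {
        List<String> alerts = new ArrayList<>();
        for (int i = 0; i < degrees.length; i++) {
            if (shouldCool(degrees[i])) {
                alerts.add(times[i]);
            }
        }
        return alerts;
    }

    public static void main(String[] args) {
        String[] times = {"6:01", "6:02", "6:03", "6:04", "6:05", "6:06", "6:07", "6:08"};
        int[] degrees = {72, 75, 77, 85, 90, 85, 77, 75};
        System.out.println("Turn on the air conditioning at: " + alertTimes(times, degrees));
    }
}
```

With a per-event check, the alert is available as soon as the 6:05 reading arrives, rather than the next day when a batch job reports the maximum.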
Key to Real Time: Event-based Data Flows
•  Web events
•  Machine sensors
•  Biometrics
•  Mobile events
•  etc.
What if BP had detected problems before the oil hit the water?
•  1M samples/sec
•  High performance at scale is necessary!
Use Case: Time Series Data
[Diagram: Sensor → time-stamped data → Stream Topic → read → Spark Streaming (Spark processing) → data for real-time monitoring]
Schema
•  All events are stored in column family (CF) data, which could be set to expire old data
•  Filtered alerts are put in CF alerts
•  Daily summaries are put in CF stats
Row key               | CF data: hz … psi | CF alerts: psi … | CF stats: hz_avg … psi_min
COHUTTA_3/10/14_1:01  | 10.37 … 84        | 0                |
COHUTTA_3/10/14       |                   |                  | 10 … 0
The row key contains the oil pump name, the date, and a time stamp
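The row key layout described above can be sketched as simple string composition. The helper names `eventKey` and `dailyKey` are hypothetical; only the key format (pump name, date, and time stamp joined by underscores) comes from the slide.

```java
// Hypothetical helpers; only the key layout is taken from the schema slide.
final class RowKeys {
    // Key for one sensor event: pump name, date, and time stamp.
    static String eventKey(String pump, String date, String time) {
        return pump + "_" + date + "_" + time;
    }

    // Key for the daily summary row: pump name and date only.
    static String dailyKey(String pump, String date) {
        return pump + "_" + date;
    }

    public static void main(String[] args) {
        System.out.println(eventKey("COHUTTA", "3/10/14", "1:01"));
        System.out.println(dailyKey("COHUTTA", "3/10/14"));
    }
}
```

Because the daily key is a prefix of that day's event keys, all rows for one pump and day sit next to each other when keys are kept in sorted order, which is what makes reads by key range efficient.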
What Do We Need to Do?
[Diagram: Data Sources → Collect Data (?) → Process Data (?) → Store Data (?) → Serve Data (?)]
How do we do this with High Performance at Scale?
•  Run operations in parallel and minimize disk read/write time
Collect the Data
[Diagram: Source → Data Ingest → Stream Topic, MapR-FS]
•  Data Ingest:
–  File Based: NFS with MapR-FS, HDFS
–  Network Based: MapR Streams, Kafka, Kinesis, Twitter, Sockets...
MapR Streams Publish Subscribe Messaging
Topics organize events into categories and decouple producers from consumers
Scalable Messaging with MapR Streams
Topics are partitioned for throughput and scalability
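As a rough sketch of how partitioning spreads load, the following plain-Java example maps a message key to one of N partitions. It is a simplification under stated assumptions: the class and method names are hypothetical, and `String.hashCode` stands in for whatever hash the real client library applies to the key.

```java
// Hypothetical sketch of key-based partitioning; not the client library's code.
final class TopicPartitioning {
    // Maps a key to a partition index in the range [0, numPartitions).
    // floorMod keeps the result non-negative even for negative hash codes.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        String[] pumps = {"COHUTTA", "PUMP_A", "PUMP_B"}; // example keys, assumed
        for (String pump : pumps) {
            System.out.println(pump + " -> partition " + partitionFor(pump, 4));
        }
    }
}
```

The same key always lands in the same partition, so events for one pump stay in order, while different keys can be written and read in parallel.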
How do we do this with High Performance at Scale?
•  Parallel, Partitioned = fast, scalable
–  Messaging with MapR Streams
Process the Data with Spark Streaming
[Diagram: Collect Data (Stream Topic, MapR-FS) → Process Data]
•  Extension of the core Spark API
•  Enables scalable, high-throughput, fault-tolerant stream processing of live data
Processing Spark DStreams
Data stream divided into batches of X milliseconds = DStreams
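The idea of cutting a continuous stream into fixed-length batches can be sketched without Spark. In this illustration each event carries a millisecond timestamp and is assigned to the batch whose interval contains it. The names are hypothetical, timestamps are assumed to be non-negative, and Spark Streaming itself groups records by when they are received, so the timestamps here are only a stand-in for arrival time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of micro-batching; not Spark's implementation.
final class MicroBatcher {
    // Groups timestamps (ms) into consecutive batches of batchMs,
    // keyed by the start time of each batch.
    static Map<Long, List<Long>> batch(long[] timestampsMs, long batchMs) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long t : timestampsMs) {
            long batchStart = (t / batchMs) * batchMs;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(t);
        }
        return batches;
    }

    public static void main(String[] args) {
        long[] arrivals = {100, 4900, 5000, 7200, 12000};
        // With 5000 ms batches: [0, 5000), [5000, 10000), [10000, 15000)
        System.out.println(batch(arrivals, 5000));
    }
}
```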
Spark Resilient Distributed Datasets
[Diagram: one RDD partitioned across three worker (W) executors, which hold partitions P4; P1 and P3; and P2]
Partition 1: 8213034705, 95, 2.927373, jake7870, 0…
Partition 2: 8213034705, 115, 2.943484, Davidbresler2, 1…
Partition 3: 8213034705, 100, 2.951285, gladimacowgirl, 58…
Partition 4: 8213034705, 117, 2.998947, daysrus, 95…
Spark revolves around RDDs
•  Read only collection of elements
•  Partitioned across a cluster
•  Operated on in parallel
•  Cached in memory
How do we do this with High Performance at Scale?
•  Parallel, Partitioned = fast, scalable
–  Processing with Spark
Processing Spark DStreams
Transformations → create new RDDs
Two types of operations on DStreams:
•  Transformations:
–  Create new DStreams
–  map, filter, reduceByKey, SQL...
•  Output Operations
[Diagram: the input DStream holds RDD @ time 1 (data from time 0 to 1), RDD @ time 2 (data from time 1 to 2), and RDD @ time 3 (data from time 2 to 3); a transform applied to each RDD yields the corresponding RDD of a new DStream]
Processing Spark DStreams
Output operations → trigger computation
Two types of operations on DStreams:
•  Transformations
•  Output Operations: trigger computation
–  Save to file, HBase...
–  saveAsHadoopFiles
–  saveAsHadoopDataset (called on each RDD, for example inside foreachRDD)
–  saveAsTextFiles
[Diagram: each RDD of the DStream (RDD @ time 1, RDD @ time 2, RDD @ time 3) is mapped and then saved to MapR-FS or MapR-DB]
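The split between transformations that only describe work and output operations that trigger it has a close analogy in Java's own `java.util.stream` API, where intermediate operations such as `map` are lazy and a terminal operation starts the computation. The sketch below uses that analogy only; it is not Spark code, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: Java streams, not Spark DStreams.
final class LazyPipeline {
    // Returns how often the map function ran before and after the terminal operation.
    static int[] mapCallsBeforeAndAfter(List<Integer> input) {
        AtomicInteger calls = new AtomicInteger();
        // "Transformation": declared here, but nothing is computed yet.
        Stream<Integer> mapped = input.stream().map(x -> {
            calls.incrementAndGet();
            return x * 2;
        });
        int before = calls.get();
        // "Output operation": the terminal collect triggers the computation.
        mapped.collect(Collectors.toList());
        int after = calls.get();
        return new int[] {before, after};
    }

    public static void main(String[] args) {
        int[] counts = mapCallsBeforeAndAfter(Arrays.asList(72, 75, 90));
        System.out.println("map calls before: " + counts[0] + ", after: " + counts[1]);
    }
}
```

In the same spirit, a DStream pipeline made only of transformations does no work until an output operation is attached.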
What Do We Need to Do?
[Diagram: Data Sources → Collect Data → Process Data → Store Data → Serve Data, now with the Stream Topic and MapR-FS components filled in]
MapR-DB (HBase API) is Designed to Scale
[Diagram: a table split into three key ranges, each stored as its own partition with columns Key, colB, colC]
Fast reads and writes by key! Data is automatically partitioned by key range!
Store Lots of Data with NoSQL MapR-DB
Storage Model:
•  RDBMS: normalized schema → joins for queries can cause a bottleneck
•  MapR-DB: de-normalized schema → data that is read together is stored together
[Diagram: a MapR-DB table split into three partitions, each with columns Key, colB, colC]
Key to Real Time: Event-based Data Flows
Key to Scale = Parallel Partitioned:
•  Messaging
•  Processing
•  Storage
Use Case Example Code
[Diagram: Sensor → time-stamped data → Stream Topic → read → Spark Streaming (Spark processing) → data for real-time monitoring]
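KafkaProducer
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// MapR Streams topic, written as stream path, colon, topic name
String topic = "/streams/pump:warning";
Properties properties = new Properties();
// The Apache Kafka client requires a key serializer as well as a value serializer
properties.put("key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
// Against an Apache Kafka cluster (rather than MapR Streams), also set bootstrap.servers
// Instantiate KafkaProducer with properties
KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
String txt = "msg text";
ProducerRecord<String, String> rec = new ProducerRecord<>(topic, txt);
producer.send(rec);
producer.close(); // flushes any records that are still buffered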
Create a DStream
DStream: a sequence of RDDs representing a stream of data
val ssc = new StreamingContext(sparkConf, Seconds(5))
val dStream = KafkaUtils.createDirectStream[String, String](ssc, kafkaParams, topicsSet)
[Diagram: dStream as a sequence of batches (time 0 to 1, time 1 to 2, time 2 to 3), each stored in memory as an RDD]
Process DStream
val sensorDStream = dStream.map(_._2).map(parseSensor)
[Diagram: map is applied to each batch RDD of dStream (time 0 to 1, time 1 to 2, time 2 to 3); new RDDs are created for every batch and form sensorDStream]
Message Data to Sensor Object
// One sensor reading, parsed from a comma-separated message
case class Sensor(resid: String, date: String, time: String,
  hz: Double, disp: Double, flo: Double, sedPPM: Double,
  psi: Double, chlPPM: Double)

// Split the message on commas and convert the numeric fields
def parseSensor(str: String): Sensor = {
  val p = str.split(",")
  Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
    p(6).toDouble, p(7).toDouble, p(8).toDouble)
}
DataFrame and SQL Operations
// For each RDD (one per batch interval)
sensorDStream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._ // needed for rdd.toDF()
  rdd.toDF().registerTempTable("sensor")
  // A triple-quoted string lets the query span several lines
  val res = sqlContext.sql(
    """SELECT resid, date,
      |  max(hz) as maxhz, min(hz) as minhz, avg(hz) as avghz,
      |  max(disp) as maxdisp, min(disp) as mindisp, avg(disp) as avgdisp,
      |  max(flo) as maxflo, min(flo) as minflo, avg(flo) as avgflo,
      |  max(psi) as maxpsi, min(psi) as minpsi, avg(psi) as avgpsi
      |FROM sensor GROUP BY resid, date""".stripMargin)
  res.show()
}
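To make clear what the GROUP BY query computes for each batch, here is a plain-Java sketch of the same kind of aggregation (max, min, and average per pump and date) for a single field, psi. It uses only the JDK, not Spark SQL. The `Reading` class, the method name, and the sample values are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the per-group statistics; not Spark SQL.
final class SensorStats {
    // A cut-down reading with only the fields this sketch needs.
    static final class Reading {
        final String resid;
        final String date;
        final double psi;

        Reading(String resid, String date, double psi) {
            this.resid = resid;
            this.date = date;
            this.psi = psi;
        }
    }

    // Groups readings by (resid, date) and accumulates min, max, and average psi.
    static Map<String, DoubleSummaryStatistics> psiByPumpAndDate(List<Reading> readings) {
        Map<String, DoubleSummaryStatistics> stats = new TreeMap<>();
        for (Reading r : readings) {
            String group = r.resid + "_" + r.date;
            stats.computeIfAbsent(group, k -> new DoubleSummaryStatistics()).accept(r.psi);
        }
        return stats;
    }

    public static void main(String[] args) {
        List<Reading> batch = Arrays.asList(
            new Reading("COHUTTA", "3/10/14", 84.0), // sample values, assumed
            new Reading("COHUTTA", "3/10/14", 80.0));
        psiByPumpAndDate(batch).forEach((group, s) ->
            System.out.println(group + " max=" + s.getMax()
                + " min=" + s.getMin() + " avg=" + s.getAverage()));
    }
}
```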
Streaming Application Output
Save to HBase
rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)
[Diagram: each batch RDD of the lines DStream (time 0 to 1, time 1 to 2, time 2 to 3) is mapped to the sensor DStream and then saved; the output operation persists data to external storage, and Put objects are written to HBase]
Start Receiving Data
sensorDStream.foreachRDD { rdd =>
. . .
}
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()
Building a Complete Data Architecture
[Diagram: the MapR Converged Data Platform combines MapR File System (MapR-FS), MapR Database (MapR-DB), and MapR Streams, and connects Sources/Apps, Bulk Processing, and Stream Processing]
To Learn More:
•  Read an explanation of the code and download it:
–  https://www.mapr.com/blog/fast-scalable-streaming-applications-mapr-streams-spark-streaming-and-mapr-db
–  https://www.mapr.com/blog/spark-streaming-hbase
•  http://learn.mapr.com/
Q&A
Engage with us!
@mapr
@caroljmcdonald
https://www.mapr.com/blog/author/carol-mcdonald
mapr-technologies

Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API and the HBase API
