@tgrall#Devoxx #sparkstreaming
Build a Time Series Application
with Spark and HBase
Tugdual Grall
@tgrall
MapR
Carol McDonald
@caroljmcdonald
MapR
Agenda
• Time Series
• Apache Spark & Spark Streaming
• Apache HBase
• Lab
About the Lab
• Use Spark & HBase in MapR Cluster
• Option 1: Use a sandbox (VirtualBox VM located on the USB key)
• Option 2: Use Cloud Instance (SSH/SCP only)
• Content:
• Option 1: spark-streaming-hbase-workshop.zip on USB
• Option 2: download zip from
https://github.com/tgrall/spark-streaming-hbase-workshop
Time Series
What is a Time Series?
• Stuff with timestamps
• sensor measurements
• system stats
• log files
• ….
Got Some Examples?
What do we need to do?
• Acquire
• Measurement, transmission, reception
• Store
• Individually, or grouped for some amount of time
• Retrieve
• Ad hoc, flexible, correlate and aggregate
• Analyze and visualize
• We facilitate this via retrieval
Acquisition
Not usually our problem
• Sensors
• Data collection – agents, raspberry pi
• Transmission – via LAN/WAN, mobile network, satellites
• Receipt into system – listening daemon or queue, or, depending on the use case, writing directly to the database
Storage Choice
• Flat files
• Great for rapid ingest with massive data
• Handles essentially any data type
• Less good for data requiring frequent updates
• Harder to find specific ranges
• Traditional RDBMS
• Ingests up to ~10,000 rows/sec; prefers well-structured (numerical) data; expensive
• NoSQL (such as MapR-DB or HBase)
• Easily handles 10,000 rows/sec/node – true linear scaling
• Handles wide variety of data
• Good for frequent updates
• Easily scanned in a range
Specific Example
Consider oil drilling rigs
• When drilling wells, there are *lots* of moving parts
• Typically a drilling rig produces about 10K samples/s
• Temperatures, pressures, magnetics, machine vibration
levels, salinity, voltage, currents, many others
• Typical project has 100 rigs
General Outline
10K samples / second / rig
x 100 rigs
= 1M samples / second
• But wait, there’s more
• Suppose you want to test your system
• Perhaps with a year of data
• And you want to load that data in << 1 year
• 100x real-time = 100M samples / second
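The arithmetic above can be checked with a few lines of Scala. This is only a sketch of the slide's numbers; the object and value names are made up for illustration, and the "days to load one year" figure is an extra derived from the 100x assumption.

```scala
object IngestRate {
  // Figures taken from the slide
  val samplesPerSecondPerRig: Long = 10000L // 10K samples / second / rig
  val rigs: Long = 100L                     // typical project size
  val replaySpeedup: Long = 100L            // load test at 100x real-time

  // 10K x 100 rigs = 1M samples / second
  val realTimeRate: Long = samplesPerSecondPerRig * rigs

  // 100x real-time = 100M samples / second
  val replayRate: Long = realTimeRate * replaySpeedup

  // At 100x, one year of data (365 days) loads in 365 / 100 days
  val daysToLoadOneYear: Double = 365.0 / replaySpeedup

  def main(args: Array[String]): Unit = {
    println(s"real-time rate: $realTimeRate samples/s")
    println(s"replay rate:    $replayRate samples/s")
    println(s"one year loads in about $daysToLoadOneYear days")
  }
}
```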
Data Storage
• Typical time window is one hour
• Column names are offsets in time window
• Find series-uid in separate table
Key                       13    43    73    103   …
…
series-uid.time-window    4.5   5.2   6.1   4.9
…
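A minimal sketch of how a sample could be mapped to this layout, assuming timestamps in epoch seconds, a one-hour window, and a row key of the form series-uid.window-start. The key format, the unit of the column offsets, and all names below are assumptions for illustration, not the schema used in the lab.

```scala
object TimeWindowKey {
  // One-hour window, as on the slide; seconds as the unit is an assumption
  val WindowSeconds: Long = 3600L

  // Row key: "<series-uid>.<start of the window containing the timestamp>"
  // Assumes non-negative epoch seconds.
  def rowKey(seriesUid: String, epochSeconds: Long): String = {
    val windowStart = epochSeconds - (epochSeconds % WindowSeconds)
    s"$seriesUid.$windowStart"
  }

  // Column name: offset of the sample inside its window
  def columnOffset(epochSeconds: Long): Long = epochSeconds % WindowSeconds

  def main(args: Array[String]): Unit = {
    // Hypothetical samples taken 30 seconds apart
    val samples = Seq(7213L -> 4.5, 7243L -> 5.2, 7273L -> 6.1, 7303L -> 4.9)
    samples.foreach { case (ts, value) =>
      println(s"row=${rowKey("pump-1", ts)} column=${columnOffset(ts)} value=$value")
    }
  }
}
```

With this layout all samples of one series and one hour land in a single row, so reading an hour of data is one get and reading a longer period is a short scan.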
Why do we need NoSQL / HBase?
Storage model: RDBMS vs. HBase
• RDBMS (relational model): distributed joins and transactions do not scale, which creates a bottleneck
• HBase: data that is accessed together is stored together
HBase is a ColumnFamily oriented Database
• Data is accessed and stored together:
• RowKey is the primary index
• Column Families group similar data by row key
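[Figure: a sparse table keyed by RowKey (series-abc.time-window, series-efg.time-window) with two column families, CF_DATA (labeled "Raw Data") and CF_STATS (labeled "Stats"), each holding columns colA, colB, colC; the row key column is labeled "Customer id"]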
HBase is a Distributed Database
[Figure: a table split into three key ranges; each range holds the column families CF1 and CF2 (columns colA, colB, colC) for its rows]
• Put and Get by key
• Data is automatically distributed across the cluster
• Key range is used for horizontal partitioning
Basic Table Operations
• Create the table and define its column families before data is imported
• but not the row keys or the number/names of columns
• Low-level API, technically more demanding
• Basic data access operations (CRUD):
put     Inserts data into rows (both create and update)
get     Accesses data from one row
scan    Accesses data from a range of rows
delete  Deletes a row, a range of rows, or columns
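To make the semantics of the four operations concrete, here is a small in-memory stand-in built on a sorted map. It is not the HBase client API; the object, types, and method signatures are invented for illustration. It only mirrors two properties described above: rows are kept sorted by row key, and a scan reads a contiguous key range (start inclusive, stop exclusive).

```scala
import scala.collection.immutable.TreeMap

object MiniTable {
  type Row = Map[String, String]     // column name -> value
  type Table = TreeMap[String, Row]  // row key -> row, sorted by row key

  val empty: Table = TreeMap.empty[String, Row]

  // put: creates the row if missing, otherwise adds or overwrites the column
  def put(t: Table, rowKey: String, column: String, value: String): Table =
    t.updated(rowKey, t.getOrElse(rowKey, Map.empty[String, String]) + (column -> value))

  // get: reads exactly one row
  def get(t: Table, rowKey: String): Option[Row] = t.get(rowKey)

  // scan: reads every row with startRow <= key < stopRow, in key order
  def scan(t: Table, startRow: String, stopRow: String): Seq[(String, Row)] =
    t.range(startRow, stopRow).toSeq

  // delete: removes a whole row
  def delete(t: Table, rowKey: String): Table = t - rowKey

  def main(args: Array[String]): Unit = {
    val t1 = put(empty, "series-abc.0", "13", "4.5")
    val t2 = put(t1, "series-abc.3600", "43", "5.2")
    val t3 = put(t2, "series-efg.0", "13", "6.1")
    // All windows of series-abc: keys from "series-abc" up to (not including) "series-abd"
    scan(t3, "series-abc", "series-abd").foreach(println)
  }
}
```

Because the series id is the key prefix, one range scan returns all time windows of a series without touching other series.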
Learn More
• Free Online Training: http://learn.mapr.com
• DEV 320 - Apache HBase Data Model and Architecture
• DEV 325 - Apache HBase Schema Design
• DEV 330 - Developing Apache HBase Applications: Basics
• DEV 335 - Developing Apache HBase Applications: Advanced
What is Spark?
• Cluster Computing Platform
• Extends the “MapReduce” model
• Streaming
• Interactive Analytics
• Run in Memory
What is Spark?
Fast
• Up to 100x faster than MapReduce (M/R) when data fits in memory
[Chart: logistic regression in Hadoop and Spark]
What is Spark?
Ease of Development
• Write programs quickly
• More Operators
• Interactive Shell
• Less Code
What is Spark?
Multi Language Support
• Scala
• Python
• Java
• SparkR
What is Spark?
Deployment Flexibility
• Deployment
• Local
• Standalone
• YARN
• Mesos
• Storage
• HDFS
• MapR-FS
• S3
• Cassandra
Unified Platform
• Spark SQL
• Spark Streaming (streaming)
• MLlib (machine learning)
• GraphX (graph computation)
• Spark Core (general execution engine), underneath all of the above
Spark Components
[Figure: the driver program (application) holds a SparkContext, which connects to the cluster manager; each worker runs an executor, and each executor runs tasks]
Spark Resilient Distributed Datasets
[Figure: sc.textFile loads the file into the Sensor RDD, whose partitions P1 to P4 are spread across the executors of three workers (W)]
Sample records per partition:
P1: 8213034705, 95, 2.927373, jake7870, 0…
P2: 8213034705, 115, 2.943484, Davidbresler2, 1…
P3: 8213034705, 100, 2.951285, gladimacowgirl, 58…
P4: 8213034705, 117, 2.998947, daysrus, 95…
Spark Resilient Distributed Datasets
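• Transformation, such as filter(): takes an RDD and returns a new RDD
• Action, such as count(): takes an RDD and returns a value
RDD → filter() → newRDD → count() → value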
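The split between transformations and actions can be imitated locally with a lazy Scala view. This is an analogy only, not Spark code: a view is neither distributed nor fault tolerant, and the names and readings below are hypothetical.

```scala
object TransformThenAct {
  // "Transformation": builds a lazy view; no reading is inspected yet.
  def above(readings: Seq[Double], threshold: Double): Iterable[Double] =
    readings.view.filter(_ > threshold)

  // "Action": forces evaluation and returns a plain value.
  def countAbove(readings: Seq[Double], threshold: Double): Int =
    above(readings, threshold).size

  def main(args: Array[String]): Unit = {
    val readings = Seq(95.0, 115.0, 100.0, 117.0) // hypothetical readings
    println(countAbove(readings, 100.0))          // prints 2
  }
}
```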
Spark Streaming
What is Streaming?
• Data Stream:
• Unbounded sequence of data arriving continuously
• Stream processing:
• Low latency processing, querying, and analyzing of real time
streaming data
Why Spark Streaming
• Many applications must process
streaming data
• With the following Requirements:
• Results in near-real-time
• Handle large workloads
• Latencies of a few seconds
• Use Cases
• Website statistics, monitoring
• IoT
• Fraud detection
• Social network trends
• Advertising click monetization
[Figure: time-stamped data (sensors, system metrics, events, log files, stock tickers, user activity; high volume and velocity) is written with put operations and becomes data for real-time monitoring]
What is Spark Streaming?
• Enables scalable, high-throughput, fault-tolerant stream
processing of live data
• Extension of the core Spark API
[Figure: data sources → Spark Streaming → data sinks]
Spark Streaming Architecture
• Divide data stream into batches of X seconds
• Called DStream = sequence of RDDs
[Figure: input data stream → Spark Streaming → DStream of RDD batches, one per batch interval: data from time 0 to 1 becomes RDD @ time 1, data from time 1 to 2 becomes RDD @ time 2, data from time 2 to 3 becomes RDD @ time 3]
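The batching idea can be sketched without Spark by grouping time-stamped events into fixed intervals. Note the simplification: Spark Streaming assigns records to a batch by when they arrive, while this sketch groups by a timestamp carried in each event. Event, toBatches, and the sample data are illustrative names and values.

```scala
object MicroBatch {
  final case class Event(timestampSeconds: Long, payload: String)

  // Batch k holds the events with k * interval <= timestamp < (k + 1) * interval.
  // Assumes non-negative timestamps. Returns the batches ordered by batch index.
  def toBatches(events: Seq[Event], intervalSeconds: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(_.timestampSeconds / intervalSeconds)
      .toSeq
      .sortBy(_._1)

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(0, "a"), Event(1, "b"), Event(2, "c"), Event(3, "d"), Event(5, "e"))
    // With a 2-second interval: batch 0 = a, b; batch 1 = c, d; batch 2 = e
    toBatches(events, 2L).foreach { case (batch, evs) =>
      println(s"batch $batch: ${evs.map(_.payload).mkString(", ")}")
    }
  }
}
```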
Process DStream
• Process using transformations
• Each transformation creates new RDDs
• Example operations: map, reduceByKey, count
[Figure: each RDD of the input DStream (RDD @ time 1, RDD @ time 2, RDD @ time 3) is transformed into the corresponding RDD of a new DStream]
Time Series
[Figure: sensor → time-stamped data → processing → HBase → read → data for real-time monitoring]
Lab “flow”
Convert Line of CSV data to Sensor Object
// One sensor reading; fields follow the CSV column order.
case class Sensor(resid: String, date: String, time: String,
  hz: Double, disp: Double, flo: Double, sedPPM: Double,
  psi: Double, chlPPM: Double)

// Parse one comma-separated line into a Sensor.
def parseSensor(str: String): Sensor = {
  val p = str.split(",")
  Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
    p(6).toDouble, p(7).toDouble, p(8).toDouble)
}
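A usage sketch for the parser. The case class and parseSensor are repeated from the slide so the snippet stands alone; the sample line is made-up data in the expected nine-column shape, and parseSensorOpt is a hypothetical helper (not part of the lab code) showing one way to skip malformed lines instead of failing the batch.

```scala
import scala.util.Try

object SensorParsing {
  case class Sensor(resid: String, date: String, time: String,
    hz: Double, disp: Double, flo: Double, sedPPM: Double,
    psi: Double, chlPPM: Double)

  // Same logic as on the slide
  def parseSensor(str: String): Sensor = {
    val p = str.split(",")
    Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
      p(6).toDouble, p(7).toDouble, p(8).toDouble)
  }

  // Hypothetical helper: None when a line has too few fields or a non-numeric value
  def parseSensorOpt(str: String): Option[Sensor] = Try(parseSensor(str)).toOption

  def main(args: Array[String]): Unit = {
    val line = "COHUTTA,3/10/14,1:01,10.27,1.73,881,1.56,85,1.94" // illustrative values
    println(parseSensor(line))
    println(parseSensorOpt("bad,line")) // prints None
  }
}
```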
Create a DStream
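import org.apache.spark.streaming.{Seconds, StreamingContext}
// 2-second batch interval; new files in the directory become stream records
val ssc = new StreamingContext(sparkConf, Seconds(2))
val linesDStream = ssc.textFileStream("/mapr/stream")
[Figure: linesDStream as a sequence of batches (time 0-1, time 1-2, …), each stored in memory as an RDD]
DStream: a sequence of RDDs representing a stream of data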
Process DStream
val linesDStream = ssc.textFileStream("directory path")
// map applies parseSensor to every line of every batch
val sensorDStream = linesDStream.map(parseSensor)
[Figure: map turns each batch RDD of linesDStream into the corresponding RDD of sensorDStream; new RDDs are created for every batch]
Save to HBase
rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)
[Figure: for each batch, the sensor RDD is mapped to Put objects, which are saved to HBase]
Output operation: persist data to external storage
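The slide does not show what convertToPut does. As a rough sketch of the idea, the snippet below turns a Sensor into a row key plus a set of column values, using a plain case class as a stand-in for an HBase Put. The row key format, the "data" column family, and the PutLike type are assumptions for illustration, not the lab's actual code.

```scala
object PutSketch {
  final case class Sensor(resid: String, date: String, time: String,
    hz: Double, disp: Double, flo: Double, sedPPM: Double,
    psi: Double, chlPPM: Double)

  // Stand-in for an HBase Put: one row key, one column family, column -> value
  final case class PutLike(rowKey: String, family: String, cells: Map[String, String])

  // Assumed layout: row key "<resid>_<date> <time>", all measurements in family "data"
  def convertToPutLike(s: Sensor): PutLike =
    PutLike(
      rowKey = s"${s.resid}_${s.date} ${s.time}",
      family = "data",
      cells = Map(
        "hz" -> s.hz.toString,
        "disp" -> s.disp.toString,
        "flo" -> s.flo.toString,
        "sedPPM" -> s.sedPPM.toString,
        "psi" -> s.psi.toString,
        "chlPPM" -> s.chlPPM.toString
      )
    )

  def main(args: Array[String]): Unit = {
    val s = Sensor("COHUTTA", "3/10/14", "1:01", 10.27, 1.73, 881.0, 1.56, 85.0, 1.94)
    println(convertToPutLike(s))
  }
}
```

Starting the row key with the sensor id keeps all readings of one sensor adjacent, so they can be read back with a single range scan.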
Learn More
• Free Spark Online Training: http://learn.mapr.com
