Eric Frenkiel, MemSQL CEO and co-founder
August 11, 2015 • San Diego, CA
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
What’s In Store
• MemSQL and a fresh look at Lambda architectures
• Building real-time data pipelines for immediate impact
• One architecture for many applications
MemSQL at a Glance
• Enable every company to be a real-time enterprise
• Founded 2011, based in San Francisco
• Founders are ex-Facebook, SQL Server engineers
• Deliver a database technology for modern architecture
Enterprise Focus
The Real-Time Database for Transactions and Analytics
In-Memory, Distributed, Relational
Data Center, Software, Cloud
Speed: Fast Updates
Serving: Unified queries, full SQL
Batch: Fast Appends
A Fresh Look at Lambda Architectures
Comprehensive Architecture
• Real Time: Speed/Streaming Layer, Fast Updates, Rowstore
• Historical: Batch Layer, Fast Appends, Columnstore
• Analytics, Transactions
Execution engine that spans the data spectrum
Building Real-Time Data Pipelines for Immediate Impact
By 2020, HP predicts that over a trillion sensors will be online
“The Internet of Things Will Drastically Change Our Future” – Datafloq
Going Real-Time is the Next Phase for Big Data
• More Devices
• More Interconnectivity
• More User Demand
…and companies are at risk of being left behind
Expensive
Not scalable
Batch only
SAN-burdened
1%
Success will be driven by real-time analytic applications.
Designing the Ideal Real-Time Pipeline
Message Queue → Transformation → Speed/Serving Layer
End-to-End Data Pipeline Under One Second
Kafka
• A high-throughput distributed messaging system
• Publish and subscribe to Kafka “topics”
• Centralized data transport for the organization
Spark
• In-memory execution engine
• High level operators for procedural and programmatic analytics
• Faster than MapReduce
MemSQL
• In-memory, distributed database
• Full transactions and complete durability
• Enable real-time, performant applications
Use Spark and Operational Databases Together
                      | Spark              | Operational Databases
Interface             | Programmatic       | Declarative
Execution Environment | Job Scheduler      | SQL Engine and Query Optimizer
Persistent Storage    | Use another system | Built-in
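The interface contrast in the comparison above can be illustrated with a small sketch. Plain Python stands in for programmatic (Spark-style) processing, and the standard-library sqlite3 module stands in for an operational SQL database. Neither is the Spark or MemSQL API, and the readings, table name, and function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical readings: (device_type, watts). Not data from the talk.
readings = [
    ("kitchen_appliance", 60),
    ("kitchen_appliance", 40),
    ("lighting", 15),
]


def total_watts_programmatic(rows):
    """Programmatic style: the caller spells out how to aggregate."""
    totals = defaultdict(int)
    for device_type, watts in rows:
        totals[device_type] += watts
    return dict(totals)


def total_watts_declarative(rows):
    """Declarative style: state the result in SQL; the engine plans it."""
    conn = sqlite3.connect(":memory:")  # sqlite3 stands in for the database
    conn.execute("CREATE TABLE readings (device_type TEXT, watts INTEGER)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    result = dict(
        conn.execute(
            "SELECT device_type, SUM(watts) FROM readings GROUP BY device_type"
        )
    )
    conn.close()
    return result
```

Both functions return the same totals. The difference is who decides how the aggregation runs: the caller's code in the first case, the SQL engine in the second.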
Subscribing to Kafka
(2015-07-06T16:43:40.33Z, 329280, 23, 60)
0111001010101111101111100000001010
111100001110101100000010010010111…
Publish to Kafka Topic
0111001010101111101111100000001010
111100001110101100000010010010111…
1110010101000101010001010100010111
111010100011110101100011010101000…
0101111000011100101010111110001111
011010111100000000101110101100000…
Event added to message queue
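A minimal sketch of the publish step, assuming JSON as the wire format (the slides show only opaque bytes) and an in-memory queue as a stand-in for a Kafka topic. The topic name `memcity_events` and the functions `serialize` and `publish` are hypothetical and are not part of any Kafka client library.

```python
import json
from collections import defaultdict, deque

# In-memory stand-in for a Kafka cluster: topic name -> queue of byte messages.
topics = defaultdict(deque)


def serialize(event):
    """Encode an event tuple as bytes. JSON is an assumption; the slide
    does not say which wire format the demo used."""
    time, house_id, device_id, watts = event
    record = {"time": time, "house_id": house_id,
              "device_id": device_id, "watts": watts}
    return json.dumps(record).encode("utf-8")


def publish(topic, event):
    """Append the serialized event to the end of the topic's queue."""
    topics[topic].append(serialize(event))


# "memcity_events" is a hypothetical topic name.
publish("memcity_events", ("2015-07-06T16:43:40.33Z", 329280, 23, 60))
```

A real deployment would replace the `topics` dictionary with a Kafka producer client. The sketch only shows the shape of the step: event tuple in, bytes appended to a named queue.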
Enrich and Transform the Data
Spark polling Kafka for new messages
(2015-07-06T16:43:40.33Z, 329280, 23, 60)
(2015-07-06T16:43:40.33Z, 329280, 94110, 23, ‘kitchen_appliance’, 60)
Deserialization
Enrichment
0111001010101111101111100000001010
111100001110101100000010010010111…
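The deserialization and enrichment steps can be sketched in plain Python. This is an illustration under stated assumptions, not the Spark job from the talk: the JSON encoding, the lookup tables `HOUSE_ZIP` and `DEVICE_TYPE`, and the function names are all hypothetical. Only the input and output tuples come from the slides.

```python
import json

# Hypothetical lookup tables; in the talk's example the enrichment adds a
# zip code and a device type, but the source of those values is not stated.
HOUSE_ZIP = {329280: 94110}
DEVICE_TYPE = {23: "kitchen_appliance"}


def deserialize(message):
    """Decode a byte message into (time, house_id, device_id, watts).
    JSON encoding is an assumption."""
    record = json.loads(message.decode("utf-8"))
    return (record["time"], record["house_id"],
            record["device_id"], record["watts"])


def enrich(event):
    """Add zip and device_type, matching the enriched tuple on the slide."""
    time, house_id, device_id, watts = event
    return (time, house_id, HOUSE_ZIP[house_id], device_id,
            DEVICE_TYPE[device_id], watts)


raw = json.dumps({"time": "2015-07-06T16:43:40.33Z", "house_id": 329280,
                  "device_id": 23, "watts": 60}).encode("utf-8")
enriched = enrich(deserialize(raw))
```

In a Spark job the same two functions would typically be applied as map operations over each batch of messages.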
Persist and Prepare for Production
RDD.saveToMemSQL()
INSERT INTO memcity_table ...
time                    | house_id | zip   | device_id | device_type         | watts
2015-07-06T16:43:40.33Z | 329280   | 94110 | 23        | ‘kitchen_appliance’ | 60
…                       | …        | …     | …         | …                   | …
Go to Production
Compress development timelines
SELECT ... FROM memcity_table ...
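A rough sketch of the persist-then-query flow, using the standard-library sqlite3 module as a stand-in for MemSQL. It does not use the saveToMemSQL() connector, and the statements below are illustrative rather than the elided INSERT and SELECT from the slides. Column names follow the table shown on the slide; the second row and the aggregate query are hypothetical.

```python
import sqlite3

# sqlite3 (standard library) stands in for MemSQL; this is not the
# saveToMemSQL() connector. Column names follow the table on the slide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memcity_table ("
    "time TEXT, house_id INTEGER, zip INTEGER, "
    "device_id INTEGER, device_type TEXT, watts INTEGER)"
)

# The first row is the enriched event from the slides; the second is hypothetical.
rows = [
    ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60),
    ("2015-07-06T16:43:41.33Z", 329280, 94110, 7, "lighting", 15),
]
conn.executemany("INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

# Hypothetical application query: total draw per zip code.
per_zip = conn.execute(
    "SELECT zip, SUM(watts) FROM memcity_table GROUP BY zip"
).fetchall()
```

Because the application reads with ordinary SQL, the query can change without touching the ingest path, which is one way to read the "compress development timelines" point.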
One Architecture for Many Applications
Lambda Applies to Real-Time Data Pipelines
[Diagram: Inputs, Message Queue, Batch, Transformation, Database, Application]
Kafka, Spark, and MemSQL Make it Simple
[Diagram: Inputs, Batch, Application]
Monitoring real-time Xfinity programming and video health
• Collect streaming data at scale (hundreds of MemSQL machines)
• Proactively diagnose issues
• Query ad-hoc and in real-time with full SQL
From 30 minutes to less than 1 second
Real-time Analytics
Real-Time Trend Analytics
Massive Ingest and Concurrent Analytics
• Instant accuracy to the latest repin
• Build real-time analytic applications
Real-time analytics
Watch the Pinterest Demo Video here: https://youtu.be/KXelkQFVz4E
Real-Time Segmentation
Using Real-Time for Personalization
[Diagram: Ad Servers, EC2, Real-time analytics, PostgreSQL, Legacy reports, Monitoring, S3 (replay), HDFS, Data Science, Vertica, Operational Data Store (ODS), Star Schema, MicroStrategy]
• Reach overlap and ad optimization
• Over 60,000 queries per second
• Millisecond response times
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
Thank You!
Visit MemSQL at Booth #518
Real-Time Demos, T-Shirt Giveaway, Games


Editor's Notes

  • #14: Sensors are being integrated into our cars, our phones, and our medical devices; trillions of sensors will affect many facets of our lives. “HP expects that by 2020 a trillion sensors are needed in the world, the equivalent of 150 sensors per human. Sensors will end-up in anything imaginable” (https://datafloq.com/read/internet-of-things-with-trillions-of-sensors-will-/218). Gartner: 25 billion connected things will be in use in 2020, up from 4.9 billion in 2015 (http://www.gartner.com/newsroom/id/2905717). HP’s Peter Hartwell: “one trillion nanoscale sensors and actuators will need the equivalent of 1000 internets: the next huge demand for computing!”