How Kafka and Modern
Databases Benefit Apps
and Analytics
1
Neil Dahlke, Sr. Sales Engineer, San Francisco
August 20, 2018
2
● Intro
● Possible Solutions
● New Data Architecture
● Scalable SQL
● CREATE PIPELINE
● Demo
● Q&A
Agenda
Intro
3
AT MEMSQL
Sr. Sales Engineer, San Francisco
BEFORE MEMSQL
Worked on the Globus project at the
University of Chicago
PREVIOUS TALKS
Real Time, Geospatial, Maps
Image Recognition on Streaming
Real Time w/ Spark & MemSQL
4
Who am I?
5
“Companies with data-driven environments
have up to 50% higher market value than
other businesses.”
6
Organizations want more of their data to
support faster decisions and optimize customer
experiences
This puts pressure on database
performance and scalability, but teams do not
want to sacrifice familiar tooling and skills
Data-Driven Requirements Driving
Database Modernization
7 Businesses Require Intra-Day
Slow Data Loading
Batch processing
Hours to load
Sampled data views
8 Growing Data Slows Performance
Lengthy Query Execution
Slow query responses
Slow reports
No real-time response
9 Data Access Requirements Surging
Limited User Access
Single threaded operations
Challenge with mixed workloads
Single box performance
10 Multi / Hybrid Cloud Strategy
● Existing solutions have unclear path
to cloud
● Data growing exponentially year
over year
● Still managing on-premises data
● Requires database to run anywhere
Possible Solutions
11
More CPUs
or memory
Specialized
HW racks
Database
Options
Boosting hardware or adding more DB options introduces cost
12 Double Down on Existing Database
Adding data grids, caches, and accelerators introduces complexity
13 Introduce Caching Tiers
Limited data
durability
Weak SQL
coverage
Another layer
To manage
14 Try Object Store based NoSQL Solutions
Slow performing
analytics
Developer
intensive queries
Breaks BI tool
compatibility
15 Latency Holding Back the Enterprise
Lengthy Query Execution
Slow query responses
Slow reports
No real-time response
Limited User Access
Single threaded operations
Challenge with mixed workloads
Single box performance
Slow Data Loading
Batch processing
Hours to load
Sampled data views
16 The Enterprise Requires Performance
Fast Queries
Scalable ANSI SQL
Petabyte scale
Live and historical insights
Scalable User Access
Scale-out for performance
Converged transactions and analytics
Multi-threaded processing
Live Loading
Stream data
On-the-fly transformation
Multiple sources
17 MemSQL: The No Limits Database
For Every Workload
and Infrastructure
On-premises or any cloud
Transactions and analytics
Familiar, standard
scalable SQL
Distributed architecture
Relational ANSI SQL
Performance for
Demanding
Applications
Fast ingest
Low-latency queries
18 Ecosystem Overview
● High-speed ingest, memory-optimized rowstore, disk-optimized columnstore
● Real-time data messaging and transforms: Kafka, Spark
● Data inputs: relational, Hadoop, Amazon S3
● Real-time applications; BI dashboards (Tableau, Looker, MicroStrategy)
● Data models: relational, key-value, document, geospatial
● Deployment: bare metal, virtual machines, containers; on-premises, multi-cloud, hybrid cloud
New Data
Architecture
19
MemSQL: The No-Limits Database
● Massive Scale
● Query Performance
● High Concurrency
The transactional scale of
NoSQL with familiar
relational SQL for fast
analytics
Scalable
SQL
28
MemSQL is a database, a Linux daemon
./memsqld
MemSQL is a distributed system
./memsqld ./memsqld ./memsqld
Aggregators Aggregate
Leaves Hold Partitions and Process Data
[Diagram: one Aggregator above two Leaf nodes, each Leaf holding PARTITIONS]
Aggregators interact with clients
and leverage leaf nodes
aggregator-1> create database foo;
Query OK, 1 row affected (5.48 sec)
[Diagram: Database Client, Aggregator, and two Leaf nodes, each holding PARTITIONS]
leaf-2> show databases;
+--------------------+
| Database           |
+--------------------+
| cluster            |
| foo                |
| foo_1              |
| foo_3              |
| foo_5              |
| foo_7              |
| foo_9              |
| foo_11             |
| information_schema |
| memsql             |
+--------------------+
10 rows in set (0.01 sec)
Leaves store a partition per core on
the machine (by default)
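The slide shows database foo split into partition databases (foo_1, foo_3, and so on) but does not say how rows are routed to them. A minimal sketch, assuming rows are placed by hashing a shard key: the CRC32 hash, the key names, and the partition count of 8 are all illustrative assumptions, not MemSQL internals.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def partition_for(shard_key: str, num_partitions: int) -> int:
    """Map a key to a partition ordinal with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(shard_key.encode("utf-8")) % num_partitions


def distribute(keys: List[str], num_partitions: int) -> Dict[int, List[str]]:
    """Group keys by the partition they hash to."""
    placement: Dict[int, List[str]] = defaultdict(list)
    for key in keys:
        placement[partition_for(key, num_partitions)].append(key)
    return dict(placement)


# Hypothetical shard keys; 8 partitions stands in for "one per core".
keys = [f"order-{n}" for n in range(20)]
placement = distribute(keys, 8)
for ordinal in sorted(placement):
    print(ordinal, placement[ordinal])
```

Because the hash is deterministic, the same key always lands on the same partition, which is what lets a node find a row without asking every leaf.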
aggregator-1> SELECT avg(price) FROM orders;
...
leaf-1> using memsql_demo_9 SELECT count(1), sum(price) FROM orders;
...
leaf-2> using memsql_demo_17 SELECT count(1), sum(price) FROM orders;
...
Massively parallel processing (MPP)
across all the leaf nodes for query
execution
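The rewrite above, where avg(price) on the aggregator becomes count(1) and sum(price) on each leaf partition, can be sketched as follows. This is an illustration of the idea only; the function names and sample prices are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

Partial = Tuple[int, float]  # (row count, sum of price) for one partition


def leaf_partial(prices: Iterable[float]) -> Partial:
    """What a leaf returns for one partition: count(1), sum(price)."""
    values = list(prices)
    return len(values), float(sum(values))


def aggregate_avg(partials: List[Partial]) -> Optional[float]:
    """What the aggregator does: merge partial results into avg(price)."""
    total_count = sum(count for count, _ in partials)
    total_sum = sum(subtotal for _, subtotal in partials)
    if total_count == 0:
        return None  # SQL AVG over zero rows is NULL
    return total_sum / total_count


# Hypothetical contents of two partitions.
partition_a = [10.0, 20.0, 30.0]
partition_b = [40.0]

partials = [leaf_partial(partition_a), leaf_partial(partition_b)]
average = aggregate_avg(partials)
print(partials, average)
```

Shipping counts and sums rather than per-partition averages matters: averaging the two partition averages here (20 and 40) would give 30, while the true average over all four rows is 25.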
aggregator-1> ADD LEAF leaf-3…
aggregator-1> REBALANCE PARTITIONS;
aggregator-1> ADD LEAF leaf-4…
aggregator-1> REBALANCE PARTITIONS;
[Diagram: Database Client, Aggregator, and four Leaf nodes, each holding PARTITIONS]
Scale up and down on the fly
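To illustrate what rebalancing after ADD LEAF has to achieve, here is a minimal sketch that evens out partition counts across leaves. It is a greedy toy, not MemSQL's actual REBALANCE PARTITIONS algorithm; the leaf names and partition numbers are hypothetical.

```python
from typing import Dict, List


def rebalance(assignment: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Move partitions from the fullest leaf to the emptiest until counts differ by at most one."""
    result = {leaf: list(parts) for leaf, parts in assignment.items()}
    if not result:
        return result
    while True:
        fullest = max(result, key=lambda leaf: len(result[leaf]))
        emptiest = min(result, key=lambda leaf: len(result[leaf]))
        if len(result[fullest]) - len(result[emptiest]) <= 1:
            return result
        result[emptiest].append(result[fullest].pop())


# Two original leaves hold all eight partitions; two new leaves start empty.
before = {
    "leaf-1": [0, 1, 2, 3],
    "leaf-2": [4, 5, 6, 7],
    "leaf-3": [],
    "leaf-4": [],
}
after = rebalance(before)
print(after)
```

Only four of the eight partitions move in this example; the rest stay where they are, which is the property that makes scaling "on the fly" plausible.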
[memsql.cnf]
master-agg=agg-1
[Diagram: Database Client, two Aggregators, and four Leaf nodes, each holding PARTITIONS]
Aggregators too
38 Apache Kafka
● Messaging Queue
● Distributed
● Durable
● Publish-Subscribe
● Process
● “Source of Truth”
● Open Source
Deliver Faster Insights
● Scalable ANSI SQL
● Full ACID capabilities
● Support for JSON, Geospatial,
and Full-Text Search
● Fast Query Vectorization and
Compilation
● Extensibility with Stored
Procedures, UDFs, UDAs
39
Fast Data Ingestion
● Stream ingestion
● Fast parallel bulk loading
● Built-in Create Pipeline
● Transactional Consistency
● Exactly-Once Semantics
● Native integrations with
Kafka, AWS S3, Azure Blob,
HDFS
40
41
Stream ingestion
Batch loading
Fully parallel
Arbitrary transforms
Any language
Transactional consistency
Exactly-once semantics
CREATE
PIPELINE
42
CREATE PIPELINE twitter_pipeline AS
LOAD DATA KAFKA "public-kafka.memcompute.com:9092/tweets-json"
INTO TABLE tweets
WITH TRANSFORM ('/path/to/executable', 'arg1', 'arg2')
(id, tweet);
START PIPELINE twitter_pipeline;
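The WITH TRANSFORM clause points at an executable, and the deck says it can be written in any language. A minimal sketch of what such an executable might look like in Python, under these assumptions (none are stated on the slide): the transform receives newline-delimited JSON on standard input, writes tab-separated rows matching (id, tweet) to standard output, and the JSON field names are id and text. Check the input framing for your MemSQL version before relying on this.

```python
import io
import json
from typing import TextIO


def transform(source: TextIO, sink: TextIO) -> int:
    """Convert newline-delimited JSON records to tab-separated (id, tweet) rows."""
    written = 0
    for line in source:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed input rather than fail the batch
        if not isinstance(record, dict) or "id" not in record or "text" not in record:
            continue  # skip records without the assumed fields
        # Tabs and newlines would break the row format, so flatten them.
        text = str(record["text"]).replace("\t", " ").replace("\n", " ")
        sink.write(f"{record['id']}\t{text}\n")
        written += 1
    return written


# Demonstration with in-memory streams instead of real stdin/stdout.
sample = io.StringIO(
    '{"id": 1, "text": "hello\\tkafka"}\n'
    "not json\n"
    '{"id": 2, "text": "hi"}\n'
)
out = io.StringIO()
rows = transform(sample, out)
print(rows, repr(out.getvalue()))
```

Used as a real transform, the script would end with `transform(sys.stdin, sys.stdout)` (after `import sys`) and be marked executable.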
45
[Diagram: Data Source (ex: NFS, S3, HDFS, Kafka), PIPELINE, MemSQL]
1. MemSQL polls for changes from a data source system.
2. MemSQL pulls the data into its memory space (no commit) where a transform can be applied.
3. The data is committed in a transaction (and in parallel).
46
[Diagram: Aggregator PIPELINE (metadata query); four Leaf PIPELINE nodes alongside Kafka Brokers 1 to 4; data reshuffle]
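One reading of this diagram is that the aggregator learns the topic's partition layout from a metadata query and the leaves then pull their share in parallel. A minimal sketch of that reading, assuming a round-robin assignment; the topic contents, leaf names, and helper functions are hypothetical, and the data reshuffle step is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def plan(kafka_partitions: List[int], leaves: List[str]) -> Dict[str, List[int]]:
    """Aggregator side: assign Kafka partitions to leaves round-robin."""
    assignment: Dict[str, List[int]] = {leaf: [] for leaf in leaves}
    for index, partition in enumerate(kafka_partitions):
        assignment[leaves[index % len(leaves)]].append(partition)
    return assignment


def pull(partitions: List[int], topic: Dict[int, List[str]]) -> List[str]:
    """Leaf side: read every message from the partitions this leaf owns."""
    return [message for p in partitions for message in topic[p]]


# Hypothetical topic: Kafka partition number mapped to its messages.
topic = {0: ["a0", "a1"], 1: ["b0"], 2: ["c0"], 3: ["d0", "d1"]}
leaves = ["leaf-1", "leaf-2", "leaf-3", "leaf-4"]

assignment = plan(sorted(topic), leaves)  # stands in for the metadata query
with ThreadPoolExecutor(max_workers=len(leaves)) as pool:
    futures = {leaf: pool.submit(pull, assignment[leaf], topic) for leaf in leaves}
    loaded = {leaf: future.result() for leaf, future in futures.items()}
print(loaded)
```

Because each leaf reads only its own partitions, no single node has to see the whole stream.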
Demo
47
Q&A
48
Thank You
