Data Modeling IoT and Time Series data in NoSQL
Matthew Brender
Drew Kerrigan
{ "Matt": [
"mbrender@basho.com",
"mjbrender",
"@mjbrender",
"ruby, javascript, go"
] }
{ "Drew": [
"dkerrigan@basho.com",
"drewkerrigan",
"@dr00_b",
"erlang, elixir, go"
] }
Meet your presenters
Basho Technologies | 2
Basho Snapshot
Distributed Systems Software for Big Data, IoT and Hybrid Cloud applications
• Founded January 2008
• 2011: Creators of Riak
– Riak core: used by Goldman, Visa…
– Riak KV: Feature-rich distributed NoSQL database
– Riak S2: Object and cloud storage software
• 2015: New products
– Basho Data Platform: NoSQL, caching & analytics
– Riak TS: Distributed database designed for time series
• 120+ employees
• Global offices: Seattle (HQ), Washington DC, London, Tokyo
• Time Series Data
• Introducing Riak TS
• Data Modeling
• Coding with Riak TS
What is Time Series?
How Is Time Series Data Different?
• High performance reads and writes of time series data
• Data location matters
• Data needs to be easy to retrieve using range queries

select *
from devices
where time >= '2015-08-06 01:00:00'
and time <= '2015-08-06 01:10:00'
and errorcode = 555123
and device_type = 'mobile'

• Higher write volumes
• All while still being highly available!
• With no data loss, even with a huge number of sources
• Eventually rolled up and compressed, with the details expired
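
The last point, rolling detailed readings up before the details expire, can be sketched as follows. This is a minimal illustration in Python rather than anything Riak TS provides; the function name `roll_up`, the one-minute bucket width and the sample readings are all assumptions made for the example.

```python
from collections import defaultdict


def roll_up(points, bucket_ms=60_000):
    """Average (timestamp_ms, value) points into fixed-width time buckets."""
    totals = defaultdict(lambda: [0.0, 0])  # bucket start -> [sum, count]
    for ts_ms, value in points:
        # Floor the timestamp to the start of its bucket.
        bucket_start = ts_ms - (ts_ms % bucket_ms)
        totals[bucket_start][0] += value
        totals[bucket_start][1] += 1
    return {start: s / n for start, (s, n) in sorted(totals.items())}


# Hypothetical readings: (epoch milliseconds, temperature)
readings = [(0, 20.0), (30_000, 22.0), (60_000, 30.0), (119_999, 32.0)]
rollup = roll_up(readings)
print(rollup)  # {0: 21.0, 60000: 31.0}
```

Once the per-bucket averages are stored, the raw readings they summarize can be expired.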
Introducing Riak TS
[Diagram: Basho Data Platform]
• Service instances: Spark, Redis (caching), Solr, Elasticsearch, web services, 3rd party web services & integrations
• Storage instances: Riak KV (key/value), Riak S2 (object storage), Riak TS (time series), document store, columnar, graph
• Core services: replication & synchronization, message routing, cluster management & monitoring, logging & analytics, internal data store
Riak TS Feature Details
Feature Overview
• Data co-location by time and geohash or, more generally, series and data family
– Benefit: Easily analyze temporal and geocoded data
• Configure a time series bucket-type that propagates across the cluster using a simple, SQL-like command
– Benefit: Simple setup for faster ROI
• Greater data locality
– Benefit: Faster data storage and retrieval
• Option to store structured and semi-structured data
– Benefit: Clean data written to the database, eliminating the need to cleanse data
• Write queries using a subset of SQL
– Benefit: Faster application development. Write applications to extract and analyze your data in a familiar language
• Near-linear scaling
– Benefit: Easy to grow database to meet data demands
• High Availability for ingest
– Benefit: No data loss even when data is streaming from a large number of sources
Riak TS Feature Details
• Same distributed systems benefits of Riak KV
Operational Simplicity
Fault Tolerance
Robust Client APIs
Broad Client Libraries
Massive Scalability
CRDTs
Active Anti-Entropy
Masterless
High Availability
Low Latency
Read Repair
Riak Search
Riak TS Optimization
Optimized Deployment
• Data Co-Location
• Composite Keys: time or geohash, data family
• Time quantization (quantum)
Simplified Data Modeling
• DDL: Table and field definitions support structured and semi-structured data
Fast Queries and Analysis
• Range Queries (SQL based)
• LevelDB filtering
• Spark Connector
Riak TS Optimization
Riak has a masterless architecture in which every node in a cluster is capable of serving read and write requests.
Requests are routed to nodes using standard load balancing.
Riak KV Hashing
[Diagram: PUT]
[Diagram: 2i Query]
Riak TS Hashing
[Diagram: PUT]
[Diagram: TS Query]
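
One way to picture the difference between KV and TS hashing is the sketch below. It is a simplified Python illustration, not Riak's implementation: the ring size, the fixed-length "year" quantum, the modulo placement and the function names are assumptions made for the example. The idea it shows is that a KV object is placed by hashing its bucket and key, while a time series row is placed by hashing its quantized timestamp together with the other partition key fields, so rows that fall in the same quantum land on the same partition.

```python
import hashlib

RING_SIZE = 64                        # assumed number of partitions
YEAR_MS = 365 * 24 * 60 * 60 * 1000   # simplified fixed-length "year" quantum


def partition_for(key: str) -> int:
    """Map a string key to a partition by hashing it (modulo placement)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def kv_partition(bucket: str, key: str) -> int:
    """KV style: every distinct bucket/key pair hashes independently."""
    return partition_for(f"{bucket}/{key}")


def ts_partition(table, family, series, ts_ms, quantum_ms=YEAR_MS):
    """TS style: hash the quantum start instead of the raw timestamp."""
    quantum_start = ts_ms - (ts_ms % quantum_ms)
    return partition_for(f"{table}/{quantum_start}/{family}/{series}")


# Two rows one second apart share a quantum, so they share a partition.
first = ts_partition("HardDrives", "all", "true", 1365555661000)
second = ts_partition("HardDrives", "all", "true", 1365555662000)
print(first == second)  # True
```

With KV-style placement the same two rows, keyed by their raw timestamps, would usually hash to unrelated partitions, which is why a range read has to touch many of them.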
RIAK TS – Storing Structured Data
• Key format
– Objects have a composite key (partition key and local key)
• Tables
– Buckets can be defined as tables
– Tables can have a schema defined using DDL
– Columns in the table can be typed
• Data Validation
– Data is validated on input
Buckets used to Define Tables
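
Validating data on input against typed columns can be sketched like this. The Python below is only an illustration of the idea, not how Riak TS validates rows; the `SCHEMA` list mirrors the columns of the HardDrives table defined later in this deck, and the type mapping and the `validate_row` function are assumptions made for the example.

```python
# (column name, expected Python type, nullable) in table order
SCHEMA = [
    ("date", int, False),         # TIMESTAMP as epoch milliseconds
    ("family", str, False),
    ("failure", str, False),
    ("serial", str, True),
    ("model", str, True),
    ("capacity", float, True),
    ("temperature", float, True),
]


def validate_row(row, schema=SCHEMA):
    """Return the row unchanged, or raise ValueError if it breaks the schema."""
    if len(row) != len(schema):
        raise ValueError(f"expected {len(schema)} columns, got {len(row)}")
    for value, (name, col_type, nullable) in zip(row, schema):
        if value is None:
            if not nullable:
                raise ValueError(f"{name} must not be null")
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, col_type):
            raise ValueError(f"{name} must be of type {col_type.__name__}")
    return row


good_row = [1365555661000, "all", "false", "MJ0351YNG9Z0XA",
            "Hitachi HDS5C3030ALA630", 3000592982016.0, 26.0]
validate_row(good_row)  # passes; a string in the date column would raise
```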
RIAK TS – Range Queries
• Use Cases
– Range queries
• Implementation Details
– SQL based query language
– Filtering rows based on column expressions
– Filtering executed in backend
– Specific columns are extracted
– Simple select with WHERE clause
• for numbers: >, >=, <, <=, =, !=
• for other data types: =, !=
• AND, OR (nesting of operators is supported)
Query Like SQL
select *
from devices
where time >= '2015-08-06 01:00:00'
and time <= '2015-08-06 01:10:00'
and errorcode = 555123
and device_type = 'mobile'
Data Modeling
How does one approach time series data?
The first rule…
The real first rule of data modeling:
• Decide what questions you want to ask of the data
– Graphs?
– Granularity?
– Analysis?
– Monitoring?
Graphs
Sample Data Exercise
Hard drive test data
– https://www.backblaze.com/hard-drive-test-data.html
– https://en.wikipedia.org/wiki/S.M.A.R.T.
Data Characteristics
[Date, Serial Number, Model, Capacity (bytes), Failure, …, smart_194_raw (Temp), …]
Sample Row:
• Date: “2013-04-10”
• Model: “Hitachi HDS5C3030ALA630”
• Failure: 0
• Temp: 26
Which columns are good candidates for indexing given the question we are asking of the data?
Define the Conceptual Query
Effect of temperature on hard drive stability
Approach 1:
SELECT * FROM HardDrives
WHERE date >= '2013-01-01'
AND date <= '2013-12-31'
AND failure = 'true'
“Find all failures in 2013”
• Pros:
– All data is colocated physically
• Cons:
– Requires client side processing for further analysis
Create the Table
riak-admin bucket-type create HardDrives '{"props":{"n_val":3,
"table_def":"
CREATE TABLE HardDrives (
  date TIMESTAMP NOT NULL,
  family VARCHAR NOT NULL,
  failure VARCHAR NOT NULL,
  serial VARCHAR,
  model VARCHAR,
  capacity FLOAT,
  temperature FLOAT,
  PRIMARY KEY (
    (quantum(date, 1, 'y'), family, failure),
    date, family, failure))"}}'
Ingest the Data
RawRow = [
  <<"2013-04-10">>,               %% Date
  <<"MJ0351YNG9Z0XA">>,           %% Serial
  <<"Hitachi HDS5C3030ALA630">>,  %% Model
  <<"3000592982016">>,            %% Capacity
  <<"0">>,                        %% Failure
  …, <<"26">>, …],                %% SMART Stats with Temperature
ProcessedRow = [
  1365555661000,                  %% Date
  <<"all">>,                      %% Family
  <<"false">>,                    %% Failure
  <<"MJ0351YNG9Z0XA">>,           %% Serial
  <<"Hitachi HDS5C3030ALA630">>,  %% Model
  3000592982016.0,                %% Capacity
  26.0],                          %% Temperature
Ingest the Data
ProcessedRow = [
convert(lists:nth(1, RawRow), date), % date
<<"all">>, % family
convert(lists:nth(5, RawRow), boolean), % failure
lists:nth(2, RawRow), % serial
lists:nth(3, RawRow), % model
convert(lists:nth(4, RawRow), float), % capacity
convert(lists:nth(51, RawRow), float) % temp
],
riakc_ts:put(Pid,<<"HardDrives">>,[ProcessedRow]).
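
The `convert` helper used above is not shown in the deck. The Python sketch below illustrates the kind of conversion it implies, turning a raw CSV row of strings into typed values in table column order. The function names, the midnight-UTC interpretation of the date and the shortened sample row (temperature at index 5 rather than in column 51) are assumptions made for the example.

```python
from datetime import datetime, timezone


def date_to_epoch_ms(date_str):
    """Convert 'YYYY-MM-DD' to epoch milliseconds at midnight UTC."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


def process_row(raw_row, temp_index):
    """Build [date, family, failure, serial, model, capacity, temperature]."""
    return [
        date_to_epoch_ms(raw_row[0]),                # date
        "all",                                       # family (constant)
        "true" if raw_row[4] == "1" else "false",    # failure flag as text
        raw_row[1],                                  # serial
        raw_row[2],                                  # model
        float(raw_row[3]),                           # capacity in bytes
        float(raw_row[temp_index]),                  # temperature
    ]


# Shortened sample row; the real file has many more SMART columns.
raw = ["2013-04-10", "MJ0351YNG9Z0XA", "Hitachi HDS5C3030ALA630",
       "3000592982016", "0", "26"]
processed = process_row(raw, temp_index=5)
print(processed)
```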
Query the Data
Start = integer_to_list(date_to_epoch_ms(<<"2013-01-01">>)),
End = integer_to_list(date_to_epoch_ms(<<"2013-12-31">>)),
Query = "select * from HardDrives
where date >= " ++ Start ++ "
and date <= " ++ End ++ "
and family = 'all'
and failure = 'true'",
{_Fields, Results} =
riakc_ts:query(Pid, list_to_binary(Query)),
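
The `date_to_epoch_ms` helper is also not shown in the deck. If it returns midnight at the start of the given day, an upper bound of 2013-12-31 would leave out rows stamped later on that day. The Python sketch below is one way to compute bounds that cover the whole year and assemble the same query text; `year_bounds_ms` and `build_query` are hypothetical names, and the string formatting assumes trusted inputs.

```python
from datetime import datetime, timezone


def year_bounds_ms(year):
    """Return (first, last) epoch millisecond of a calendar year in UTC."""
    start = datetime(year, 1, 1, tzinfo=timezone.utc)
    next_start = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    start_ms = int(start.timestamp()) * 1000
    end_ms = int(next_start.timestamp()) * 1000 - 1
    return start_ms, end_ms


def build_query(table, year, family="all", failure="true"):
    """Assemble the range query text used in this deck for one year."""
    start_ms, end_ms = year_bounds_ms(year)
    return (f"select * from {table} "
            f"where date >= {start_ms} and date <= {end_ms} "
            f"and family = '{family}' and failure = '{failure}'")


bounds = year_bounds_ms(2013)
query = build_query("HardDrives", 2013)
print(bounds)  # (1356998400000, 1388534399999)
print(query)
```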
Process the Results
Total Failures: 112
Results:
[{
1365555661000,
<<"all">>,
<<"true">>,
<<"9VS3FM1J">>,
<<"ST31500341AS">>,
1500301910016.0,
31.0
},
{...},
{...},
...
]
Results
130> ts:approach1().
Total Failures: 112
"ST31500341AS": ...
"ST3000DM001": ...
"Hitachi HDS5C4040ALE630": ...
"ST4000DM000": ...
"ST31500541AS":
18.0=1 19.0=1 20.0=2 21.0=3 22.0=2
24.0=2 25.0=1 29.0=3 30.0=1
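
The client-side processing that turns the returned rows into a per-model temperature histogram is not shown in the deck. A minimal Python sketch of that step could look like the following; the function name and the three sample rows are made up for illustration and are not the deck's data.

```python
from collections import Counter, defaultdict


def failures_by_model_and_temp(rows):
    """Count failure rows per model and per temperature."""
    histogram = defaultdict(Counter)
    for _date, _family, _failure, _serial, model, _capacity, temp in rows:
        histogram[model][temp] += 1
    return histogram


# Hypothetical result rows in table column order.
rows = [
    (1365555661000, "all", "true", "SERIAL-A", "MODEL-X", 1.5e12, 31.0),
    (1365642061000, "all", "true", "SERIAL-B", "MODEL-X", 1.5e12, 31.0),
    (1365728461000, "all", "true", "SERIAL-C", "MODEL-Y", 3.0e12, 24.0),
]
histogram = failures_by_model_and_temp(rows)
for model, temps in sorted(histogram.items()):
    line = " ".join(f"{t}={n}" for t, n in sorted(temps.items()))
    print(f'"{model}": {line}')
```

This is the extra work listed under the cons of Approach 1: every matching row is shipped to the client before it can be counted.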
Refine the Query
New Query
SELECT * FROM HardDrives
WHERE date >= '2013-01-01'
AND date <= '2013-12-31'
AND model = 'ST31500541AS'
AND failure = 'true'
New Primary Key
PRIMARY KEY (
(quantum(date, 1, 'y'), model, failure),
date, model, failure)
Same (but more focused) Results
"ST31500541AS":
18.0=1 19.0=1 20.0=2 21.0=3 22.0=2
24.0=2 25.0=1 29.0=3 30.0=1
Think Outside the Box
New Approach: Multi-Model with Riak KV
Conceptual Query:
Read the single value of a bunch of counters!
“Find the number of failures for each Quantum, Model, and Temperature combination”
• Pros:
– Each data point is pre-calculated, so very little client side processing
– Potentially faster, depending on a lot of variables
• Cons:
– Requires the desire to know very specific stat values prior to writing data
– Requires several counter writes for every row of raw data
Create the Bucket Type
riak-admin bucket-type create HardDriveCounters \
  '{"props":{"datatype":"counter"}}'
Ingest the Data
Failure = lists:nth(5, RawRow),              % failure
Year = extract_year(lists:nth(1, RawRow)),   % year
Model = lists:nth(3, RawRow),                % model
Temp = lists:nth(51, RawRow),                % temperature
Bucket = {<<"HardDriveCounters">>, Year},
Key = list_to_binary(binary_to_list(Model) ++
                     binary_to_list(Temp)),
%% We only care about failures
case Failure of
    <<"1">> ->
        Counter = riakc_counter:new(),
        Counter1 = riakc_counter:increment(Counter),
        riakc_pb_socket:update_type(Pid, Bucket, Key,
                                    riakc_counter:to_op(Counter1));
    _ -> ok
end.
Query the Data
StartTemp = 16,
EndTemp = 28,
Results = range_get(Pid, <<"2013">>, <<"ST31500341AS">>,
                    StartTemp, EndTemp, []).
...
%% Collects {Temp, NumFailures} for StartTemp =< Temp < EndTemp.
range_get(_Pid, _Year, _Model, EndTemp, EndTemp, Accum) ->
    lists:reverse(Accum);
range_get(Pid, Year, Model, CurrentTemp, EndTemp, Accum) ->
    Bucket = {<<"HardDriveCounters">>, Year},
    Key = list_to_binary(binary_to_list(Model) ++
                         integer_to_list(CurrentTemp)),
    %% A missing counter means no failures were recorded.
    NumFailures =
        case riakc_pb_socket:fetch_type(Pid, Bucket, Key) of
            {ok, Counter} -> riakc_counter:value(Counter);
            {error, {notfound, _Type}} -> 0
        end,
    range_get(Pid, Year, Model, CurrentTemp + 1, EndTemp,
              [{CurrentTemp, NumFailures} | Accum]).
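
The counter approach can be tried without a cluster using the Python sketch below, where an in-memory `Counter` stands in for the HardDriveCounters bucket type. The names `record_failure` and `range_get` and the storage stand-in are assumptions made for the example; the sample counts mirror the 18 to 22 degree entries shown for ST31500541AS on the Results slide.

```python
from collections import Counter

# Stand-in for the counter bucket: (year, model + temperature) -> count
counters = Counter()


def record_failure(year, model, temp):
    """Write side: one counter increment per failed drive reading."""
    counters[(year, f"{model}{temp}")] += 1


def range_get(year, model, start_temp, end_temp):
    """Read side: one key lookup per temperature in [start_temp, end_temp)."""
    return [(temp, counters[(year, f"{model}{temp}")])
            for temp in range(start_temp, end_temp)]


for temp, count in [(18, 1), (19, 1), (20, 2), (21, 3), (22, 2)]:
    for _ in range(count):
        record_failure("2013", "ST31500541AS", temp)

results = range_get("2013", "ST31500541AS", 18, 24)
print(results)  # [(18, 1), (19, 1), (20, 2), (21, 3), (22, 2), (23, 0)]
```

The read side issues one get per temperature value, which is the client-side multi-get that this approach requires.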
Data Modeling in Riak
Multi-Model with Riak KV
• Keys: Create your own using quantum + “dimension”
• Range Queries: Create your own client-side multi-get to issue incremental key gets
• Compound Queries: Create more composite keys!
• Data Location: Sometimes inefficient because data is spread across many vnodes / partitions
Data Modeling in Riak
Time Series Modeling in Riak TS
• Keys: Automatically managed based on your PRIMARY KEY definition as well as the values in those fields
• Range Queries: Use a well-known subset of SQL to specify a start and end in a WHERE clause, which performs a server-side multi-get
• Compound Queries: Possible with a wisely chosen composite PRIMARY KEY, although multiple tables may still be necessary
• Data Location: Very efficient data grouping by quanta, families, and series
Conclusion
Part of the Basho Data Platform
[Diagram: Basho Data Platform]
• Service instances: Spark, Redis (caching), Solr, Elasticsearch, web services, 3rd party web services & integrations
• Storage instances: Riak KV (key/value), Riak S2 (object storage), Riak TS (time series), document store, columnar, graph
• Core services: replication & synchronization, message routing, cluster management & monitoring, logging & analytics, internal data store
QUESTIONS?
Spend time getting to know us
@basho
@riconconf
OPEN SOURCE
Basho Data Platform (code)
• Riak KV with parallel extract
Basho Data Platform Add-ons (code)
• Spark + Spark Connector
Download a build
ENTERPRISE
Basho Data Platform, Enterprise
• Riak EE with multi-cluster replication
• Spark Leader Election Service
Basho Data Platform Add-ons
• Redis + Cache Proxy
• Spark Workers + Spark Master
Contact us to get started
