ClickHouse for Experimentation
Gleb Kanterov
@kanterov
2018-07-03
Quick Facts
170M Monthly Active Users
75M Subscribers
35M Tracks
65 Markets
[1] https://investors.spotify.com
Organization
1 Company, 10 Organizations, 30+ Tribes, 150+ Squads
Move fast, break things
Ask for forgiveness, not for permission
AUTONOMY
Hadoop@Spotify
● On-Premise
● 2,500 nodes
● 100 PB Disk
● 100 TB RAM
● 100B+ events per day
● 20K+ jobs per day
Hadoop@Spotify
● Migration from On-Premise to GCP
● Moved 100 PB of data
● Our Hadoop cluster is dead
What are experiments, and why ClickHouse?
Randomized Controlled Experiment
An experiment where all subjects involved in the experiment are treated the same except for one deviation. One variable is changed in order to isolate the results.
All Khan Academy content is available for free at www.khanacademy.org
A/B Testing
A/B Testing is a randomized controlled experiment where one variable is tested.
Example hypothesis: "Our new recommendation algorithm increases content consumption."
How to verify?
1. Formulate hypothesis
2. Run A/B test
3. See if there is a statistically significant increase in consumption.
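Step 3 can be illustrated with a minimal sketch. The slides do not say which statistical test the platform uses, so the choice here (Welch's unequal-variance statistic with a normal approximation, which is only reasonable for large samples) is an assumption, and the function and variable names are hypothetical.

```python
import math
from statistics import mean, variance


def welch_z_test(control, treatment):
    """Return (difference in means, z statistic, two-sided p-value)."""
    diff = mean(treatment) - mean(control)
    # Standard error of the difference, allowing unequal variances (Welch).
    se = math.sqrt(variance(control) / len(control)
                   + variance(treatment) / len(treatment))
    z = diff / se
    # Two-sided p-value under the standard normal distribution:
    # 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return diff, z, p_value


# Hypothetical per-user consumption values for each variant.
control = [10, 12, 11, 13, 9, 10, 12, 11]
treatment = [15, 17, 16, 18, 14, 15, 17, 16]
diff, z, p = welch_z_test(control, treatment)
print(f"lift={diff:.2f} z={z:.2f} significant={p < 0.05}")
```

With samples as small as the ones above, a t-distribution would be the more careful choice; the normal approximation keeps the sketch dependency-free.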
Metrics for Experimentation Platform v1
[Architecture diagram. Event Delivery: Cloud Pub/Sub, Cloud Storage, Cloud Dataproc. Data Pipelines: Cloud Dataflow. Storage: aggregated data in Cloud Bigtable, granular data in Cloud Storage. Presentation: web application on Compute Engine. Ad-hoc Analytics: granular data in BigQuery. Actors: Product Teams, Product Owners, Data Scientists.]
1. Event Delivery
Developers instrument applications and services using an SDK. Events are collected and published to Pub/Sub. Batch jobs read data from Pub/Sub, deduplicate and anonymize it, and then store it in hourly partitions on GCS.
Exposing users to experiments and configuring A/B variations on clients is done by dedicated services.
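The batch step described for event delivery (deduplicate, anonymize, write hourly partitions) could look roughly like the sketch below. The event fields (`event_id`, `user_id`, `timestamp`), the salted SHA-256 pseudonymization, and the partition key format are all assumptions made for illustration; the slides do not describe the actual schema or anonymization method.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timezone


def anonymize(user_id: str, salt: str) -> str:
    """Replace a user id with a salted SHA-256 digest (assumed scheme)."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()


def to_hourly_partitions(events, salt):
    """Deduplicate by event_id, anonymize user ids, bucket by UTC hour."""
    seen = set()
    partitions = defaultdict(list)
    for event in events:
        if event["event_id"] in seen:
            continue  # drop redelivered duplicates
        seen.add(event["event_id"])
        ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
        hour = ts.strftime("%Y-%m-%dT%H")  # hypothetical partition key
        clean = {**event, "user_id": anonymize(event["user_id"], salt)}
        partitions[hour].append(clean)
    return dict(partitions)


events = [
    {"event_id": "a", "user_id": "u1", "timestamp": 0},
    {"event_id": "a", "user_id": "u1", "timestamp": 0},     # duplicate
    {"event_id": "b", "user_id": "u2", "timestamp": 3599},
    {"event_id": "c", "user_id": "u1", "timestamp": 3600},  # next hour
]
for hour, rows in sorted(to_hourly_partitions(events, salt="s").items()):
    print(hour, len(rows))
```

In a real pipeline each partition would be written to its own GCS path rather than kept in memory.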
2. Data Pipelines and Storage
Data gets transformed and aggregated using Dataflow batch jobs, and stored in Bigtable, GCS and BigQuery. Bigtable contains pre-computed aggregated experiment results. BigQuery has granular data used in ad-hoc analysis.
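To make "pre-computed aggregated experiment results" concrete, here is a minimal sketch of folding granular rows into one summary per experiment and variant. The row layout and the choice of summary fields (user count, sum, sum of squares) are assumptions; they are simply enough to recover a mean and a variance later without the raw rows.

```python
from collections import defaultdict


def aggregate(rows):
    """Fold (experiment, variant, value) rows into per-cell summaries."""
    totals = defaultdict(lambda: {"users": 0, "sum": 0.0, "sum_sq": 0.0})
    for experiment, variant, value in rows:
        cell = totals[(experiment, variant)]
        cell["users"] += 1
        cell["sum"] += value
        cell["sum_sq"] += value * value
    return dict(totals)


# Hypothetical granular data: one row per user.
rows = [
    ("exp1", "control", 2.0),
    ("exp1", "control", 4.0),
    ("exp1", "treatment", 3.0),
]
for key, cell in aggregate(rows).items():
    print(key, cell)
```

A summary like this is cheap to serve, but any dimension that was not part of the grouping key is no longer available for slicing.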
3. Presentation
Users of the experimentation platform see their experiment results in a web application. Statistical tests and health checks are performed automatically.
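The slides do not list which health checks run automatically. One common check for A/B tests is a sample ratio mismatch test, shown here as an assumed example: a chi-square goodness-of-fit test on the number of users per variant against the planned split. The function name and the two-variant restriction are choices made for this sketch.

```python
import math


def sample_ratio_mismatch(control_users, treatment_users,
                          expected_control_share=0.5):
    """Return (chi-square statistic, p-value) for a two-variant split."""
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # With one degree of freedom, P(X > chi2) == erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p = sample_ratio_mismatch(5000, 5500)
print(f"chi2={chi2:.2f} mismatch={p < 0.001}")
```

A very small p-value suggests that assignment or logging is broken, in which case the experiment's metric results should not be trusted.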
4. Ad-hoc Analytics
Data scientists do ad-hoc exploration in Jupyter notebooks using BigQuery. Here they answer experiment-specific questions that the experimentation system does not support automatically.
What works well
A centralized team owns hundreds of core metrics. Experiment analysis and planning are automatic, which allows experiments to be concluded without manual analysis. Autonomous feature teams can move fast and iterate on their product.
Problems
Not every metric is worth centralizing. The centralized team became a bottleneck for feature teams. As a result, too much repetitive work goes into notebooks and ad-hoc queries.
Reasons
1. Experimentation isn't only about hypothesis testing, but also about learning from experiments. Aggregated data in Bigtable wasn't granular enough, and didn't have enough dimensions.
2. Teams can't add a new metric without involving a central team.
What we want
Provide teams with more granular data out of the box, and give them a way to define a new metric.
[Diagram change: in the Storage layer, "Aggregated Data, Cloud Bigtable" is replaced by "Granular Data, OLAP Database".]
Requirements
● Serve hundreds of QPS with sub-second latency
● Queries and data are known in advance
● Maintain 10x as many metrics at the same cost
● Thousands of metrics
● Billions of rows per day in each of hundreds of tables
● Ready to be used out of the box
● Leverage existing infrastructure as much as feasible
● Hide unnecessary complexity from internal users
What about BigQuery?
● Supports Standard SQL
● Don’t have to optimize datasets in advance
● Works great for heavy queries with joins among multiple datasets
● Needs no operations work and no machines kept running
● Good for interactive ad-hoc queries (~ minutes)
● Isn't the best fit for a high volume of low-latency queries that are known in advance
Why ClickHouse?
● Built proofs of concept with several OLAP stores (ClickHouse, Druid, Pinot, ...)
● ClickHouse has the simplest architecture
● Powerful SQL dialect close to Standard SQL
● A comprehensive set of built-in functions and aggregators
● Was ready to be used out of the box
● Superset integration is great
● Easy to query using clickhouse-jdbc and jOOQ
Metrics for Experimentation Platform v2
[Architecture diagram. Event Delivery: Cloud Pub/Sub, Cloud Storage, Cloud Dataproc. Data Pipelines: Cloud Dataflow. Storage: granular data in ClickHouse and Cloud Storage. Presentation: web application. Ad-hoc Analytics: BigQuery, Superset. Actors: Product Teams, Product Owners, Data Scientists.]
5. ClickHouse
Interactive queries on granular data. Reduce demand for notebooks and BigQuery by offering dashboards and exploration in Superset.
6. Metrics Catalog
A centralized place for teams to define their own metrics. It takes 20 minutes to define a metric.
[Diagram additions: Metrics Catalog, Metrics API, metric definitions.]
What we have built
● Our own DSL for defining metrics, and a centralized metrics catalog
● An expressive and simple model that we can efficiently scale to thousands of metrics
● Existing components generalized to work with the Metrics DSL
○ prepare data and ingest it into ClickHouse
○ denormalize with conformed dimensions
○ create dashboards, tables and charts in Superset
○ do statistical tests, and expose results through an API
○ define ownership, tiering, and other attributes
○ integrate with the rest of the infrastructure for alerting, monitoring, data quality, anomaly detection, access control, etc.
● Users don't write ClickHouse SQL or need to know how ClickHouse works
● API to query metrics and metadata
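The slides do not show the Metrics DSL itself, so the sketch below is purely hypothetical: it only illustrates the idea that a declarative metric definition can carry catalog metadata and be compiled into ClickHouse SQL, so that users never write the query themselves. All field, table and column names are invented; the aggregate functions `sum`, `avg`, `count` and `uniq` are standard ClickHouse aggregates.

```python
from dataclasses import dataclass

# Aggregations this hypothetical DSL allows, mapped to ClickHouse functions.
AGGREGATIONS = {"sum": "sum", "avg": "avg", "count": "count", "uniq": "uniq"}


@dataclass(frozen=True)
class Metric:
    name: str          # metric identifier in the catalog
    table: str         # denormalized ClickHouse table with granular rows
    value_column: str  # column to aggregate
    aggregation: str   # one of the keys in AGGREGATIONS
    owner: str         # owning team, kept as catalog metadata


def to_sql(metric: Metric, dimensions=("variant",)) -> str:
    """Compile a metric definition into a grouped ClickHouse query."""
    func = AGGREGATIONS[metric.aggregation]  # KeyError if unsupported
    dims = ", ".join(dimensions)
    return (
        f"SELECT {dims}, {func}({metric.value_column}) AS {metric.name} "
        f"FROM {metric.table} GROUP BY {dims} ORDER BY {dims}"
    )


minutes_played = Metric(
    name="minutes_played",
    table="playback_daily",
    value_column="minutes",
    aggregation="sum",
    owner="team-a",
)
print(to_sql(minutes_played))
```

Identifiers are interpolated directly into the SQL string, which is only acceptable here because definitions are assumed to come from a reviewed catalog rather than from end-user input.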
Ingestion to ClickHouse
● Move data from GCS to ClickHouse
● Use clickhouse-jdbc, custom code and RowBinary format
● Use daily partitioning, and ingest once a day
● Ingesting 5 TiB takes 1 hour on a test cluster of 9 n1-standard-32 machines with 8 NVMe SSDs in RAID 0
● Don't use materialized views in ClickHouse
● Offload most of the computation to batch data pipelines because of scalability, experience and tooling
● TODO try ClickHouse-Native-JDBC
● TODO pre-sort in data pipelines before ingesting
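The talk's ingestion code uses clickhouse-jdbc and is not shown. As an illustration of what RowBinary looks like on the wire, here is a small Python sketch that encodes one row for an assumed table with columns (Date, UInt64, String, Float64). The column layout is hypothetical; the encoding rules used (little-endian fixed-width numbers, Date as a UInt16 count of days since 1970-01-01, String as an unsigned LEB128 length followed by the bytes) follow the RowBinary format as documented by ClickHouse.

```python
import struct
from datetime import date

EPOCH = date(1970, 1, 1)


def encode_varint(n: int) -> bytes:
    """Unsigned LEB128, used by RowBinary for string lengths."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_row(day: date, user_id: int, variant: str, value: float) -> bytes:
    """Encode one row for columns (Date, UInt64, String, Float64)."""
    text = variant.encode("utf-8")
    return (
        struct.pack("<H", (day - EPOCH).days)  # Date: UInt16 days since epoch
        + struct.pack("<Q", user_id)           # UInt64, little-endian
        + encode_varint(len(text)) + text      # String: varint length + bytes
        + struct.pack("<d", value)             # Float64, little-endian
    )


row = encode_row(date(2018, 7, 3), 42, "treatment", 12.5)
print(len(row), row.hex())
```

Rows are simply concatenated with no delimiters, which is part of why the format is cheap to produce from a batch pipeline.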
What is next
● Do lambda-style ingestion for the subset of metrics with low-latency requirements
● Add more aggregations to DSL (e.g. 5 statistical moments)
● Add custom chart types to Superset
● Try ClickHouse for similar use cases within Spotify
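The slides do not explain how the planned moment aggregations would be computed. One way to make moments cheap to aggregate, sketched here as an assumption, is to store power sums per partition: they merge by plain addition, and central moments can be derived from them with the binomial expansion. The function names are hypothetical.

```python
from math import comb


def power_sums(values, order=5):
    """Return [n, sum(x), sum(x^2), ..., sum(x^order)] for one partition."""
    sums = [0.0] * (order + 1)
    for x in values:
        for k in range(order + 1):
            sums[k] += x ** k
    return sums


def merge(a, b):
    """Power sums from two partitions combine by element-wise addition."""
    return [x + y for x, y in zip(a, b)]


def central_moment(sums, k):
    """k-th central moment from power sums, via the binomial expansion."""
    n = sums[0]
    mean = sums[1] / n
    return sum(
        comb(k, j) * (sums[j] / n) * (-mean) ** (k - j)
        for j in range(k + 1)
    )


day1 = power_sums([1, 2])
day2 = power_sums([3, 4])
total = merge(day1, day2)
print(central_moment(total, 2), central_moment(total, 4))
```

Raw power sums can lose precision when values are large relative to their spread, so a production version would likely need a numerically stabler update rule.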