OLAP WITH SPARK AND 
CASSANDRA 
#CassandraSummit 
EVAN CHAN 
SEPT 2014
WHO AM I? 
Principal Engineer, Socrata, Inc. 
@evanfchan 
http://github.com/velvia 
Creator of Spark Job Server
WE BUILD SOFTWARE TO MAKE DATA USEFUL TO MORE 
PEOPLE. 
data.edmonton.ca finances.worldbank.org data.cityofchicago.org 
data.seattle.gov data.oregon.gov data.wa.gov 
www.metrochicagodata.org data.cityofboston.gov 
info.samhsa.gov explore.data.gov data.cms.gov data.ok.gov 
data.nola.gov data.illinois.gov data.colorado.gov 
data.austintexas.gov data.undp.org www.opendatanyc.com 
data.mo.gov data.nfpa.org data.raleighnc.gov dati.lombardia.it 
data.montgomerycountymd.gov data.cityofnewyork.us 
data.acgov.org data.baltimorecity.gov data.energystar.gov 
data.somervillema.gov data.maryland.gov data.taxpayer.net 
bronx.lehman.cuny.edu data.hawaii.gov data.sfgov.org
WE ARE SWIMMING IN DATA!
BIG DATA AT SOCRATA 
Tens of thousands of datasets, each one up to 30 million rows 
Customer demand for billion row datasets 
Want to analyze across datasets
BIG DATA AT OOYALA 
2.5 billion analytics pings a day = almost a trillion events a 
year. 
Roll up tables - 30 million rows per day
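The "almost a trillion events a year" figure follows directly from the daily ping count. A minimal sketch of the arithmetic, assuming a 365-day year and a constant daily rate (the object and value names are illustrative only):

```scala
object PingVolume {
  // Figure from the slide: 2.5 billion analytics pings per day.
  val pingsPerDay: Long = 2500000000L

  // Assumption: a 365-day year with a constant daily rate.
  val pingsPerYear: Long = pingsPerDay * 365L

  def main(args: Array[String]): Unit =
    println(f"$pingsPerYear%,d events per year")
}
```

That works out to roughly 912.5 billion events, which is where "almost a trillion" comes from.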
HOW CAN WE ALLOW CUSTOMERS TO QUERY A 
YEAR'S WORTH OF DATA? 
Flexible - complex queries included 
Sometimes you can't denormalize your data enough 
Fast - interactive speeds 
Near Real Time - can't make customers wait hours before 
querying new data
RDBMS? POSTGRES? 
Start hitting latency limits at ~10 million rows 
No robust and inexpensive solution for querying across shards 
No robust way to scale horizontally 
Postgres runs each query on a single thread unless you partition 
(painful!) 
Complex and expensive to improve performance (e.g. rollup 
tables, huge expensive servers)
OLAP CUBES? 
Materialize summary for every possible combination 
Too complicated and brittle 
Takes forever to compute - not for real time 
Explodes storage and memory
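To see why materializing every combination explodes, it helps to count. The sketch below is not from the talk; it assumes a cube with n independent dimensions, where every subset of dimensions is a possible group-by (2^n summaries), and the finest summary can hold up to the product of the dimension cardinalities. The names `cuboidCount` and `maxCells` are illustrative.

```scala
object CubeSize {
  // Number of distinct group-by subsets over n dimensions: 2^n.
  def cuboidCount(dimensions: Int): BigInt = BigInt(2).pow(dimensions)

  // Upper bound on cells in the finest summary:
  // the product of the per-dimension cardinalities.
  def maxCells(cardinalities: Seq[Long]): BigInt =
    cardinalities.map(BigInt(_)).product

  def main(args: Array[String]): Unit = {
    println(cuboidCount(10))             // summaries for 10 dimensions
    println(maxCells(Seq(100, 50, 365))) // cells for 3 example dimensions
  }
}
```

Ten dimensions already mean 1,024 summaries to build and keep fresh, and each added dimension doubles that.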
When in doubt, use brute force 
- Ken Thompson
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
CASSANDRA 
Horizontally scalable 
Very flexible data modelling (lists, sets, custom data types) 
Easy to operate 
No fear of number of rows or documents 
Best of breed storage technology, huge community 
BUT: Simple queries only
APACHE SPARK 
Horizontally scalable, in-memory queries 
Functional Scala transforms - map, filter, groupBy, sort 
etc. 
SQL, machine learning, streaming, graph, R, many more plugins 
all on ONE platform - feed your SQL results to a logistic 
regression, easy! 
THE Hottest big data platform, huge community, leaving 
Hadoop in the dust 
Developers love it
SPARK PROVIDES THE MISSING FAST, DEEP 
ANALYTICS PIECE OF CASSANDRA!
INTEGRATING SPARK AND CASSANDRA 
Scala solutions: 
DataStax integration: 
https://github.com/datastax/spark-cassandra-connector 
(CQL-based) 
Calliope
A bit more work: 
Use traditional Cassandra client with RDDs 
Use an existing InputFormat, like CqlPagedInputFormat 
The only likely reason to go this route is that you are not on a CQL 
version of Cassandra, or you're using Shark/Hive.
A SPARK AND CASSANDRA 
OLAP ARCHITECTURE
SEPARATE STORAGE AND QUERY LAYERS 
Combine best of breed storage and query platforms 
Take full advantage of evolution of each 
Storage handles replication for availability 
Query can replicate data for scaling read concurrency - 
independent!
SCALE NODES, NOT 
DEVELOPER TIME!!
KEEPING IT SIMPLE 
Maximize row scan speed 
Columnar representation for efficiency 
Compressed bitmap indexes for fast algebra 
Functional transforms for easy memoization, testing, 
concurrency, composition
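The "fast algebra" in the bitmap index bullet refers to answering filters with bitwise operations instead of row scans. A minimal sketch of that idea, using the JDK's uncompressed `java.util.BitSet` as a stand-in (the talk mentions compressed bitmaps; the object name, helper names and sample data here are assumptions for illustration):

```scala
import java.util.BitSet

object BitmapIndexSketch {
  // One bitmap per distinct value; bit i is set when row i holds that value.
  def build(column: Seq[String]): Map[String, BitSet] = {
    val index = scala.collection.mutable.Map.empty[String, BitSet]
    column.zipWithIndex.foreach { case (value, row) =>
      index.getOrElseUpdate(value, new BitSet(column.length)).set(row)
    }
    index.toMap
  }

  // Rows matching both predicates: a bitwise AND of two bitmaps.
  // The inputs are left unmodified.
  def and(a: BitSet, b: BitSet): BitSet = {
    val out = a.clone().asInstanceOf[BitSet]
    out.and(b)
    out
  }

  def main(args: Array[String]): Unit = {
    val crimeType = build(Seq("Theft", "Burglary", "Theft", "Theft"))
    val ward      = build(Seq("A", "A", "B", "A"))
    val matches   = and(crimeType("Theft"), ward("A"))
    println(matches) // set of row numbers where both conditions hold
  }
}
```

A production engine would likely swap in a compressed bitmap format, but the AND/OR algebra stays the same.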
SPARK AS CASSANDRA'S CACHE
EVEN BETTER: TACHYON OFF-HEAP CACHING
INITIAL ATTEMPTS 
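import org.apache.spark.SparkContext._   // pair-RDD operations such as groupByKey (Spark 1.x)
val rows = Seq( 
  Seq("Burglary", "19xx Hurston", 10), 
  Seq("Theft", "55xx Floatilla Ave", 5) 
) 
sc.parallelize(rows) 
  .map { values => (values(0), values) }   // key each row by its first column 
  .groupByKey 
  // sum the third column within each group; the cast is needed because cells are Any 
  .mapValues(group => group.map(row => row(2).asInstanceOf[Int]).sum)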
No existing generic query engine for Spark when we started 
(Shark was in infancy, had no indexes, etc.), so we built our own 
For every row, need to extract out needed columns 
Ability to select arbitrary columns means using Seq[Any], no 
type safety 
Boxing makes integer aggregation very expensive and memory 
inefficient
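The boxing cost is easier to see side by side. The sketch below is plain Scala with no Spark dependency and is only an illustration of the point, not the engine described in the talk: the row-oriented version stores each count as a boxed `java.lang.Integer` inside a `Seq[Any]` and must cast on every access, while the column-oriented version keeps a primitive `Array[Int]`.

```scala
object BoxingSketch {
  // Row-oriented: every cell is typed Any, so each Int is stored boxed.
  val rows: Seq[Seq[Any]] = Seq(
    Seq("Burglary", "19xx Hurston", 10),
    Seq("Theft", "55xx Floatilla Ave", 5)
  )

  // Needs a runtime cast (and an unboxing step) for every row.
  def sumRowOriented(rows: Seq[Seq[Any]], col: Int): Int =
    rows.map(r => r(col).asInstanceOf[Int]).sum

  // Column-oriented: one primitive array per column, no boxing, no casts.
  val counts: Array[Int] = Array(10, 5)

  def sumColumnar(column: Array[Int]): Int = {
    var total = 0
    var i = 0
    while (i < column.length) { total += column(i); i += 1 }
    total
  }
}
```

Both functions return the same total; the difference is in type safety and in how much memory and indirection each element costs.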
COLUMNAR STORAGE AND QUERYING
The traditional row-based data storage 
approach is dead 
- Michael Stonebraker
TRADITIONAL ROW-BASED STORAGE 
Same layout in memory and on disk: 
Name Age 
Barak 46 
Hillary 66 
Each row is stored contiguously. All columns in row 2 come after 
row 1.
COLUMNAR STORAGE (MEMORY) 
Name column 
Row index:        0  1 
Dictionary code:  0  1 
Dictionary: {0: "Barak", 1: "Hillary"} 
Age column 
Row index:  0   1 
Value:      46  66
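The Name column above is dictionary-encoded: each distinct string is stored once and the column itself holds small integer codes. A minimal sketch of that encoding, assuming codes are assigned in order of first appearance (the `Encoded` case class and function names are illustrative, not from the talk):

```scala
object DictionaryColumn {
  // dictionary(code) gives back the original string for that code.
  final case class Encoded(dictionary: Vector[String], codes: Array[Int])

  def encode(values: Seq[String]): Encoded = {
    // distinct keeps first-occurrence order, so codes are stable.
    val dictionary = values.distinct.toVector
    val codeOf = dictionary.zipWithIndex.toMap
    Encoded(dictionary, values.map(codeOf).toArray)
  }

  def decode(e: Encoded): Seq[String] =
    e.codes.toSeq.map(e.dictionary)
}
```

For example, `DictionaryColumn.encode(Seq("Barak", "Hillary"))` yields the dictionary `Vector("Barak", "Hillary")` and the codes `0, 1`, matching the slide.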
COLUMNAR STORAGE (CASSANDRA) 
Review: each physical row in Cassandra (i.e. one partition key) 
stores its columns together on disk. 
Schema CF 
Rowkey  Type 
Name    StringDict 
Age     Int 
Data CF 
Rowkey  0   1 
Name    0   1 
Age     46  66
ADVANTAGES OF COLUMNAR STORAGE 
Compression 
Dictionary compression - HUGE savings for low-cardinality 
string columns 
RLE 
Reduce I/O 
Only columns needed for query are loaded from disk 
Can keep strong types in memory, avoid boxing 
Batch multiple rows in one cell for efficiency
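Run-length encoding (RLE) pays off on columns where the same value repeats in long stretches, which is common once a low-cardinality column has been dictionary-encoded. A minimal sketch over an integer column, storing each run as a (value, count) pair; the representation is an assumption chosen for clarity, not the format used in the talk:

```scala
object RunLength {
  // Each run is (value, repeat count).
  def encode(values: Seq[Int]): Vector[(Int, Int)] =
    values.foldLeft(Vector.empty[(Int, Int)]) { (runs, v) =>
      runs.lastOption match {
        // Same value as the current run: extend it.
        case Some((last, n)) if last == v =>
          runs.updated(runs.length - 1, (last, n + 1))
        // Different value (or first element): start a new run.
        case _ =>
          runs :+ ((v, 1))
      }
    }

  def decode(runs: Seq[(Int, Int)]): Vector[Int] =
    runs.flatMap { case (v, n) => Vector.fill(n)(v) }.toVector
}
```

A column of a million identical codes collapses to a single pair, which is where the I/O savings come from.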
ADVANTAGES OF COLUMNAR QUERYING 
Cache locality for aggregating column of data 
Take advantage of CPU/GPU vector instructions for ints / 
doubles 
avoid row-ifying until last possible moment 
easy to derive computed columns 
Use vector data / linear math libraries
COLUMNAR QUERY ENGINE VS ROW-BASED IN 
SCALA 
Custom RDD of column-oriented blocks of data 
Uses ~10x less heap 
10-100x faster for group by's on a single node 
Scan speed in excess of 150M rows/sec/core for integer 
aggregations
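A rough idea of why a columnar group-by can be so much faster: with a dictionary-encoded key column and a primitive measure column, the aggregation is a single tight loop over two arrays with no hashing and no per-row objects. The sketch below is an assumption about the general shape of such a loop, not the custom RDD described in the talk.

```scala
object ColumnarGroupBy {
  // keyCodes: dictionary codes of the group-by column (0 until dictionarySize)
  // values:   the integer measure column, same length as keyCodes
  // Returns one sum per dictionary code.
  def sumBy(keyCodes: Array[Int], values: Array[Int], dictionarySize: Int): Array[Long] = {
    require(keyCodes.length == values.length, "columns must have equal length")
    val sums = new Array[Long](dictionarySize)
    var i = 0
    while (i < keyCodes.length) {
      sums(keyCodes(i)) += values(i)
      i += 1
    }
    sums
  }
}
```

Usage: with codes `Array(0, 1, 0)` and values `Array(10, 5, 7)`, `sumBy(..., dictionarySize = 2)` gives one total per code, which can be mapped back to strings through the dictionary only at the very end (the "avoid row-ifying" point).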
SO, GREAT, OLAP WITH CASSANDRA AND 
SPARK. NOW WHAT?
DATASTAX: CASSANDRA SPARK INTEGRATION 
DataStax Enterprise now comes with HA Spark 
HA master, that is. 
spark-cassandra-connector
SPARK SQL 
Appeared with Spark 1.0 
In-memory columnar store 
Can read from Parquet and JSON now; direct Cassandra 
integration coming 
Querying is not column-based (yet) 
No indexes 
Write custom functions in Scala .... take that Hive UDFs!! 
Integrates well with MLBase, Scala/Java/Python
CACHING A SQL TABLE FROM CASSANDRA 
import com.datastax.spark.connector._   // provides sc.cassandraTable 
val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
import sqlContext.createSchemaRDD        // Spark 1.0-1.2: implicit RDD of case classes => SchemaRDD 
sc.cassandraTable[GDeltRow]("gdelt", "1979to2009") 
  .registerAsTable("gdelt") 
sqlContext.cacheTable("gdelt") 
sqlContext.sql("SELECT Actor2Code, Actor2Name, Actor2CountryCode, AvgTone from gdelt ORDER ... 
Remember Spark is lazy, nothing is executed until the collect() 
In Spark 1.1+: registerTempTable
SOME PERFORMANCE NUMBERS 
GDELT dataset, 117 million rows, 57 columns, ~50GB 
Spark 1.0.2, AWS 8 x c3.xlarge, cached in memory 
Query                                                        Avg time (sec) 
SELECT count(*) FROM gdelt WHERE Actor2CountryCode = 'CHN'   0.49 
SELECT 4 columns Top K                                       1.51 
SELECT Top countries by Avg Tone (Group By)                  2.69
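These timings can be turned into an effective scan rate for comparison with other engines. A minimal sketch of the arithmetic, using only the row count, node count and count(*) timing quoted on this slide; it assumes every row is visited and that work is spread evenly across the 8 nodes, neither of which the talk states:

```scala
object ScanRate {
  val rows: Double = 117e6 // GDELT rows, from the slide
  val nodes: Int = 8       // 8 x c3.xlarge, from the slide

  // Effective rows scanned per second across the whole cluster.
  def rowsPerSecond(seconds: Double): Double = rows / seconds

  // Same figure divided evenly across nodes (an assumption).
  def rowsPerSecondPerNode(seconds: Double): Double =
    rowsPerSecond(seconds) / nodes

  def main(args: Array[String]): Unit = {
    println(f"${rowsPerSecond(0.49)}%,.0f rows/sec cluster-wide")
    println(f"${rowsPerSecondPerNode(0.49)}%,.0f rows/sec per node")
  }
}
```

For the 0.49 second count query this comes to roughly 239 million rows per second cluster-wide, or about 30 million per node.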
IMPORTANT - CACHING 
By default, queries will read data from source - Cassandra - 
every time 
Spark RDD Caching - much faster, but big waste of memory 
(row oriented) 
Spark SQL table caching - fastest, memory efficient
WORK STILL NEEDED 
Indexes 
Columnar querying for fast aggregation 
Tachyon support for Cassandra/CQL 
Efficient reading from columnar storage formats
LESSONS 
Extremely fast distributed querying for these use cases 
Data doesn't change much (and only bulk changes) 
Analytical queries for subset of columns 
Focused on numerical aggregations 
Small numbers of group bys 
For fast query performance, cache your data using Spark SQL 
Concurrent queries are a frontier with Spark. Use additional 
Spark contexts.
THANK YOU!
EXTRA SLIDES
EXAMPLE CUSTOM INTEGRATION USING 
ASTYANAX 
val cassRDD = sc.parallelize(rowkeys). 
flatMap { rowkey => 
columnFamily.get(rowkey).execute().asScala 
}
SOME COLUMNAR ALTERNATIVES 
MonetDB and Infobright - true columnar stores (storage + 
querying) 
Vertica and C-Store 
Google BigQuery - columnar cloud database, Dremel based 
Amazon Redshift

