Intro to Cassandra
  Tyler Hobbs
History

    Cassandra combines two designs
    – Dynamo: clustering
    – BigTable: data model
Users
Clustering

    Every node plays the same role
    – No masters, slaves, or special nodes
    – No single point of failure
Consistent Hashing

    [Figure: ring of six nodes at positions 0, 10, 20, 30, 40, 50]
    – Key: "www.google.com"
    – md5("www.google.com") maps the key to position 14 on the ring,
      between the nodes at 10 and 20
    – Replication Factor = 3
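
The ring lookup on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Cassandra's implementation: the ring size of 60, the `token_for` and `replicas_for` names, and the rule that replicas are simply the next nodes clockwise are all assumptions made for the example. The position 14 on the slide is illustrative, so the real MD5 of the key folded onto this toy ring need not equal 14.

```python
import bisect
import hashlib

RING_SIZE = 60                          # assumption: token space of the toy ring
NODE_TOKENS = [0, 10, 20, 30, 40, 50]   # node positions from the slide


def token_for(key, ring_size=RING_SIZE):
    """Hash a key with MD5 and fold the digest onto the toy ring."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % ring_size


def replicas_for(token, node_tokens=NODE_TOKENS, rf=3):
    """Return the rf nodes holding a token, walking clockwise from its owner."""
    ring = sorted(node_tokens)
    # The owner is the first node whose token is >= the key's token;
    # past the last node we wrap around to the first one.
    start = bisect.bisect_left(ring, token) % len(ring)
    count = min(rf, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]


print(replicas_for(14))                               # [20, 30, 40]
print(replicas_for(token_for("www.google.com")))      # depends on the real hash
```

With the slide's position of 14 and a replication factor of 3, the sketch picks the nodes at 20, 30, and 40.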
Clustering

    Client can talk to any node
Scaling

    RF = 2
    – [Figure: ring with nodes at 0, 10, 20, 30, 50; the node at 50 owns the red portion]
    – [Figure: a new node is added at 40, between the nodes at 30 and 50]
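
A rough way to see why adding a node is cheap: only the keys between the new node and its predecessor change owner. The sketch below assumes a toy ring of 60 positions and looks only at the primary owner of each position (it ignores the second replica implied by RF = 2); `primary_owner` is a hypothetical helper, not a Cassandra API.

```python
import bisect

RING_SIZE = 60  # assumption: token space of the toy ring


def primary_owner(token, node_tokens):
    """First node clockwise whose token is >= the given token."""
    ring = sorted(node_tokens)
    return ring[bisect.bisect_left(ring, token) % len(ring)]


before = [0, 10, 20, 30, 50]        # ring before the new node joins
after = [0, 10, 20, 30, 40, 50]     # ring after adding a node at 40

# Positions whose primary owner changes when the node at 40 joins.
moved = [t for t in range(RING_SIZE)
         if primary_owner(t, before) != primary_owner(t, after)]

print(moved)  # positions 31 through 40, all previously owned by the node at 50
```

Every other position keeps its owner, so the new node only has to take data from the node at 50.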
Node Failures

    RF = 2
    – [Figure: ring with nodes at 0, 10, 20, 30, 40, 50; a label marks the replicas]
Consistency, Availability

    Consistency
    – Can I read stale data?

    Availability
    – Can I write/read at all?

    Tunable Consistency
Consistency

    N = Total number of replicas

    R = Number of replicas read from
    – (before the response is returned)

    W = Number of replicas written to
    – (before the write is considered a success)


    W + R > N gives strong consistency
Consistency
 W + R > N gives strong consistency

 N=3
 W=2
 R=2

 2 + 2 > 3 ==> strongly consistent

 Only 2 of the 3 replicas must be
 available.
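
The W + R > N rule can be checked by brute force: if every possible set of W written replicas shares at least one replica with every possible set of R read replicas, a read always touches the latest write. The function name `always_overlaps` is made up for this sketch.

```python
from itertools import combinations


def always_overlaps(n, r, w):
    """True if every choice of w written replicas intersects every choice of r read replicas."""
    replicas = range(n)
    return all(set(read_set) & set(write_set)
               for write_set in combinations(replicas, w)
               for read_set in combinations(replicas, r))


print(always_overlaps(3, 2, 2))  # True:  2 + 2 > 3
print(always_overlaps(3, 1, 1))  # False: 1 + 1 <= 3, a read may miss the write
```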
Consistency

    Tunable Consistency
    – Specify N (Replication Factor) per data set
    – Specify R, W per operation
    – Quorum: N/2 + 1 (integer division, so N = 3 gives 2)
       • R = W = Quorum
       • Strong consistency
       • Tolerate the loss of N – Quorum replicas
    – R, W can also be 1 or N
Availability

    Can tolerate the loss of:
    – N – R replicas for reads
    – N – W replicas for writes
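
The quorum size and the loss tolerances above are simple arithmetic; a small sketch makes the numbers concrete. The helper names `quorum` and `tolerated_losses` are hypothetical and exist only for this example.

```python
def quorum(n):
    """Quorum size for n replicas: N/2 + 1 with integer division."""
    return n // 2 + 1


def tolerated_losses(n, r, w):
    """How many replicas can be down while reads and writes still succeed."""
    return {"reads": n - r, "writes": n - w}


for n in (1, 2, 3, 5):
    q = quorum(n)
    print(n, q, tolerated_losses(n, q, q))
```

With N = 3, a quorum is 2, so quorum reads and writes each survive the loss of one replica. With N = 5, a quorum is 3 and two replicas can be lost.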
CAP Theorem

    During node or network failure:
    – [Figure: plot of Availability (vertical axis, up to 100%) against
      Consistency (horizontal axis, up to 100%). The lower-left region is
      marked "Possible" and the upper-right region "Not Possible". The
      label "Cassandra" runs diagonally between the two regions.]
Clustering

    No single point of failure

    Replication that works

    Scales linearly
    – 2x nodes = 2x performance
       • For both writes and reads
    – Up to hundreds of nodes

    Operationally simple

    Multi-Datacenter Replication
Data Model

    Comes from Google BigTable

    Goals
    – Minimize disk seeks
    – High throughput
    – Low latency
    – Durable
Data Model

    Keyspace
    – A collection of Column Families
    – Controls replication settings

    Column Family
    – Kinda resembles a table
Column Families

    Static
    – Object data
    – Similar to a table in a relational database

    Dynamic
    – Pre-calculated query results
    – Materialized views
Static Column Families
                   Users
   zznate    password: *    name: Nate
   driftx    password: *    name: Brandon
   thobbs    password: *    name: Tyler
   jbellis   password: *    name: Jonathan   site: riptano.com
Dynamic Column Families

    Rows
    – Each row has a unique primary key
    – Sorted list of (name, value) tuples
       • Like a sorted map or dictionary
    – The (name, value) tuple is called a “column”
Dynamic Column Families
                     Following
   zznate    driftx:   thobbs:
   driftx
   thobbs    zznate:
   jbellis   driftx:   mdennis:   pcmanus:   thobbs:   xedin:   zznate:
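
The "sorted map" view of a row can be sketched in Python. This is only a model of the idea on these slides, not how Cassandra stores data: the `Row` class and the `follow` helper are invented for the example, and column values are left empty as in the Following column family.

```python
import bisect


class Row:
    """One row: column names kept in sorted order, each mapped to a value."""

    def __init__(self):
        self._names = []     # column names, always sorted
        self._values = {}    # column name -> value

    def insert(self, name, value=""):
        # Inserting an existing column overwrites its value in place.
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def columns(self):
        """Return the (name, value) tuples in column-name order."""
        return [(name, self._values[name]) for name in self._names]


following = {}  # row key -> Row


def follow(user, other):
    """Record that `user` follows `other` as a value-less column."""
    following.setdefault(user, Row()).insert(other)


for other in ["zznate", "driftx", "xedin", "thobbs", "pcmanus", "mdennis"]:
    follow("jbellis", other)

print([name for name, _ in following["jbellis"].columns()])
# ['driftx', 'mdennis', 'pcmanus', 'thobbs', 'xedin', 'zznate']
```

Whatever order the columns are written in, they come back sorted by name, which is what makes column slices cheap.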
Dynamic Column Families

    Column Timestamps
    – Each column (tuple) has a timestamp
    – In the case of a collision, the latest timestamp wins
    – Client specifies timestamp with write
    – Writes are idempotent
       • Infinite retries allowed
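
Why client-supplied timestamps make writes safe to retry can be shown with a small last-write-wins sketch. The `apply_write` function is hypothetical, and breaking timestamp ties by comparing values is a choice made for this example so that the result does not depend on arrival order.

```python
def apply_write(row, name, value, timestamp):
    """Keep the column version with the latest timestamp (ties: larger value)."""
    candidate = (timestamp, value)
    if name not in row or candidate > row[name]:
        row[name] = candidate
    return row


row = {}
apply_write(row, "age", 34, timestamp=200)
apply_write(row, "age", 24, timestamp=100)   # older write arrives late: ignored
apply_write(row, "age", 34, timestamp=200)   # retry of the first write: no change

print(row)  # {'age': (200, 34)}
```

Because the outcome depends only on the (timestamp, value) pairs and not on how often or in what order they arrive, a client can retry a failed write as many times as it likes.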
Dynamic Column Families

    Other Examples:
    – Timeline of tweets by a user
    – Timeline of tweets by all of the people a user is
      following
    – List of comments sorted by score
    – List of friends grouped by state
The Data API

    Two choices
    – RPC-based API
    – CQL
       • Cassandra Query Language
Inserting Data
 INSERT INTO users (KEY, 'name', 'age')
     VALUES ('thobbs', 'Tyler', 24);
Updating Data
 Updates are the same as inserts:
 INSERT INTO users (KEY, 'age')
     VALUES ('thobbs', 34);


 Or:
 UPDATE users SET 'age' = 34
     WHERE KEY = 'thobbs';
Fetching Data
 Whole row select:
 SELECT * FROM users WHERE KEY = 'thobbs';
Fetching Data
 Explicit column select:
 SELECT 'name', 'age' FROM users
     WHERE KEY = 'thobbs';
Fetching Data
 Get a slice of columns:
 UPDATE letters SET 1='a', 2='b', 3='c', 4='d', 5='e'
     WHERE KEY = 'key';

 SELECT 1..3 FROM letters WHERE KEY = 'key';


 Returns [(1, a), (2, b), (3, c)]
Fetching Data
 Get a slice of columns:
 SELECT FIRST 2 FROM letters WHERE KEY = 'key';


 Returns [(1, a), (2, b)]

 SELECT FIRST 2 REVERSED FROM letters
     WHERE KEY = 'key';


 Returns [(5, e), (4, d)]
Fetching Data
 Get a slice of columns:
 SELECT 3..'' FROM letters WHERE KEY = 'key';


 Returns [(3, c), (4, d), (5, e)]

 SELECT FIRST 2 REVERSED 4..'' FROM letters
     WHERE KEY = 'key';


 Returns [(4, d), (3, c)]
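
The slice queries on the last three slides follow one pattern: walk the sorted columns in one direction, keep those inside the range, and stop after a limit. The Python sketch below models just that behaviour for the five queries shown. `get_slice` and its parameters are hypothetical names, and an open-ended bound (written '' in the queries) is represented as `None`.

```python
def get_slice(row, start=None, finish=None, count=None, reverse=False):
    """Return (name, value) columns between start and finish, in walk order."""
    selected = []
    for name in sorted(row, reverse=reverse):
        if reverse:
            # Walking downwards: start is the upper bound, finish the lower.
            if start is not None and name > start:
                continue
            if finish is not None and name < finish:
                continue
        else:
            if start is not None and name < start:
                continue
            if finish is not None and name > finish:
                continue
        selected.append((name, row[name]))
    return selected if count is None else selected[:count]


letters = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}

print(get_slice(letters, start=1, finish=3))               # SELECT 1..3
print(get_slice(letters, count=2))                         # SELECT FIRST 2
print(get_slice(letters, count=2, reverse=True))           # SELECT FIRST 2 REVERSED
print(get_slice(letters, start=3))                         # SELECT 3..''
print(get_slice(letters, start=4, count=2, reverse=True))  # SELECT FIRST 2 REVERSED 4..''
```

Each call returns the same columns as the corresponding query on the slides, for example `[(4, 'd'), (3, 'c')]` for the last one.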
Deleting Data
 Delete a whole row:
 DELETE FROM users WHERE KEY = 'thobbs';

 Delete specific columns:
 DELETE 'age' FROM users
     WHERE KEY = 'thobbs';
Secondary Indexes
 Built-in basic indexes:
 CREATE INDEX ageIndex ON users (age);

 SELECT name FROM users
     WHERE age = 24 AND state = 'TX';
Performance

    Writes
    – 10k – 30k per second per node
    – Sub-millisecond latency

    Reads
    – 1k – 10k per second per node
    – Depends on data set, caching
    – Usually 0.1 to 10ms latency
Other Features

    Distributed Counters
    – Can support millions of high-volume counters

    Excellent Multi-datacenter Support
    – Disaster recovery
    – Locality

    Hadoop Integration
    – Isolation of resources
    – Hive and Pig drivers

    Compression
What Cassandra Can't Do

    Transactions
    – Unless you use a distributed lock
    – Atomicity, Isolation
    – These aren't needed as often as you'd think

    Limited support for ad-hoc queries
    – Know what you want to do with the data
Not One-size-fits-all

    Use alongside an RDBMS
    – Use the RDBMS for highly-transactional or
      highly-relational data
       • Usually a small set of data
    – Let Cassandra scale to handle the rest
Language Support

    Good:
    – Java
    – Python
    – Ruby
    – PHP
    – C#

    Coming Soon:
    – Everything else, now that we have CQL
Questions?

          Tyler Hobbs
               @tylhobbs
       tyler@datastax.com
