Lorenzo Alberton
@lorenzoalberton

NoSQL Databases:
Why, what and when
NoSQL Databases Demystified

PHP UK Conference, 25th February 2011
NoSQL: Why
Scalability, Concurrency, New trends




New Trends
[Figure: timeline 2002 - 2004 - 2006 - 2008 - 2010 - 2012]
  Big data
  Concurrency
  Connectivity
  Diversity
  P2P Knowledge
  Cloud-Grid
What’s the problem with RDBMSs?
  Caching
  Master/Slave
  Master/Master
  Cluster
  Table Partitioning
  Federated Tables
  Sharding
  Distributed DBs
  http://www.codefutures.com/database-sharding
Quick Comparison
Overview from 10,000 feet
(random impressions from the interwebs)
MongoDB is web-scale
  ...but /dev/null is even better!
Cassandra is teh schnitz
  ...but /dev/null is even better
CouchDB: Relax!
No, seriously...*

(*) Not another “Mine is bigger” comparison, please

A little theory
Fundamental Principles
of (Distributed) Databases
ACID

ATOMICITY: All or nothing
CONSISTENCY: Any transaction will take the db from one
consistent state to another, with no broken constraints
(referential integrity)

ISOLATION: Other operations cannot access data that has
been modified during a transaction that has not yet completed

DURABILITY: Ability to recover the committed transaction
updates against any kind of system failure (transaction log)
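
To make the "all or nothing" and "no broken constraints" points concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the transfer() helper and the CHECK constraint are illustrative assumptions, not part of the talk.

```python
import sqlite3

# In-memory database; the schema below is a made-up example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This update violates the CHECK constraint when funds are short.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A transfer that would overdraw the source account returns False and leaves both rows untouched, even though the credit to the destination had already been executed inside the transaction.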
Isolation Levels, Locking & MVCC
Isolation (noun)
Property that defines how/when the changes made by one operation
become visible to other concurrent operations

  SERIALIZABLE
  All transactions occur in a completely isolated fashion, as
  if they were executed serially

  REPEATABLE READ
  Multiple SELECT statements issued in the same transaction
  will always yield the same result

  READ COMMITTED
  A lock is acquired only on the rows currently read/updated

  READ UNCOMMITTED
  A transaction can access uncommitted changes made
  by other transactions
Isolation Levels, Locking & MVCC

Isolation Level     Dirty Reads   Non-repeatable reads   Phantoms
Serializable        -             -                      -
Repeatable Read     -             -                      possible
Read Committed      -             possible               possible
Read Uncommitted    possible      possible               possible

(-) = cannot occur at this isolation level
http://www.adayinthelifeof.nl/2010/12/20/innodb-isolation-levels/
Isolation Levels, Locking & MVCC

Isolation Level     Range Lock   Read Lock   Write Lock
Serializable        held         held        held
Repeatable Read     -            held        held
Read Committed      -            -           held
Read Uncommitted    -            -           -

(-) = lock not held until the end of the transaction
Multi-Version Concurrency Control
[Figure: tree with a Root node, several levels of Index nodes, and Data leaves.
 Legend: obsolete nodes / new version nodes]
  Root: atomic pointer update
  Obsolete nodes: marked for compaction
  Reads: never blocked
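
The idea in the diagram (write a new version, switch the root with an atomic pointer update, leave old versions for compaction, never block readers) can be sketched in a few lines. This is a toy illustration under stated assumptions: a plain dict stands in for the index tree, and the class and method names (MVCCStore, snapshot, write, compact) are hypothetical, not taken from any particular database.

```python
import threading


class MVCCStore:
    """Toy multi-version store: writers copy, then swap the root reference."""

    def __init__(self):
        self._root = {}                       # current committed version (never mutated in place)
        self._write_lock = threading.Lock()   # serialises writers only
        self.obsolete = []                    # old versions awaiting compaction

    def snapshot(self):
        # Readers take the current root without any lock, so they never block.
        return self._root

    def write(self, key, value):
        with self._write_lock:
            new_root = dict(self._root)       # copy-on-write: build the new version
            new_root[key] = value
            old_root = self._root
            self._root = new_root             # single reference assignment: the pointer update
            self.obsolete.append(old_root)    # old version is marked for compaction

    def compact(self):
        """Drop obsolete versions and report how many were released."""
        released = len(self.obsolete)
        self.obsolete.clear()
        return released
```

A reader that took a snapshot before a write keeps seeing the old version, while new readers see the new one; a real engine copies only the changed path of the tree rather than the whole structure.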
Distributed Transactions - 2PC
1) COMMIT REQUEST PHASE (voting phase)
   Coordinator -> Participants: Query to commit
   Participants:
     1) Exec Transaction up to the COMMIT request
     2) Write entry to undo and redo logs
   Participants -> Coordinator: Agree or Abort
Distributed Transactions - 2PC
2) COMMIT PHASE (completion phase)
a) SUCCESS (agreement from all)
   Coordinator -> Participants: Commit
   Participants:
     1) Complete operation
     2) Release locks
   Participants -> Coordinator: Acknowledge
   Coordinator: Complete transaction
Distributed Transactions - 2PC
2) COMMIT PHASE (completion phase)
b) FAILURE (abort from any)
   Coordinator -> Participants: Rollback
   Participants:
     1) Undo operation
     2) Release locks
   Participants -> Coordinator: Acknowledge
   Coordinator: Undo transaction
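
The two phases can be sketched as a small in-process simulation. This is only an illustration of the message flow: the Participant class, its will_agree flag and the two_phase_commit() function are hypothetical names, and there is no networking, timeout handling or crash recovery.

```python
class Participant:
    """A cohort member that votes in phase 1 and obeys the coordinator in phase 2."""

    def __init__(self, name, will_agree=True):
        self.name = name
        self.will_agree = will_agree   # assumption: the vote is fixed up front
        self.state = "idle"
        self.log = []

    def prepare(self):
        # Execute up to the commit point and write undo/redo log entries.
        self.log.append("undo/redo entry written")
        self.state = "prepared" if self.will_agree else "voted abort"
        return self.will_agree

    def commit(self):
        self.state = "committed"       # complete operation, release locks
        return "ack"

    def rollback(self):
        self.state = "rolled back"     # undo operation, release locks
        return "ack"


def two_phase_commit(participants):
    """Run both phases from the coordinator's point of view."""
    # Phase 1 (voting): query every participant and collect Agree/Abort votes.
    votes = [p.prepare() for p in participants]

    # Phase 2 (completion): commit only on unanimous agreement.
    if all(votes):
        acks = [p.commit() for p in participants]
        outcome = "committed"
    else:
        acks = [p.rollback() for p in participants]
        outcome = "aborted"

    # The coordinator completes (or undoes) the transaction once all have acknowledged.
    if not all(a == "ack" for a in acks):
        raise RuntimeError("missing acknowledgement")
    return outcome
```

A single abort vote is enough to roll back every participant, which is the conservative bias toward the abort case.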
Problems with 2PC
  Blocking Protocol
  Risk of indefinite cohort blocks if coordinator fails
  Conservative behaviour biased to the abort case
Paxos Algorithm (Consensus)
 Family of Fault-tolerant, distributed implementations
 Spectrum of trade-offs:
  Number of processors
  Number of message delays
  Activity level of participants
  Number of messages sent
  Types of failures
 http://www.usenix.org/event/nsdi09/tech/full_papers/yabandeh/yabandeh_html/
 http://en.wikipedia.org/wiki/Paxos_algorithm
[PSE image alert]


ACID & Distributed Systems
  ACID properties are always desirable
  But what about:
    Latency?
    Partition Tolerance?
    High Availability?
CAP Theorem (Brewer’s conjecture)
 2000 Prof. Eric Brewer, PoDC Conference Keynote
 2002 Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2)

 Of three properties of shared-data systems - data Consistency,
 system Availability and tolerance to network Partitions - only
 two can be achieved at any given moment in time.

 http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
 http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
Partition Tolerance - Availability
 “The network will be allowed to lose arbitrarily many messages
 sent from one node to another” [...]
 “For a distributed system to be continuously available, every
 request received by a non-failing node in the system must result
 in a response”                    - Gilbert and Lynch, SIGACT 2002

 CP: requests can complete at nodes that have quorum
 AP: requests can complete at any live node, possibly violating
     strong consistency

 HIGH LATENCY ≈ NETWORK PARTITION
 http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html

 http://codahale.com/you-cant-sacrifice-partition-tolerance
 http://pl.atyp.us/wordpress/?p=2521
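
The CP/AP distinction can be illustrated with a small decision function. This is a sketch under stated assumptions: "quorum" is taken to mean a strict majority of the cluster, and the function name and parameters are hypothetical.

```python
def can_complete(mode: str, reachable_nodes: int, total_nodes: int) -> bool:
    """Decide whether a node may serve a request during a network partition.

    reachable_nodes counts the nodes on this side of the partition,
    including the node that received the request.
    """
    if mode == "CP":
        # Assumption: quorum means a strict majority of all nodes.
        return reachable_nodes > total_nodes // 2
    if mode == "AP":
        # Any live node answers, possibly with stale data.
        return reachable_nodes >= 1
    raise ValueError(f"unknown mode: {mode!r}")
```

With five nodes split 3/2, a CP system keeps serving only on the three-node side, while an AP system serves on both sides and has to reconcile the divergence later.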
Consistency: Client-side view
A service that is consistent operates fully or not at all.
  Strong consistency (as in ACID)
  Weak consistency (no guarantee) - Inconsistency window
      Eventual* consistency (e.g. DNS)
         Causal consistency
         Read-your-writes consistency (the least surprise)
         Session consistency
         Monotonic read consistency
         Monotonic write consistency

(*) Temporary inconsistencies (e.g. in data constraints or replica
    versions) are accepted, but they’re resolved at the earliest
    opportunity
http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
Consistency: Server-side (Quorum)
N = number of nodes with a replica of the data
W = number of replicas that must acknowledge the update (*)
R = minimum number of replicas that must participate in a
    successful read operation
(*) but the data will be written to N nodes no matter what

W + R > N           Strong consistency (usually N=3, W=R=2)
W = N, R = 1        Optimised for reads
W = 1, R = N        Optimised for writes
                    (durability not guaranteed in presence of failures)
W + R <= N          Weak consistency
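
The quorum rule reduces to one comparison: when W + R > N, every read set overlaps every write set in at least one replica. A minimal sketch, with a hypothetical function name:

```python
def consistency(n: int, w: int, r: int) -> str:
    """Classify an (N, W, R) quorum configuration as 'strong' or 'weak'."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("W and R must each be between 1 and N")
    # W + R > N guarantees that a read quorum intersects the latest write quorum.
    return "strong" if w + r > n else "weak"


for n, w, r in [(3, 2, 2), (3, 3, 1), (3, 1, 3), (3, 1, 1)]:
    print(f"N={n} W={w} R={r}: {consistency(n, w, r)}")
```

The first three configurations are strongly consistent (balanced, read-optimised, write-optimised); the last one is weak because a read may touch only a replica that has not yet seen the write.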
Amazon Dynamo Paper
 Consistent Hashing
 Vector Clocks
 Gossip Protocol
 Hinted Handoffs
 Read Repair
http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
Modulo-based Hashing

    N1      N2      N3      N4

    partition = key % n_servers
    partition = key % (n_servers - 1)     (after removing a node)

  Recalculate the hashes for all the entries if n_servers changes
  (i.e. full data redistribution when adding/removing a node)
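
A quick way to see the cost of changing n_servers is to count how many keys land on a different partition. The sketch below uses integer keys and hypothetical helper names; it shrinks a four-node cluster to three nodes.

```python
def partition(key: int, n_servers: int) -> int:
    """Modulo-based placement, as on the slide."""
    return key % n_servers


def moved_keys(keys, old_n: int, new_n: int) -> int:
    """Count the keys whose partition changes when the server count changes."""
    return sum(1 for k in keys if partition(k, old_n) != partition(k, new_n))


keys = range(12_000)
moved = moved_keys(keys, 4, 3)
print(f"{moved} of {len(keys)} keys change partition")
```

Going from four servers to three relocates three quarters of these keys, even though only one node's share of the data actually had to find a new home.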
Consistent Hashing
[Figure: ring (key space) running from 0 to 2^160, with nodes
 A, B, C, D, E, F placed clockwise around it]
  Same hash function for data and nodes: idx = hash(key)
  Coordinator: next available clockwise node
  B: canonical home (coordinator node) for key range A-B
  C: canonical home for key range A-C when B is not available
  http://en.wikipedia.org/wiki/Consistent_hashing
Consistent Hashing
                2160    0
                                                                  only the keys in this
                                                                  range change location
                            A



     F                                       B


                Ring                                                     Same hash function
 E           (key space)                                                 for data and nodes
                                                                         idx = hash(key)

         D
                                                                         Coordinator: next
                                C       canonical home
                                      for key range A-C                  available clockwise
                                                                                 node
                       http://en.wikipedia.org/wiki/Consistent_hashing                    30
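
The lookup rule on this slide (same hash for data and nodes, coordinator is the next clockwise node) can be sketched in a few lines of Python. This is a minimal illustration, not how any particular store implements it: SHA-1 is assumed only because its 160-bit digest matches the size of the ring, and the names `ring_hash` and `HashRing` are made up for the example.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Map a string onto the ring [0, 2**160) using SHA-1 (assumed hash)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest(), "big")


class HashRing:
    """Toy consistent-hash ring: nodes and keys share one hash function."""

    def __init__(self, nodes):
        # Sorted (position, node) pairs, i.e. the nodes in clockwise order.
        self._ring = sorted((ring_hash(n), n) for n in nodes)

    def coordinator(self, key: str) -> str:
        """First node clockwise from hash(key), wrapping around to 0."""
        positions = [pos for pos, _ in self._ring]
        i = bisect.bisect_left(positions, ring_hash(key)) % len(self._ring)
        return self._ring[i][1]

    def remove(self, node: str) -> None:
        """Drop a node; its key range falls to the next clockwise node."""
        self._ring = [(p, n) for p, n in self._ring if n != node]


ring = HashRing(["A", "B", "C", "D", "E", "F"])
keys = [f"key-{i}" for i in range(1000)]

before = {k: ring.coordinator(k) for k in keys}
ring.remove("B")
after = {k: ring.coordinator(k) for k in keys}

# Keys whose coordinator changed after B left the ring.
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` was previously homed on B; keys owned by the other nodes stay where they were, which is the property the slide highlights. Note that the hashed positions of the node names will not follow the alphabetical order drawn on the slide.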
Consistent Hashing - Replication
Ring (key space) with nodes A, B, C, D, E, F placed clockwise.
Data replicated in the N-1 clockwise successor nodes.
Key_AB (a key in range A-B) hosted in B, C, D
Node C hosting Key_FA, Key_AB, Key_BC
                    http://horicky.blogspot.com/2009/11/nosql-patterns.html           31
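
The replica placement on this slide can be reproduced with a small sketch. It assumes N = 3 (which is what "hosted in B, C, D" implies) and models the ring as a plain list in clockwise order; `preference_list` and `hosted_ranges` are hypothetical helper names.

```python
RING = ["A", "B", "C", "D", "E", "F"]  # clockwise order, as on the slide
N = 3  # replication factor assumed from the slide's example


def preference_list(ring, coordinator, n):
    """Coordinator plus its N-1 clockwise successors."""
    start = ring.index(coordinator)
    return [ring[(start + i) % len(ring)] for i in range(n)]


def hosted_ranges(ring, node, n):
    """Key ranges stored on `node`, each named by its two bounding nodes."""
    i = ring.index(node)
    ranges = []
    for back in range(n - 1, -1, -1):
        owner = (i - back) % len(ring)
        # Range ending at `owner` starts at the previous node on the ring.
        ranges.append(ring[owner - 1] + ring[owner])
    return ranges


replicas_for_ab = preference_list(RING, "B", N)  # keys in range A-B
ranges_on_c = hosted_ranges(RING, "C", N)
```

With these inputs `replicas_for_ab` is `["B", "C", "D"]` and `ranges_on_c` is `["FA", "AB", "BC"]`, matching the two callouts on the slide.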
Consistent Hashing - Node Changes
Ring with nodes A, B, C, D, E, F.
Key membership and replicas are updated when a node joins or leaves the network.
The number of replicas for all data is kept consistent.
Copy operations shown on the ring: Copy Key Range AB, Copy Key Range FA, Copy Key Range EF.
                                                                  32
Consistent Hashing - Load Distribution
Different Strategies: Virtual Nodes
Random tokens per each physical node, partition by token value
Ring (key space) running from 0 to 2^160, with tokens A, B, C, D, E, F, G, H, I placed clockwise.
    Node 1: tokens A, E, G
    Node 2: tokens C, F, H
    Node 3: tokens B, D, I
                                                                             33
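
A rough sketch of the token strategy above: each physical node owns several tokens, and a key is served by whichever physical node owns the next token clockwise. The token-to-node mapping is taken from the slide; the numeric positions and the tiny ring size are invented for readability (a real ring would use random positions in the 2^160 space).

```python
import bisect

RING_SIZE = 100  # toy ring; assumption made for the example

# Tokens in clockwise order as drawn on the slide; positions are made up.
TOKENS = [(10, "A"), (20, "B"), (30, "C"), (40, "D"), (50, "E"),
          (60, "F"), (70, "G"), (80, "H"), (90, "I")]

# Token ownership exactly as listed on the slide.
OWNER = {"A": "Node 1", "E": "Node 1", "G": "Node 1",
         "C": "Node 2", "F": "Node 2", "H": "Node 2",
         "B": "Node 3", "D": "Node 3", "I": "Node 3"}


def physical_node(position: int) -> str:
    """Physical node owning the first token clockwise from `position`."""
    positions = [p for p, _ in TOKENS]
    i = bisect.bisect_left(positions, position % RING_SIZE) % len(TOKENS)
    return OWNER[TOKENS[i][1]]
```

Because each node's tokens are scattered around the ring, a node that leaves hands its ranges to several different neighbours rather than overloading a single successor.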
Consistent Hashing - Load Distribution
Different Strategies: Virtual Nodes
Q equal-sized partitions, S nodes, Q/S tokens per node (with Q >> S)
Ring (key space) running from 0 to 2^160, with partitions assigned to Node 1, Node 2, Node 3, Node 4, ...
                                                 34
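
The fixed-partition strategy can be sketched as follows. The values Q = 64 and S = 4, the round-robin assignment and the use of SHA-1 are all assumptions chosen for the example; the slide only states that the ring is cut into Q equal partitions with Q/S per node.

```python
import hashlib

RING_SIZE = 2 ** 160
Q = 64  # number of equal-sized partitions (hypothetical value)
NODES = ["Node 1", "Node 2", "Node 3", "Node 4"]  # S = 4 (hypothetical)


def partition_of(key: str) -> int:
    """Index of the equal-sized partition that hash(key) falls into."""
    h = int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")
    return h * Q // RING_SIZE


# Round-robin assignment gives every node exactly Q/S partitions.
PARTITION_OWNER = {p: NODES[p % len(NODES)] for p in range(Q)}


def node_for(key: str) -> str:
    """Physical node responsible for `key`."""
    return PARTITION_OWNER[partition_of(key)]


partitions_per_node = {
    n: sum(1 for owner in PARTITION_OWNER.values() if owner == n) for n in NODES
}
```

Since partition boundaries never move, adding or removing a node only reassigns whole partitions, which keeps the bookkeeping simple.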
Vector Clocks & Conflict Detection
Causality-based partial order over events that happen in the system.
Document version history: a counter for each node that updated the document.
If every update counter in V1 is smaller than or equal to the corresponding counter in V2, then V1 precedes V2.

Example with nodes A, B, C:
    write handled by A                          -> D1 ([A, 1])
    write handled by A                          -> D2 ([A, 2])
    write handled by B                          -> D3 ([A, 2], [B, 1])
    write handled by C                          -> D4 ([A, 2], [C, 1])
    conflict detected, reconciliation handled by A -> D5 ([A, 3], [B, 1], [C, 1])
     http://en.wikipedia.org/wiki/Vector_clock                    http://pl.atyp.us/wordpress/?p=2601              35
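
The ordering rule on this slide translates almost directly into code. The sketch below represents a vector clock as a dict from node name to counter, treating a missing node as counter 0; the function names `descends` and `compare` are illustrative, not taken from any library.

```python
def descends(later: dict, earlier: dict) -> bool:
    """True if every counter in `earlier` is <= its counterpart in `later`."""
    return all(count <= later.get(node, 0) for node, count in earlier.items())


def compare(v1: dict, v2: dict) -> str:
    """Classify two versions as ordered, equal, or conflicting."""
    if v1 == v2:
        return "equal"
    if descends(v2, v1):
        return "v1 precedes v2"
    if descends(v1, v2):
        return "v2 precedes v1"
    # Neither clock dominates the other: the writes were concurrent.
    return "conflict"


# Versions from the slide.
D1 = {"A": 1}
D2 = {"A": 2}
D3 = {"A": 2, "B": 1}
D4 = {"A": 2, "C": 1}
```

Here `compare(D2, D3)` reports that D2 precedes D3, while `compare(D3, D4)` reports a conflict, because B's counter is ahead in D3 and C's counter is ahead in D4.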
Vector Clocks & Conflict Detection
Vector Clocks can detect a conflict. The conflict resolution is left to the application or the user.
The application might resolve conflicts by checking relative timestamps, or with other strategies (like merging the changes).
Vector clocks can grow quite large (!)

Example with nodes A, B, C:
    write handled by A                          -> D1 ([A, 1])
    write handled by A                          -> D2 ([A, 2])
    write handled by B                          -> D3 ([A, 2], [B, 1])
    un-modified replica                         -> D4 ([A, 2])
    version mismatch detected; D3 ⊇ D4, conflict resolved automatically -> D5 ([A, 3], [B, 1])
     http://en.wikipedia.org/wiki/Vector_clock                    http://pl.atyp.us/wordpress/?p=2601           36
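
One way to arrive at the D5 clocks shown in these examples is to take the per-node maximum of the two clocks and then bump the counter of the node that handles the reconciliation. This is a sketch under that assumption; the slides do not spell out the merge rule, and how the document bodies themselves are merged is left to the application.

```python
def merge_clocks(v1: dict, v2: dict) -> dict:
    """Per-node maximum of two vector clocks (missing node counts as 0)."""
    return {n: max(v1.get(n, 0), v2.get(n, 0)) for n in v1.keys() | v2.keys()}


def reconcile(v1: dict, v2: dict, coordinator: str) -> dict:
    """Clock for the reconciled version written by `coordinator`."""
    merged = merge_clocks(v1, v2)
    merged[coordinator] = merged.get(coordinator, 0) + 1
    return merged


# D3 dominates the un-modified replica D4, so no application logic is needed.
d5_auto = reconcile({"A": 2, "B": 1}, {"A": 2}, "A")

# Concurrent writes on B and C: the clock merges the same way, but the
# application still has to decide how to merge the document contents.
d5_conflict = reconcile({"A": 2, "B": 1}, {"A": 2, "C": 1}, "A")
```

The results are `{"A": 3, "B": 1}` and `{"A": 3, "B": 1, "C": 1}`, the same counters as the D5 versions in the two walkthroughs. The second case also shows why clocks grow: every node that ever coordinates a write adds an entry.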
Gossip Protocol + Hinted Handoff
Ring with nodes A, B, C, D, E, F.
Gossip: periodic, pairwise, inter-process interactions of bounded size among randomly-chosen peers.

Gossip exchange between nodes:
    "I can’t see B, it might be down but I need some ACK. My Merkle Tree root for range XY is “ab031dab4a385afda”"
    "I can’t see B either. My Merkle Tree root for range XY is different!"
    "B must be down then. Let’s disable it."

Hinted handoff:
    "My canonical node is supposed to be B."
    "I see. Well, I’ll take care of it for now, and let B know when B is available again"
                                                                    37
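
The hinted-handoff dialogue can be modelled with a very small sketch: when the canonical node is down, another node stores the write together with a hint naming the intended owner, and replays it once that owner is back. The `Node` class and both functions are hypothetical and ignore networking, gossip-based failure detection and replication.

```python
class Node:
    """Toy in-memory node."""

    def __init__(self, name: str):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (intended_node, key, value) kept for later replay


def write(key, value, canonical: Node, fallback: Node) -> Node:
    """Store on the canonical node, or on the fallback with a hint."""
    if canonical.up:
        canonical.data[key] = value
        return canonical
    fallback.data[key] = value
    fallback.hints.append((canonical, key, value))
    return fallback


def replay_hints(node: Node) -> None:
    """Hand hinted writes back to owners that are available again."""
    remaining = []
    for target, key, value in node.hints:
        if target.up:
            target.data[key] = value
        else:
            remaining.append((target, key, value))
    node.hints = remaining


b, c = Node("B"), Node("C")
b.up = False
handled_by = write("k", "v1", canonical=b, fallback=c)  # C takes care of it
b.up = True
replay_hints(c)  # C lets B know once B is available again
```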
Merkle Trees (Hash Trees)
Leaves: hashes of data blocks.
Nodes: hashes of their children.
Used to detect inconsistencies between replicas (anti-entropy) and to minimise the amount of transferred data.

ROOT = hash(A, B)
    A = hash(C, D)
        C = hash(001)    (Data Block 001)
        D = hash(002)    (Data Block 002)
    B = hash(E, F)
        E = hash(003)    (Data Block 003)
        F = hash(004)    (Data Block 004)
                                    http://en.wikipedia.org/wiki/Hash_tree                        38
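
As a rough illustration of the anti-entropy use, the sketch below builds the tree bottom-up and then compares two replicas top-down, descending only into subtrees whose hashes differ. SHA-256 and the function names are assumptions for the example, and the number of blocks is assumed to be a power of two to keep the code short.

```python
import hashlib


def h(data: bytes) -> bytes:
    """Hash function for leaves and inner nodes (SHA-256, an assumption)."""
    return hashlib.sha256(data).digest()


def merkle_levels(blocks):
    """Levels of the tree, leaves first and root last.

    Assumes len(blocks) is a power of two.
    """
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        # Each parent is the hash of its two children concatenated.
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_blocks(blocks_a, blocks_b):
    """Indexes of data blocks that differ between two replicas."""
    la, lb = merkle_levels(blocks_a), merkle_levels(blocks_b)
    suspects = [0]  # start from the single root node
    for depth in range(len(la) - 1, 0, -1):
        # Descend only into children of nodes whose hashes differ.
        suspects = [child
                    for i in suspects if la[depth][i] != lb[depth][i]
                    for child in (2 * i, 2 * i + 1)]
    return [i for i in suspects if la[0][i] != lb[0][i]]


replica_a = [b"001", b"002", b"003", b"004"]
replica_b = [b"001", b"002", b"00X", b"004"]
out_of_sync = differing_blocks(replica_a, replica_b)
```

Here `out_of_sync` is `[2]`: the roots differ, subtree A matches and is skipped, and only the third block needs to be transferred. In a real exchange the two replicas would swap hashes level by level instead of both trees being built in one process.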
Read Repair
Ring with nodes A, B, C, D, E, F.
A client issues GET(k, R=2).
Replica responses: k=XYZ (v.2), k=XYZ (v.2), k=ABC (v.1)
k=XYZ (v.2) is returned to the client.
UPDATE(k = XYZ) is sent to the replica that still holds the older version.
                                   39
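
A simplified sketch of the read-repair flow: read from the replicas, return the newest version, and push that version to any replica that is behind. Plain integer versions stand in for vector clocks here, replicas are modelled as dicts, and `read_repair_get` is a made-up name; a real store would return as soon as R replicas have answered and repair in the background.

```python
def read_repair_get(replicas, key, r=2):
    """Return the newest value for `key` and repair stale replicas.

    Each replica is a dict mapping key -> (version, value).
    """
    found = [rep[key] for rep in replicas if key in rep]
    if len(found) < r:
        raise RuntimeError("fewer than R replicas answered")
    newest = max(found, key=lambda item: item[0])
    for rep in replicas:
        if rep.get(key) != newest:
            rep[key] = newest  # UPDATE sent to a stale or missing replica
    return newest[1]


# State from the slide: two replicas at v.2, one still at v.1.
replica_1 = {"k": (2, "XYZ")}
replica_2 = {"k": (2, "XYZ")}
replica_3 = {"k": (1, "ABC")}

value = read_repair_get([replica_1, replica_2, replica_3], "k", r=2)
```

After the call `value` is `"XYZ"` and `replica_3` holds `(2, "XYZ")`, so the next read sees three consistent replicas.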
NoSQL Break-down
      Key-value stores, Column Families,
 Document-oriented dbs, Graph databases




                                           40
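Focus Of Different Data Models
Chart: Size (vertical axis) against Complexity (horizontal axis).
From largest size / lowest complexity to smallest size / highest complexity:
    Key-Value Stores, Column Families, Document Databases, Graph Databases
       http://www.slideshare.net/emileifrem/nosql-east-a-nosql-overview-and-the-benefits-of-graph-databases   41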
1) Key-value stores
                 Amazon Dynamo Paper
 Data model: collection of key-value pairs




                                             42
Voldemort                                AP
Dynamo DHT implementation
Consistent hashing, Vector clocks
    LICENSE:       Apache 2
    LANGUAGE:      Java
    API/PROTOCOL:  HTTP, Java, Thrift, Avro, Protobuf
    PERSISTENCE:   Pluggable (BDB/MySQL)
    CONCURRENCY:   MVCC
HTTP / Sockets
Conflicts resolved at read and write time
Json, Java String, byte[], Thrift, Avro, ProtoBuf
Simple optimistic locking for multi-row updates, pluggable storage engine
                                             43
Membase                                                                                 CP

DHT (K-V), no SPoF

membase:   persistence, replication (fail-over HA), rebalancing
memcached: distributed, in-memory

“VBuckets”
  Unit of consistency and replication
  Owner of a subset of the cluster key space
  Key lookup: hash function + table lookup

All metadata kept in memory (high throughput / low latency).
Manual/Programmatic failover via the Management REST API.

LICENSE: Apache 2
LANGUAGE: C/C++, Erlang
API/PROTOCOL: REST/JSON, memcached
                   http://dustin.github.com/2010/06/29/memcached-vbuckets.html             44
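
The "hash function + table lookup" step can be sketched as follows. This is an illustration only: the vBucket count, the server names and the use of CRC32 are assumptions, not Membase's actual configuration or client API.

```python
import zlib

NUM_VBUCKETS = 8                                  # hypothetical, kept small for readability
SERVERS = ["server_a", "server_b", "server_c"]    # hypothetical node names

# Table lookup: vBucket id -> server currently owning that vBucket.
vbucket_map = [SERVERS[i % len(SERVERS)] for i in range(NUM_VBUCKETS)]


def vbucket_for(key):
    """Hash function: the key alone decides the vBucket, never the server list."""
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS


def server_for(key):
    """Resolve a key to a server through the vBucket table."""
    return vbucket_map[vbucket_for(key)]


def move_vbucket(vbucket_id, new_server):
    """Rebalancing or failover only rewrites one table entry."""
    vbucket_map[vbucket_id] = new_server
```

Because keys map to vBuckets and only vBuckets map to servers, moving a vBucket changes where its keys live without rehashing any key.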
Riak                                                  AP

                                                  LICENSE



                                                 Apache 2
                                                  LANGUAGE

                                                C, Erlang
                                                API/PROTOCOL

                                                REST HTTP
                                                     *
                                                 ProtoBuf




                   Buckets → K-V
                 “Links” (~relations)
              Targeted JS Map/Reduce
       Tune-able consistency (one-quorum-all)
                                                         45
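
The tunable consistency levels (one, quorum, all) can be read as a number of replica responses to wait for. The sketch below assumes N replicas per key and uses hypothetical function names; it is not Riak's client API.

```python
def required_responses(level, n):
    """Translate a consistency level into a replica count for N replicas."""
    if level == "one":
        return 1
    if level == "quorum":
        return n // 2 + 1   # strict majority
    if level == "all":
        return n
    raise ValueError(f"unknown consistency level: {level}")


def read_sees_latest_write(r, w, n):
    """Read and write replica sets must overlap when R + W > N."""
    return r + w > n


n = 3
r = required_responses("quorum", n)
w = required_responses("quorum", n)
```

With N = 3, quorum reads and quorum writes overlap in at least one replica, while reading and writing at level "one" does not guarantee that.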
Redis                                                                     CP

K-V store “Data Structures Server”
Map, Set, Sorted Set, Linked List
Set/Queue operations, Counters, Pub-Sub, Volatile keys

10-100K op/s (whole dataset in RAM + VM)
Persistence via snapshotting (tunable fsync freq.)
Distributed if client supports consistent hashing

LICENSE: BSD
LANGUAGE: ANSI C
API: *
PROTOCOL: Telnet-like
PERSISTENCE: in memory, bg snapshots
REPLICATION: master-slave
                   http://redis.io/presentation/Redis_Cluster.pdf             46
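
Client-side consistent hashing, as mentioned on this slide, might look like the sketch below. The class, the node addresses and the choice of SHA-256 with 100 virtual points per node are assumptions for illustration, not part of any Redis client library.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual points per node."""

    def __init__(self, nodes, points_per_node=100):
        ring = []
        for node in nodes:
            for i in range(points_per_node):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._points = [point for point, _ in ring]
        self._nodes = [node for _, node in ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        """Walk clockwise from the key's hash to the next point on the ring."""
        index = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._nodes[index]


nodes = ["redis-1:6379", "redis-2:6379", "redis-3:6379"]  # hypothetical addresses
ring = HashRing(nodes)
smaller_ring = HashRing(nodes[:2])                        # the third node removed
```

Removing a node only remaps the keys that node owned; every other key keeps resolving to the same server.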
2) Column Families
                 Google BigTable paper
   Data model: big table, column families




                                            47
Google BigTable Paper
Sparse, distributed, persistent multi-dimensional sorted map
indexed by (row_key, column_key, timestamp)




 CF:col_name     “contents:html”                    “anchor:cnnsi.com”             “anchor:my.look.ca”



                         <html>...             t3
                        <html>...           t5
 “com.cnn.www”         <html>...          t6              “CNN”               t9       “CNN.com”     t8
                       column                             column                         column
row_key          row                                                 column family

Atomic updates                                      Automatic GC                               ACL
                          http://labs.google.com/papers/bigtable-osdi06.pdf                               48
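
A toy version of the map indexed by (row_key, column_key, timestamp) is sketched below, using the slide's own example row. The class and method names are hypothetical, and the HTML values are placeholders standing in for the page contents.

```python
class SparseTable:
    """Toy (row_key, column_key, timestamp) -> value map; absent cells use no space."""

    def __init__(self):
        self._cells = {}  # (row_key, column_key) -> {timestamp: value}

    def put(self, row_key, column_key, timestamp, value):
        self._cells.setdefault((row_key, column_key), {})[timestamp] = value

    def get(self, row_key, column_key, timestamp=None):
        """Return the newest version, or the newest one at or before `timestamp`."""
        versions = self._cells.get((row_key, column_key), {})
        eligible = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(eligible)] if eligible else None

    def columns(self, row_key):
        """Column keys present in a row, in sorted order."""
        return sorted(col for (row, col) in self._cells if row == row_key)


table = SparseTable()
for ts in (3, 5, 6):
    table.put("com.cnn.www", "contents:html", ts, f"<html>...@t{ts}")  # placeholder values
table.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
table.put("com.cnn.www", "anchor:my.look.ca", 8, "CNN.com")
```

Column keys follow the `family:qualifier` form from the slide, so all `anchor:` columns sort next to each other within the row.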
Google BigTable: Data Structure

SSTable
  Smallest building block
  Persistent immutable Map[k,v]
  Operations: lookup by key / key range scan

Tablet
  Dynamically partitioned range of rows
  Built from multiple SSTables
  Unit of distribution and load balancing

Table
  Multiple Tablets (table segments) make up a table

Table
  Tablet (range Aaa → Bar)
    SSTable: 64KB block | 64KB block | 64KB block | lookup index
    SSTable: 64KB block | 64KB block | 64KB block | lookup index
                                                                       49
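
The SSTable operations (lookup by key, key range scan) can be sketched with an in-memory stand-in. Assumptions: blocks hold a fixed number of entries instead of 64KB of data, and the lookup index stores the first key of each block; the class is hypothetical.

```python
import bisect


class SSTable:
    """Immutable sorted map split into blocks, with an index of first keys."""

    def __init__(self, items, entries_per_block=2):
        rows = sorted(items.items())
        self._blocks = [rows[i:i + entries_per_block]
                        for i in range(0, len(rows), entries_per_block)]
        self._index = [block[0][0] for block in self._blocks]

    def get(self, key):
        """Use the index to pick the one block that could hold `key`."""
        i = bisect.bisect_right(self._index, key) - 1
        if i < 0:
            return None
        for k, v in self._blocks[i]:
            if k == key:
                return v
        return None

    def scan(self, start, end):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        first = max(bisect.bisect_right(self._index, start) - 1, 0)
        for block in self._blocks[first:]:
            for k, v in block:
                if k >= end:
                    return
                if k >= start:
                    yield k, v


sstable = SSTable({"aaa": 1, "abc": 2, "bar": 3, "baz": 4, "cat": 5})
```

Only the index has to stay in memory; a lookup then touches a single block.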
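Google BigTable: I/O

[Diagram]
In memory: memtable
On GFS: tablet log, SSTables (compression: BMDiff, Zippy)
write: tablet log and memtable
read: memtable and SSTables
minor compaction: memtable → SSTable
merging / major compaction (GC): SSTables → SSTable
                                                             50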
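
The write and read paths from this slide can be sketched in memory. Assumptions: lists and dicts stand in for the tablet log and the SSTables on GFS, a minor compaction triggers after a fixed number of memtable entries, and all names are hypothetical.

```python
TOMBSTONE = object()  # marker written in place of a deleted value


class Tablet:
    def __init__(self, memtable_limit=2):
        self.log = []          # stand-in for the tablet log
        self.memtable = {}     # recent writes, held in memory
        self.sstables = []     # immutable snapshots, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.log.append((key, value))      # log first, for recovery
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def delete(self, key):
        self.write(key, TOMBSTONE)

    def minor_compaction(self):
        """Freeze the memtable into a new sorted SSTable."""
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        """Newest source wins: memtable first, then SSTables newest to oldest."""
        for source in [self.memtable] + self.sstables[::-1]:
            if key in source:
                value = source[key]
                return None if value is TOMBSTONE else value
        return None

    def major_compaction(self):
        """Merge all SSTables into one and drop deleted entries (GC)."""
        merged = {}
        for sstable in self.sstables:
            merged.update(sstable)
        live = {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}
        self.sstables = [live] if live else []


tablet = Tablet(memtable_limit=2)
tablet.write("a", 1)
tablet.write("b", 2)     # memtable full: minor compaction
tablet.write("a", 3)
tablet.delete("b")       # memtable full again: second SSTable
```

Reads get slower as SSTables pile up, which is why the merging and major compactions exist.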
Google BigTable: Location Dereferencing

Master File (Chubby) → Root Tablet → Metadata Tablets → User Tables

Chubby
  Replicated, persisted lock service; maintains tablet server locations
  5 replicas, one elected master (via quorum)
  Paxos algorithm used to keep consistency

Root Tablet
  Root of the metadata tree

Metadata Tablets
  Up to 3 levels in the metadata hierarchy
                                                                            51
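
The lookup chain (Chubby file, root tablet, metadata tablet, user tablet) can be sketched as nested range lookups. Assumptions: each tablet is identified by the end key of its row range, and the tablet names, server names and key ranges below are invented for the example.

```python
import bisect

LAST = "\uffff"  # sentinel end key for the final range


def find(ranges, row_key):
    """ranges: sorted (end_key, target) pairs; pick the first range covering row_key."""
    end_keys = [end for end, _ in ranges]
    i = bisect.bisect_left(end_keys, row_key)
    if i == len(ranges):
        raise KeyError(row_key)
    return ranges[i][1]


chubby_file = "root_tablet"   # the Chubby file only stores the root tablet's location

tablets = {
    "root_tablet": [("m", "meta_1"), (LAST, "meta_2")],
    "meta_1": [("bar", "tabletserver_a"), ("m", "tabletserver_b")],
    "meta_2": [("t", "tabletserver_c"), (LAST, "tabletserver_a")],
}


def locate(row_key):
    """Follow the hierarchy down to the server holding the user tablet."""
    root = tablets[chubby_file]
    metadata_tablet = tablets[find(root, row_key)]
    return find(metadata_tablet, row_key)
```

A real client would cache the result of `locate`, so most requests skip the hierarchy entirely.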
Google BigTable: Architecture

[Diagram]
BigTable client → BigTable master: metadata operations
BigTable client → Tablet Servers: data R/W operations
BigTable master: fs metadata, ACL, GC, load balancing
BigTable master → Tablet Servers: heartbeat messages, GC, chunk migration
Chubby: track master lock, log of live servers
Tablet Servers hold the Tablets
                                                                                    52
HBase                                  CP

OSS implementation of BigTable

ZooKeeper as coordinator (instead of Chubby)
Support for multiple masters
HDFS, S3, S3N, EBS (with GZip/LZO CF compression)
Data sorted by key but evenly distributed across the cluster
Batch Streaming, Map/Reduce

LICENSE: Apache 2
LANGUAGE: Java
API/PROTOCOL: REST HTTP, Thrift
PERSISTENCE: memtable/SSTable
                                           53
Hypertable                              CP

OSS BigTable implementation
Faster than HBase (10-30K/s)

Hyperspace (paxos) used instead of ZooKeeper
Dynamically adapts to changes in workload
HQL (~SQL)

LICENSE: GPLv2
LANGUAGE: C++
API/PROTOCOL: C++, Thrift
PERSISTENCE: memtable/SSTable
CONCURRENCY: MVCC
                                            54
Cassandra                                                                                                       AP

Data model of BigTable, infrastructure of Dynamo

Column:              col_name, col_value, timestamp
Super Column:        super_column_name → { Column, ... }
Column Family:       row_key → { Column, ... }
Super Column Family: row_key → { Super Column, ... }

keyspace.get("column_family", key, ["super_column",] "column")

LICENSE: Apache 2
LANGUAGE: Java
PROTOCOL: Thrift, Avro
PERSISTENCE: memtable/SSTable
CONSISTENCY: Tunable R/W/N
http://www.javageneration.com/?p=70          @cassandralondon    http://www.meetup.com/Cassandra-London/             55
Cassandra                                                                                                           AP

                                                                                                                 LICENSE
      Data model of BigTable, infrastructure of Dynamo
                                                       Super Column Family                                     Apache 2
                                                                                                                LANGUAGE

                                   super_column_name                       super_column_name                      Java
                                                                                                                PROTOCOL
                                                             B
                       col_name                 col_name               col_name             col_name            Thrift
row_key
                                          ...                  ...                    ...                        Avro
                       col_value                col_value              col_value            col_value          PERSISTENCE
                           timestamp            timestamp               timestamp            timestamp
                                                                                                               memtable/
                                                                                                                SSTable
                                                                                                               CONSISTENCY
keyspace.get("column_family",
key,
["super_column",]
"column")
                                                                                                                Tunable
                  B                                                                                              R/W/N
          A
                           C
               P2P
      F       Gossip           x


                       D
              E


http://www.javageneration.com/?p=70              @cassandralondon    http://www.meetup.com/Cassandra-London/             55
Cassandra                                                                                                           AP

                                                                                                                 LICENSE
      Data model of BigTable, infrastructure of Dynamo
                                                       Super Column Family                                     Apache 2
                                                                                                                LANGUAGE

                                   super_column_name                       super_column_name                      Java
                                                                                                                PROTOCOL
                                                             B
                       col_name                 col_name               col_name             col_name            Thrift
row_key
                                          ...                  ...                    ...                        Avro
                       col_value                col_value              col_value            col_value          PERSISTENCE
                           timestamp            timestamp               timestamp            timestamp
                                                                                                               memtable/
                                                                                                                SSTable
                                                                                                               CONSISTENCY
keyspace.get("column_family",
key,
["super_column",]
"column")
                                                                                                                Tunable
                  B                                                                                              R/W/N
          A
                           C              ALL
               P2P
                                         ONE
      F       Gossip           x
                                        QUORUM
                       D
              E


http://www.javageneration.com/?p=70              @cassandralondon    http://www.meetup.com/Cassandra-London/             55
Cassandra                                                                                                           AP

Data model of BigTable, infrastructure of Dynamo

LICENSE: Apache 2
LANGUAGE: Java
PROTOCOL: Thrift, Avro
PERSISTENCE: memtable/SSTable
CONSISTENCY: Tunable R/W/N

Super Column Family:
row_key -> super_column_name -> col_name -> (col_value, timestamp)

keyspace.get("column_family", key, ["super_column",] "column")

[Diagram: nodes A, B, C, D, E, F arranged around "P2P Gossip"]
Consistency levels: ALL, ONE, QUORUM
Partitioners: RandomPartitioner (MD5), OrderPreservingPartitioner
Range Scans, Fulltext Index (Solandra)

http://www.javageneration.com/?p=70    @cassandralondon    http://www.meetup.com/Cassandra-London/    55
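
A minimal sketch of the super column family layout, using plain Python dictionaries rather than a real Cassandra client. The `get` helper only mirrors the shape of the `keyspace.get(...)` pseudo-call on the slide; the keyspace contents, the `user_activity` name and the helper itself are illustrative assumptions.

```python
from typing import Optional

# Hypothetical in-memory keyspace (not a real client):
# column_family -> row_key -> super_column_name -> col_name -> (col_value, timestamp)
keyspace = {
    "user_activity": {
        "alice": {
            "2011-02-25": {
                "page": ("/talks/nosql", 1298628000),
                "agent": ("Firefox", 1298628000),
            },
        },
    },
}


def get(column_family: str, key: str, column: str,
        super_column: Optional[str] = None):
    """Return the (col_value, timestamp) pair stored for one column.

    super_column is optional, like the bracketed argument on the slide;
    when omitted, the row is assumed to hold plain columns directly.
    """
    row = keyspace[column_family][key]
    columns = row[super_column] if super_column is not None else row
    return columns[column]


value, timestamp = get("user_activity", "alice", "page",
                       super_column="2011-02-25")
print(value, timestamp)
```

Every column carries its own timestamp, which is why the lookup returns a pair and not a bare value.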
3) Document DBs
                            Lotus Notes
Data model: collection of K-V collections




                                            56
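
To make "collection of K-V collections" concrete, here is a small Python sketch: a collection maps document ids to documents, and each document is itself a set of key-value pairs that may nest. The collection name, ids and fields are invented for illustration.

```python
import json

# A collection is a key-value map whose values (documents) are
# themselves key-value collections, possibly nested.
talks = {
    "talk-1": {
        "title": "NoSQL Databases: Why, what and when",
        "speaker": {"name": "Lorenzo Alberton", "twitter": "@lorenzoalberton"},
        "tags": ["nosql", "php"],
    },
    "talk-2": {
        # Documents in the same collection need not share a schema.
        "title": "Another talk",
        "year": 2011,
    },
}

# Documents of this shape serialise naturally to JSON.
print(json.dumps(talks["talk-1"], indent=2))
```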
CouchDB                                                                                            AP

LICENSE: Apache 2
LANGUAGE: Erlang
API/PROTOCOL: REST/JSON
PERSISTENCE: Append-only B+Tree
CONCURRENCY: MVCC
CONSISTENCY: crash-only design
REPLICATION: multi-master

JSON docs
map-reduce “views” (materialised resultset)
Storage + View Indexes (B+Tree) [by_id_index, by_seqnum_index]
Replication used as a way to scale transaction volume
Conflict Resolution at application level
MVCC (copy-on-modify), Volatile Versioning
Online Compaction (very primitive VACUUM)
Update validation / Auth triggers
Delayed commits (write performance)

http://horicky.blogspot.com/2008/10/couchdb-implementation.html                  57
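
The idea of a map-reduce view as a materialised result set can be sketched in a few lines of Python. This is only a simulation under stated assumptions: CouchDB view functions are normally written in JavaScript and the index is maintained incrementally, whereas the sample documents and the `map_doc`, `reduce_values` and `build_view` helpers below are hypothetical and rebuild the view from scratch.

```python
from collections import defaultdict

# Hypothetical sample documents.
docs = [
    {"_id": "1", "type": "post", "author": "alice", "comments": 3},
    {"_id": "2", "type": "post", "author": "bob", "comments": 1},
    {"_id": "3", "type": "post", "author": "alice", "comments": 4},
    {"_id": "4", "type": "user", "name": "carol"},
]


def map_doc(doc):
    """Map step: emit zero or more (key, value) pairs for one document."""
    if doc.get("type") == "post":
        yield doc["author"], doc["comments"]


def reduce_values(values):
    """Reduce step: combine all values emitted for one key."""
    return sum(values)


def build_view(documents):
    """Materialise the view: map every doc, group by key, then reduce."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    # Rows are kept in key order, much as a B+Tree index would keep them.
    return {key: reduce_values(grouped[key]) for key in sorted(grouped)}


view = build_view(docs)
print(view)  # {'alice': 7, 'bob': 1}
```

Queries then read the stored rows by key or key range instead of scanning the documents again.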
CouchDB                                                                                    AP

MVCC consequences: compaction load, disk space

http://enda.squarespace.com/tech/2009/12/8/couchdb-compaction-big-impacts.html
http://chesnok.com/talks/mvcc_couchcamp.pdf (PgSQL VACUUM)
http://horicky.blogspot.com/2008/10/couchdb-implementation.html             58
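
Why append-only MVCC costs disk space and compaction work can be shown with a toy store. The `AppendOnlyStore` class below is a hypothetical illustration in Python, not CouchDB's storage engine: each update appends a new revision, and old revisions are only reclaimed when `compact()` copies the live ones into a fresh log.

```python
class AppendOnlyStore:
    """Toy append-only store: updates never overwrite, they append."""

    def __init__(self):
        self.log = []      # every revision ever written, in write order
        self.latest = {}   # doc id -> index in self.log of the live revision

    def put(self, doc_id, body):
        self.log.append((doc_id, body))
        self.latest[doc_id] = len(self.log) - 1

    def get(self, doc_id):
        return self.log[self.latest[doc_id]][1]

    def compact(self):
        """Copy only the live revisions into a fresh log, dropping the rest."""
        live = sorted(self.latest.values())
        new_log = [self.log[i] for i in live]
        self.latest = {doc_id: i for i, (doc_id, _) in enumerate(new_log)}
        self.log = new_log


store = AppendOnlyStore()
for n in range(1000):
    store.put("counter", {"value": n})   # 1000 updates to a single document
store.put("other", {"value": "x"})

print(len(store.log))   # 1001 revisions held before compaction
store.compact()
print(len(store.log))   # 2 revisions left, one per live document
```

A frequently updated document dominates the log until compaction runs, and compaction itself has to rewrite all live data.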
MongoDB                                                                      CP

LICENSE: AGPLv3
LANGUAGE: C++
API/PROTOCOL: REST/BSON *
PERSISTENCE: B+Tree, Snapshots
CONCURRENCY: In-place updates
REPLICATION: master-slave, replica sets

BSON serialisation (storage & transfer): bson_encode(), bson_decode()
Auto-Sharding, Master-Slave, Auto-Failover
B-Tree Indexes (on different cols too)
Update in place (no versioning, no append-only log) [diagram labels: v.1, v.2]
Geo-Spatial Indexes

                                                                                 59
MongoDB                                                                      CP

                                                                         LICENSE
   bson_encode()
   bson_decode()
                                                                         AGPLv3
                         Auto-Sharding,                                  LANGUAGE
BSON serialisation                              B-Tree Indexes
                          Master-Slave,
(storage & transfer)                         (on different cols too)       C++
                         Auto-Failover
                                                                       API/PROTOCOL

                                                                       REST/BSON
        v.1                                                                *
        v.2
                                                                       PERSISTENCE

Update in place                                  Persistence via        B+Tree,
                       Geo-Spatial Indexes                             Snapshots
(no versioning, no                                Replication +
                                                                       CONCURRENCY
append-only log)                                  Snapshotting
                                                                        In-place
                                                                         updates
                                                                       REPLICATION

                                                                       master-slave
                                                                       replica sets


                                                                                 59
MongoDB                                                                      CP

                                                                         LICENSE
   bson_encode()
   bson_decode()
                                                                         AGPLv3
                         Auto-Sharding,                                  LANGUAGE
BSON serialisation                              B-Tree Indexes
                          Master-Slave,
(storage & transfer)                         (on different cols too)       C++
                         Auto-Failover
                                                                       API/PROTOCOL

                                                                       REST/BSON
         v.1                                                               *
         v.2
                                                                       PERSISTENCE

Update in place                                  Persistence via        B+Tree,
                       Geo-Spatial Indexes                             Snapshots
(no versioning, no                                Replication +
                                                                       CONCURRENCY
append-only log)                                  Snapshotting
                                                                        In-place
                                                                         updates
     GROUP BY                                                          REPLICATION

                                                                       master-slave
   Map/Reduce                                                          replica sets
 (well, aggregation)
                                                                                 59
MongoDB                                                                         CP

                                                                            LICENSE
   bson_encode()
   bson_decode()
                                                                            AGPLv3
                         Auto-Sharding,                                     LANGUAGE
BSON serialisation                                 B-Tree Indexes
                          Master-Slave,
(storage & transfer)                            (on different cols too)       C++
                         Auto-Failover
                                                                          API/PROTOCOL

                                                                          REST/BSON
         v.1                                                                  *
         v.2
                                                                          PERSISTENCE

Update in place                                     Persistence via        B+Tree,
                       Geo-Spatial Indexes                                Snapshots
(no versioning, no                                   Replication +
                                                                          CONCURRENCY
append-only log)                                     Snapshotting
                                                                           In-place
                                                                            updates
     GROUP BY                                                             REPLICATION

                                                                          master-slave
   Map/Reduce
                       No ACK on Updates                                  replica sets
 (well, aggregation)
                       (or ensure N replicas)
                                                                                    59
MongoDB                                                                  CP

   LICENSE        AGPLv3
   LANGUAGE       C++
   API/PROTOCOL   REST/BSON *
   PERSISTENCE    B+Tree, Snapshots
   CONCURRENCY    In-place updates
   REPLICATION    master-slave, replica sets

   BSON serialisation (storage & transfer): bson_encode(), bson_decode()
   Auto-Sharding, Master-Slave, Auto-Failover
   B-Tree Indexes (on different cols too)
   Geo-Spatial Indexes
   Update in place (no versioning, no append-only log) [diagram: v.1, v.2]
   Persistence via Replication + Snapshotting (New! Improved! --dur flag)
   GROUP BY
   Map/Reduce (well, aggregation)
   No ACK on Updates (or ensure N replicas)
                                                                         59
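The "No ACK on Updates (or ensure N replicas)" point can be pictured with a small in-memory simulation. This is only a sketch under stated assumptions: the `ReplicaSetSketch` class, its `write` and `flush` methods and the `w` parameter are hypothetical names invented for illustration, not MongoDB driver calls, and replication is modelled as a simple pending queue per replica.

```javascript
// Hypothetical model: one primary plus replicas that apply writes lazily.
class ReplicaSetSketch {
  constructor(replicaCount) {
    this.primary = [];
    this.replicas = Array.from({ length: replicaCount }, () => ({
      data: [],
      pending: [],
    }));
  }

  // w = 0: return without waiting for any replica (no acknowledgement).
  // w = N: apply the write on N replicas before returning.
  write(doc, w = 0) {
    this.primary.push(doc);
    for (const r of this.replicas) r.pending.push(doc);
    let acks = 0;
    for (const r of this.replicas) {
      if (acks >= w) break;
      this.flush(r);
      acks += 1;
    }
    return acks;
  }

  // Apply everything a replica has queued (stands in for background replication).
  flush(replica) {
    replica.data.push(...replica.pending);
    replica.pending = [];
  }
}

const set = new ReplicaSetSketch(2);
const acksNone = set.write({ _id: 1 }); // fire and forget
const seenAfterNone = set.replicas[0].data.length; // replica has not caught up yet
const acksTwo = set.write({ _id: 2 }, 2); // wait for both replicas
const seenAfterTwo = set.replicas.map((r) => r.data.length);
console.log(acksNone, seenAfterNone, acksTwo, seenAfterTwo);
```

The trade-off the sketch tries to show: the unacknowledged write returns at once but a replica read may not see it yet, while the acknowledged write costs a wait and leaves every counted replica up to date.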
4) Graph databases
            Graph Theory




                           60
Neo4j

   LICENSE        AGPLv3 / Commercial
   LANGUAGE       Java
   API/PROTOCOL   REST, Java, SPARQL
   PERSISTENCE    On-disk linked-list

   Graph Data Structure: Nodes, Relationships, Properties on both
   Vertical Scalability (1000s times faster, but not distributed)
   Physical structure: LinkedList stored on disk
   HA cluster with ZooKeeper (nodes = exact replicas)
   High-performance node traversal

http://docs.neo4j.org/chunked/stable/ha-how.html                         61
Neo4j

   NeoService neo = ... // factory

   Transaction tx = neo.beginTx();

   Node n1 = neo.createNode();
   n1.setProperty("name", "John");
   n1.setProperty("age", 35);

   Node n2 = neo.createNode();
   n2.setProperty("name", "Mary");
   n2.setProperty("age", 29);
   n2.setProperty("job", "engineer");

   n1.createRelationshipTo(n2, RelTypes.KNOWS);

   tx.commit();

http://docs.neo4j.org/chunked/stable/ha-how.html                         61
Neo4j

   // n1: the node created for John
   Traverser friendTraverser = n1.traverse(
       Traverser.Order.BREADTH_FIRST,
       StopEvaluator.END_OF_GRAPH,
       ReturnableEvaluator.ALL_BUT_START_NODE,
       RelTypes.KNOWS,
       Direction.OUTGOING
   );

   // Traverse the node space
   System.out.println("John's friends: ");
   for (Node friend : friendTraverser) {
       System.out.printf("At depth %d => %s%n",
           friendTraverser.currentPosition().getDepth(),
           friend.getProperty("name")
       );
   }

http://docs.neo4j.org/chunked/stable/ha-how.html                         61
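What a breadth-first traversal over outgoing KNOWS relationships does can be sketched without Neo4j at all. The snippet below is plain JavaScript over an in-memory object, not the Neo4j API. The `nodes` map, the `knows` arrays, the `traverseBreadthFirst` function and the extra people (Luke, Anna) are assumptions made up for illustration; only John and Mary come from the slides.

```javascript
// Hypothetical in-memory graph: node id -> properties plus outgoing KNOWS edges.
const nodes = {
  n1: { name: "John", age: 35, knows: ["n2", "n3"] },
  n2: { name: "Mary", age: 29, job: "engineer", knows: ["n4"] },
  n3: { name: "Luke", knows: ["n4"] },
  n4: { name: "Anna", knows: [] },
};

// Breadth-first walk over outgoing edges. Returns every reachable node
// except the start node, together with its depth from the start.
function traverseBreadthFirst(graph, startId) {
  const visited = new Set([startId]);
  const queue = [{ id: startId, depth: 0 }];
  const result = [];
  while (queue.length > 0) {
    const { id, depth } = queue.shift();
    if (id !== startId) result.push({ name: graph[id].name, depth });
    for (const nextId of graph[id].knows) {
      if (!visited.has(nextId)) {
        visited.add(nextId); // mark on enqueue so a node is reported once
        queue.push({ id: nextId, depth: depth + 1 });
      }
    }
  }
  return result;
}

const friends = traverseBreadthFirst(nodes, "n1");
for (const f of friends) console.log(`At depth ${f.depth} => ${f.name}`);
```

The queue is what makes the walk breadth-first: all nodes at depth 1 are reported before any node at depth 2. A graph database aims to make the "follow the outgoing edges" step cheap, which is the point of the "High-performance node traversal" claim.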
Final Considerations
                   Query modes
       Achievements and Problems




                                   62
Query Modes: a new “SQL”?

          Map/Reduce




                            63
SQL vs. Map/Reduce

  MySQL

    SELECT
        Dim1, Dim2,                          -- (1)
        SUM(Measure1) AS MSum,               -- (2)
        COUNT(*) AS RecordCount,
        AVG(Measure2) AS MAvg,               -- (3)
        MIN(Measure1) AS MMin,
        MAX(CASE                             -- (4)
           WHEN Measure2 < 100
           THEN Measure2
        END) AS MMax
    FROM DenormAggTable
    WHERE (Filter1 IN ('A','B'))             -- (5)
        AND (Filter2 = 'C')
        AND (Filter3 > 123)
    GROUP BY Dim1, Dim2                      -- (1)
    HAVING (MMin > 0)                        -- (6)
    ORDER BY RecordCount DESC                -- (7)
    LIMIT 4, 8

  MongoDB

    db.runCommand({
    mapreduce: "DenormAggCollection",
    query: {
         filter1: { '$in': [ 'A', 'B' ] },
         filter2: 'C',
         filter3: { '$gt': 123 }
      },
    map: function() { emit(
         { d1: this.Dim1, d2: this.Dim2 },
         { msum: this.measure1, recs: 1, mmin: this.measure1,
           mmax: this.measure2 < 100 ? this.measure2 : 0 }
      );},
    reduce: function(key, vals) {
         // seed the minimum from the first value: starting at 0 would hide every positive minimum
         var ret = { msum: 0, recs: 0, mmin: vals[0].mmin, mmax: 0 };
         for(var i = 0; i < vals.length; i++) {
           ret.msum += vals[i].msum;
           ret.recs += vals[i].recs;
           if(vals[i].mmin < ret.mmin) ret.mmin = vals[i].mmin;
           if((vals[i].mmax < 100) && (vals[i].mmax > ret.mmax))
             ret.mmax = vals[i].mmax;
         }
         return ret;
      },
    finalize: function(key, val) {
         val.mavg = val.msum / val.recs;
         return val;
      },
    out: 'result1',
    verbose: true
    });

    db.result1.
      find({ mmin: { '$gt': 0 } }).
      sort({ recs: -1 }).
      skip(4).
      limit(8);

  Notes

    (1) Grouped dimension columns are pulled out as keys in the map function,
        reducing the size of the working set.
    (2) Measures must be manually aggregated.
    (3) Aggregates depending on record counts must wait until finalization.
    (4) Measures can use procedural logic.
    (5) Filters have an ORM/ActiveRecord-looking style.
    (6) Aggregate filtering must be applied to the result set, not in the map/reduce.
    (7) Ascending: 1; Descending: -1

  Revision 4, Created 2010-03-06
  Rick Osborne, rickosborne.org
  http://rickosborne.org/download/SQL-to-MongoDB.pdf                     64
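To see how the map, reduce and finalize stages fit together, the same pipeline can be run in plain JavaScript over an in-memory array. This is a rough sketch, not MongoDB's implementation: the sample `rows`, the `mapReduce` helper and the use of `JSON.stringify` for grouping keys are assumptions made for illustration. The reduce here seeds the minimum from the first value, and it is called exactly once per key, whereas MongoDB may call reduce repeatedly on partial results.

```javascript
// Hypothetical sample rows standing in for DenormAggCollection.
const rows = [
  { Dim1: "x", Dim2: "y", measure1: 5, measure2: 40 },
  { Dim1: "x", Dim2: "y", measure1: 3, measure2: 150 },
  { Dim1: "x", Dim2: "z", measure1: 7, measure2: 90 },
];

// map: turn one row into a [key, value] pair (the role emit() plays in MongoDB).
function map(row) {
  return [
    JSON.stringify({ d1: row.Dim1, d2: row.Dim2 }),
    { msum: row.measure1, recs: 1, mmin: row.measure1,
      mmax: row.measure2 < 100 ? row.measure2 : 0 },
  ];
}

// reduce: fold all values that share a key into one value of the same shape.
function reduce(key, vals) {
  const ret = { msum: 0, recs: 0, mmin: vals[0].mmin, mmax: 0 };
  for (const v of vals) {
    ret.msum += v.msum;
    ret.recs += v.recs;
    if (v.mmin < ret.mmin) ret.mmin = v.mmin;
    if (v.mmax > ret.mmax) ret.mmax = v.mmax;
  }
  return ret;
}

// finalize: derive values that need the complete group, such as the average.
function finalize(key, val) {
  return { ...val, mavg: val.msum / val.recs };
}

// Group mapped values by key, then reduce and finalize each group.
function mapReduce(input) {
  const groups = new Map();
  for (const row of input) {
    const [key, value] = map(row);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  const out = [];
  for (const [key, vals] of groups) {
    out.push({ key: JSON.parse(key), value: finalize(key, reduce(key, vals)) });
  }
  return out;
}

// HAVING / ORDER BY / LIMIT are applied to the result set, after map/reduce.
const result = mapReduce(rows)
  .filter((r) => r.value.mmin > 0)
  .sort((a, b) => b.value.recs - a.value.recs)
  .slice(0, 8);
console.log(JSON.stringify(result, null, 2));
```

The sample set is tiny, so the sketch keeps the first eight results with `slice(0, 8)` instead of skipping four rows first as `LIMIT 4, 8` does.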
Pig vs. Map/Reduce




                     65
Data model, Relations and Consistency



        A step backwards?

       Scalability, availability and resilience
                   come at a cost


                                                  66
Big Data

            collect
             store
           organise
           analyse
             share

   we don’t always know up-front which questions we’re going to ask

   Werner Vogels, CTO, Amazon - STRATA Conf 2011
                                                         67
We’re Hiring!




http://mediasift.com/careers
                               68
Lorenzo Alberton
                   @lorenzoalberton




   Thank you!
       lorenzo@alberton.info
http://www.alberton.info/talks




   http://joind.in/talk/view/2517
                                      69

NoSQL Databases: Why, what and when

  • 1. Lorenzo Alberton @lorenzoalberton NoSQL Databases: Why, what and when NoSQL Databases Demystified PHP UK Conference, 25th February 2011 1
  • 8. New Trends 2002 2004 2006 2008 2010 2012 Big data Connectivity P2P Knowledge Concurrency Diversity Cloud-Grid 3
  • 10. What’s the problem with RDBMS’s? Caching, Master/Slave, Master/Master, Cluster, Table Partitioning, Federated Tables, Sharding, Distributed DBs. http://www.codefutures.com/database-sharding 4
  • 12. What’s the problem with RDBMS’s? http://www.flickr.com/photos/dimi3/3096166092 5
  • 14. Quick Comparison Overview from 10,000 feet (random impressions from the interwebs) http://www.flickr.com/photos/42433826@N00/4914337851 6
  • 15. MongoDB is web-scale ...but /dev/null is even better! 7
  • 16. Cassandra is teh schnitz 8
  • 17. CouchDB: Relax! 9
  • 18. No, seriously...* (*) Not another “Mine is bigger” comparison, please 10
  • 19. A little theory: Fundamental Principles of (Distributed) Databases http://www.timbarcz.com/blog/PassionInProgrammers.aspx 11
  • 20. ACID. ATOMICITY: All or nothing. CONSISTENCY: Any transaction will take the db from one consistent state to another, with no broken constraints (referential integrity). ISOLATION: Other operations cannot access data that has been modified during a transaction that has not yet completed. DURABILITY: Ability to recover the committed transaction updates against any kind of system failure (transaction log). 12
  • 25. Isolation Levels, Locking & MVCC Isolation noun Property that defines how/when the changes made by one operation become visible to other concurrent operations SERIALIZABLE REPEATABLE READ All transactions occur in a Multiple SELECT statements completely isolated fashion, as issued in the same transaction if they were executed serially will always yield the same result READ COMMITTED READ UNCOMMITTED A lock is acquired only on the A transaction can access rows currently read/updated uncommitted changes made by other transactions 13
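  • 26. Isolation Levels, Locking & MVCC Anomalies by isolation level ("-" = cannot occur). Serializable: Dirty Reads -, Non-repeatable Reads -, Phantoms -. Repeatable Read: Dirty Reads -, Non-repeatable Reads -, Phantoms possible. Read Committed: Dirty Reads -, Non-repeatable Reads possible, Phantoms possible. Read Uncommitted: Dirty Reads possible, Non-repeatable Reads possible, Phantoms possible. http://www.adayinthelifeof.nl/2010/12/20/innodb-isolation-levels/ 14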
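  • 27. Isolation Levels, Locking & MVCC Locks by isolation level ("-" = not held). Serializable: Range Lock held, Read Lock held, Write Lock held. Repeatable Read: Range Lock -, Read Lock held, Write Lock held. Read Committed: Range Lock -, Read Lock -, Write Lock held. Read Uncommitted: Range Lock -, Read Lock -, Write Lock - 15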
  • 28. Multi-Version Concurrency Control [Diagram: tree of Root, Index and Data nodes] 16
  • 29. Multi-Version Concurrency Control [Diagram: tree of Root, Index and Data nodes; labels: obsolete, new version] 16
  • 30. Multi-Version Concurrency Control [Diagram: tree of Root, Index and Data nodes; labels: obsolete, new version, atomic pointer update] 16
  • 31. Multi-Version Concurrency Control [Diagram: tree of Root, Index and Data nodes; labels: obsolete, new version, atomic pointer update, marked for compaction] 16
  • 32. Multi-Version Concurrency Control [Diagram: tree of Root, Index and Data nodes; labels: obsolete, new version, atomic pointer update, marked for compaction] Reads: never blocked 16
  • 33. Distributed Transactions - 2PC Coordinator 1) COMMIT REQUEST PHASE (voting phase) Participants 17
  • 34. Distributed Transactions - 2PC Coordinator 1) COMMIT REQUEST PHASE (voting phase) Query to commit Participants 17
  • 35. Distributed Transactions - 2PC Coordinator 1) COMMIT REQUEST PHASE (voting phase) Participants 1) Exec Transaction up to the COMMIT request 2) Write entry to undo and redo logs 17
  • 36. Distributed Transactions - 2PC Coordinator 1) COMMIT REQUEST PHASE (voting phase) Agree or Abort Participants 17
  • 37. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) a) SUCCESS (agreement from all) Participants 18
  • 38. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) Commit a) SUCCESS (agreement from all) Participants 18
  • 39. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) a) SUCCESS (agreement from all) Participants 1) Complete operation 2) Release locks 18
  • 40. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) Acknowledge a) SUCCESS (agreement from all) Participants 18
  • 41. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE Complete transaction (completion phase) a) SUCCESS (agreement from all) Participants 18
  • 42. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) b) FAILURE (abort from any) Participants 19
  • 43. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) Rollback b) FAILURE (abort from any) Participants 19
  • 44. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) b) FAILURE (abort from any) Participants 1) Undo operation 2) Release locks 19
  • 45. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) Acknowledge b) FAILURE (abort from any) Participants 19
  • 46. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE Undo transaction (completion phase) b) FAILURE (abort from any) Participants 19
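The two phases walked through in the 2PC slides above can be sketched in a few lines of Python. This is only a minimal in-memory illustration, not a real distributed implementation: the `Participant` class, its `can_commit` flag and the `two_phase_commit` function are hypothetical names introduced here, and there is no networking, timeout handling or coordinator recovery.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Hypothetical cohort member; `can_commit` simulates its vote."""
    name: str
    can_commit: bool = True
    log: list = field(default_factory=list)
    state: str = "idle"

    def prepare(self, txn):
        # Phase 1: execute up to the commit point and write undo/redo log entries.
        self.log.append(("undo/redo", txn))
        self.state = "prepared" if self.can_commit else "refused"
        return self.can_commit  # vote: agree (True) or abort (False)

    def commit(self, txn):
        # Phase 2a: complete the operation and release locks, then acknowledge.
        self.state = "committed"
        return True

    def rollback(self, txn):
        # Phase 2b: undo the operation and release locks, then acknowledge.
        self.state = "rolled_back"
        return True


def two_phase_commit(participants, txn):
    # 1) Commit request (voting) phase: query every participant.
    votes = [p.prepare(txn) for p in participants]
    if all(votes):
        # 2a) Success: agreement from all, so tell everyone to commit.
        for p in participants:
            p.commit(txn)
        return "committed"
    # 2b) Failure: abort from any, so tell everyone to roll back.
    for p in participants:
        p.rollback(txn)
    return "aborted"
```

A single refusing participant is enough to abort the whole transaction, which is the "biased to the abort case" behaviour. The sketch also hides the blocking problem: if the coordinator died between the two loops, prepared participants would wait indefinitely.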
  • 47. Problems with 2PC Blocking protocol. Risk of indefinite cohort blocks if coordinator fails. Conservative behaviour, biased to the abort case 20
  • 48. Paxos Algorithm (Consensus) Family of Fault-tolerant, distributed implementations Spectrum of trade-offs: Number of processors Number of message delays Activity level of participants Number of messages sent Types of failures http://www.usenix.org/event/nsdi09/tech/full_papers/yabandeh/yabandeh_html/ http://en.wikipedia.org/wiki/Paxos_algorithm 21
  • 50. ACID & Distributed Systems http://images.tribe.net/tribe/upload/photo/deb/074/deb074db-81fc-4b8a-bfbd-b18b922885cb 23
  • 51. ACID & Distributed Systems ACID properties are always desirable But what about: Latency Partition Tolerance High Availability ? http://images.tribe.net/tribe/upload/photo/deb/074/deb074db-81fc-4b8a-bfbd-b18b922885cb 23
  • 52. CAP Theorem (Brewer’s conjecture) 2000 Prof. Eric Brewer, PoDC Conference Keynote 2002 Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2) Of three properties of shared-data systems - data Consistency, system Availability and tolerance to network Partitions - only two can be achieved at any given moment in time. http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf 24
  • 54. Partition Tolerance - Availability “The network will be allowed to lose arbitrarily many messages sent from one node to another” [...] “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” - Gilbert and Lynch, SIGACT 2002 http://codahale.com/you-cant-sacrifice-partition-tolerance http://pl.atyp.us/wordpress/?p=2521 25
  • 56. Partition Tolerance - Availability “The network will be allowed to lose arbitrarily many messages sent from one node to another” [...] “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” - Gilbert and Lynch, SIGACT 2002 CP: requests can complete at nodes that have quorum AP: requests can complete at any live node, possibly violating strong consistency http://codahale.com/you-cant-sacrifice-partition-tolerance http://pl.atyp.us/wordpress/?p=2521 25
  • 57. Partition Tolerance - Availability “The network will be allowed to lose arbitrarily many messages sent from one node to another” [...] “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” - Gilbert and Lynch, SIGACT 2002 CP: requests can complete at nodes that have quorum AP: requests can complete at any live node, possibly violating strong consistency HIGH LATENCY ≈ NETWORK PARTITION http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html http://codahale.com/you-cant-sacrifice-partition-tolerance http://pl.atyp.us/wordpress/?p=2521 25
  • 58. Consistency: Client-side view A service that is consistent operates fully or not at all. Strong consistency (as in ACID) Weak consistency (no guarantee) - Inconsistency window (*) Temporary inconsistencies (e.g. in data constraints or replica versions) are accepted, but they’re resolved at the earliest opportunity http://www.allthingsdistributed.com/2008/12/eventually_consistent.html 26
  • 59. Consistency: Client-side view A service that is consistent operates fully or not at all. Strong consistency (as in ACID) Weak consistency (no guarantee) - Inconsistency window Eventual* consistency (e.g. DNS) Causal consistency Read-your-writes consistency (the least surprise) Session consistency Monotonic read consistency Monotonic write consistency (*) Temporary inconsistencies (e.g. in data constraints or replica versions) are accepted, but they’re resolved at the earliest opportunity http://www.allthingsdistributed.com/2008/12/eventually_consistent.html 26
  • 60. Consistency: Server-side (Quorum) N = number of nodes with a replica of the data (*). W = number of replicas that must acknowledge the update. R = minimum number of replicas that must participate in a successful read operation. (*) but the data will be written to N nodes no matter what. W + R > N: Strong consistency (usually N=3, W=R=2). W = N, R = 1: Optimised for reads. W = 1, R = N: Optimised for writes (durability not guaranteed in presence of failures). W + R <= N: Weak consistency 27
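The W + R > N rule above can be checked by brute force: it holds exactly when every possible write set must share at least one replica with every possible read set. The following is a small sketch of that argument; the function names `consistency` and `quorums_always_overlap` are made up for this illustration.

```python
from itertools import combinations


def consistency(n, w, r):
    """Classify an (N, W, R) configuration using the rule from the slide."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("W and R must be between 1 and N")
    return "strong" if w + r > n else "weak"


def quorums_always_overlap(n, w, r):
    """Exhaustively check that any W acknowledging replicas and any R
    replicas taking part in a read have at least one node in common."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )
```

For N=3, W=R=2 every read quorum contains at least one replica that acknowledged the latest write. With W=1, R=2 a read can miss the only replica that acknowledged it.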
  • 61. Amazon Dynamo Paper Consistent Hashing Vector Clocks Gossip Protocol Hinted Handoffs Read Repair http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf 28
  • 62. Modulo-based Hashing N1 N2 N3 N4 29
  • 63. Modulo-based Hashing N1 N2 N3 N4 ? 29
  • 64. Modulo-based Hashing N1 N2 N3 N4 partition = key % n_servers 29
  • 65. Modulo-based Hashing N1 N2 N3 N4 partition = key % n_servers, which becomes partition = key % (n_servers - 1) when a server is removed 29
  • 66. Modulo-based Hashing N1 N2 N3 N4 partition = key % n_servers, which becomes partition = key % (n_servers - 1) when a server is removed. Recalculate the hashes for all the entries if n_servers changes (i.e. full data redistribution when adding/removing a node) 29
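To get a feel for how much data moves under modulo-based partitioning, the sketch below counts the keys whose partition changes when the number of servers changes. It assumes integer keys (for example, already-hashed values); `moved_fraction` is a name introduced here.

```python
def moved_fraction(keys, old_n, new_n):
    """Fraction of keys whose partition (key % n_servers) changes
    when the cluster goes from old_n to new_n servers."""
    keys = list(keys)
    moved = sum(1 for k in keys if k % old_n != k % new_n)
    return moved / len(keys)


# Growing from 4 to 5 servers relocates most of the keys.
print(moved_fraction(range(12000), 4, 5))
```

With evenly spread keys, only those where `key % 4` and `key % 5` happen to coincide stay put, which is why nearly the whole data set has to be redistributed.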
  • 67. Consistent Hashing Ring (key space) from 0 to 2^160, with nodes A, B, C, D, E, F placed on it. Same hash function for data and nodes: idx = hash(key). Coordinator: next available clockwise node http://en.wikipedia.org/wiki/Consistent_hashing 30
  • 69. Consistent Hashing Ring (key space) from 0 to 2^160, with nodes A, B, C, D, E, F placed on it. Same hash function for data and nodes: idx = hash(key). Coordinator: next available clockwise node. Label: canonical home (coordinator node) for key range A-B http://en.wikipedia.org/wiki/Consistent_hashing 30
  • 71. Consistent Hashing Ring (key space) from 0 to 2^160, with nodes A, B, C, D, E, F placed on it. Same hash function for data and nodes: idx = hash(key). Coordinator: next available clockwise node. Label: canonical home for key range A-C http://en.wikipedia.org/wiki/Consistent_hashing 30
  • 72. Consistent Hashing Ring (key space) from 0 to 2^160, with nodes A, B, C, D, E, F placed on it. Same hash function for data and nodes: idx = hash(key). Coordinator: next available clockwise node. Labels: canonical home for key range A-C; only the keys in this range change location http://en.wikipedia.org/wiki/Consistent_hashing 30
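The ring described in the consistent hashing slides above can be sketched as a sorted list of node positions plus a binary search. This is a minimal illustration under some assumptions: SHA-1 is used because its 160-bit output matches the 0 to 2^160 key space on the slide, and `HashRing`, `ring_hash` and `coordinator` are names chosen for this example, not the API of any particular product.

```python
import bisect
import hashlib


def ring_hash(value):
    # Same hash function for data and nodes; SHA-1 yields a 160-bit position.
    return int.from_bytes(hashlib.sha1(value.encode()).digest(), "big")


class HashRing:
    def __init__(self, nodes=()):
        self._tokens = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        token = ring_hash(node)
        bisect.insort(self._tokens, token)
        self._owner[token] = node

    def remove(self, node):
        token = ring_hash(node)
        self._tokens.remove(token)
        del self._owner[token]

    def coordinator(self, key):
        # Next node clockwise from hash(key), wrapping around past 2^160.
        i = bisect.bisect_left(self._tokens, ring_hash(key)) % len(self._tokens)
        return self._owner[self._tokens[i]]


ring = HashRing(["A", "B", "C", "D", "E", "F"])
before = {f"key{i}": ring.coordinator(f"key{i}") for i in range(1000)}
ring.add("G")
moved = [k for k in before if ring.coordinator(k) != before[k]]
# Every key that moved now lives on the new node; all other keys stay put.
print(len(moved), all(ring.coordinator(k) == "G" for k in moved))
```

Unlike the modulo scheme, adding or removing a node only affects the key range between that node and its predecessor on the ring.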
  • 73. Consistent Hashing - Replication A F B Ring E (key space) D C http://horicky.blogspot.com/2009/11/nosql-patterns.html 31
  • 74. Consistent Hashing - Replication Ring (key space) with nodes A, B, C, D, E, F. Key_AB hosted in B, C, D. Data replicated in the N-1 clockwise successor nodes. Label: node hosting Key_FA, Key_AB, Key_BC http://horicky.blogspot.com/2009/11/nosql-patterns.html 31
  • 75. Consistent Hashing - Node Changes A F B E D C 32
  • 76. Consistent Hashing - Node Changes Key membership and replicas are updated when a node joins or leaves the network. The number of replicas for all data is kept consistent. [Diagram: ring with nodes A, B, C, D, E, F; labels: Copy Key Range AB, Copy Key Range FA, Copy Key Range EF] 32
  • 77. Consistent Hashing - Load Distribution Different Strategies. Virtual Nodes: random tokens per each physical node, partition by token value. Ring (key space) from 0 to 2^160 with tokens A to I. Node 1: tokens A, E, G. Node 2: tokens C, F, H. Node 3: tokens B, D, I 33
  • 78. Consistent Hashing - Load Distribution Different Strategies. Virtual Nodes: Q equal-sized partitions, S nodes, Q/S tokens per node (with Q >> S). Ring (key space) from 0 to 2^160, partitions assigned to Node 1, Node 2, Node 3, Node 4, ... 34
  • 79. Vector Clocks & Conflict Detection Causality-based partial order over events that happen in the system. Document version history: a counter for each node that updated the document. If all update counters in V1 are smaller than or equal to the corresponding update counters in V2, then V1 precedes V2. [Diagram, nodes A, B, C] write handled by A: D1 ([A, 1]) http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
  • 80. Vector Clocks & Conflict Detection Causality-based partial order over events that happen in the system. Document version history: a counter for each node that updated the document. If all update counters in V1 are smaller than or equal to the corresponding update counters in V2, then V1 precedes V2. [Diagram, nodes A, B, C] write handled by A: D1 ([A, 1]); write handled by A: D2 ([A, 2]) http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
  • 81. Vector Clocks & Conflict Detection Causality-based partial order over events that happen in the system. Document version history: a counter for each node that updated the document. If all update counters in V1 are smaller than or equal to the corresponding update counters in V2, then V1 precedes V2. [Diagram, nodes A, B, C] write handled by A: D1 ([A, 1]); write handled by A: D2 ([A, 2]); write handled by B: D3 ([A, 2], [B, 1]); write handled by C: D4 ([A, 2], [C, 1]) http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
  • 82. Vector Clocks & Conflict Detection Causality-based partial order over events that happen in the system. Document version history: a counter for each node that updated the document. If all update counters in V1 are smaller than or equal to the corresponding update counters in V2, then V1 precedes V2. [Diagram, nodes A, B, C] write handled by A: D1 ([A, 1]); write handled by A: D2 ([A, 2]); write handled by B: D3 ([A, 2], [B, 1]); write handled by C: D4 ([A, 2], [C, 1]); conflict detected, reconciliation handled by A: ? http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
  • 83. Vector Clocks & Conflict Detection Causality-based partial order over events that happen in the system. Document version history: a counter for each node that updated the document. If all update counters in V1 are smaller than or equal to the corresponding update counters in V2, then V1 precedes V2. [Diagram, nodes A, B, C] write handled by A: D1 ([A, 1]); write handled by A: D2 ([A, 2]); write handled by B: D3 ([A, 2], [B, 1]); write handled by C: D4 ([A, 2], [C, 1]); conflict detected, reconciliation handled by A: D5 ([A, 3], [B, 1], [C, 1]) http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
  • 84. Vector Clocks & Conflict Detection Vector Clocks can detect a conflict. The conflict resolution is left to the application or the user. The application might resolve conflicts by checking relative timestamps, or with other strategies (like merging the changes). Vector clocks can grow quite large (!) [Diagram, nodes A, B, C] write handled by A: D1 ([A, 1]) http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 36
  • 85. Vector Clocks & Conflict Detection Vector Clocks can detect a conflict. The conflict resolution is left to the application or the user. The application might resolve conflicts by checking relative timestamps, or with other strategies (like merging the changes). Vector clocks can grow quite large (!) [Diagram, nodes A, B, C] write handled by A: D1 ([A, 1]); write handled by A: D2 ([A, 2]) http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 36
  • 86. Vector Clocks & Conflict Detection Vector Clocks can detect a conflict. The conflict resolution is left to the application or the user. The application might resolve conflicts by checking relative timestamps, or with other strategies (like merging the changes). Vector clocks can grow quite large (!) [Diagram, nodes A, B, C] write handled by A: D1 ([A, 1]); write handled by A: D2 ([A, 2]); write handled by B: D3 ([A, 2], [B, 1]); un-modified replica: D4 ([A, 2]) http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 36
  • 87. Vector Clocks & Conflict Detection Vector Clocks can detect a conflict. The conflict resolution is left to the application or the user. The application might resolve conflicts by checking relative timestamps, or with other strategies (like merging the changes). Vector clocks can grow quite large (!) [Diagram, nodes A, B, C] write handled by A: D1 ([A, 1]); write handled by A: D2 ([A, 2]); write handled by B: D3 ([A, 2], [B, 1]); un-modified replica: D4 ([A, 2]); version mismatch detected, D3 ⊇ D4, conflict resolved automatically: D5 ([A, 3], [B, 1]) http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 36
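The comparison rule from the vector clock slides can be written down directly. In this sketch a clock is a plain dict mapping node name to update counter (a missing node counts as 0); `descends`, `compare` and `merge` are illustrative names, and `merge` only reconciles the clocks, since merging the document contents is left to the application.

```python
def descends(a, b):
    """True if clock `a` has seen every update recorded in clock `b`,
    i.e. every counter in b is <= the corresponding counter in a."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    if a == b:
        return "equal"
    if descends(a, b):
        return "a descends from b"
    if descends(b, a):
        return "b descends from a"
    return "conflict"  # concurrent updates: neither version precedes the other


def merge(a, b, node):
    """Clock for a reconciled version written by `node`: element-wise
    maximum of both clocks, then bump the reconciling node's counter."""
    merged = {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}
    merged[node] = merged.get(node, 0) + 1
    return merged


d3 = {"A": 2, "B": 1}
d4 = {"A": 2, "C": 1}
print(compare(d3, d4))     # conflict
print(merge(d3, d4, "A"))  # counters A=3, B=1, C=1, as in D5 on the slide
```

In the second scenario on the slides, D4 ([A, 2]) is an un-modified replica, so `compare({"A": 2, "B": 1}, {"A": 2})` reports that D3 descends from D4 and the newer version can win without application involvement.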
  • 88. Gossip Protocol + Hinted Handoff Gossip: periodic, pairwise, inter-process interactions of bounded size among randomly-chosen peers [Diagram: ring with nodes A, B, C, D, E, F] 37
  • 89. Gossip Protocol + Hinted Handoff Gossip: periodic, pairwise, inter-process interactions of bounded size among randomly-chosen peers [Diagram: ring with nodes A, B, C, D, E, F; speech bubbles:] “I can’t see B, it might be down but I need some ACK. My Merkle Tree root for range XY is ‘ab031dab4a385afda’” / “I can’t see B either. My Merkle Tree root for range XY is different!” / “B must be down then. Let’s disable it.” 37
  • 90. Gossip Protocol + Hinted Handoff Gossip: periodic, pairwise, inter-process interactions of bounded size among randomly-chosen peers [Diagram: ring with nodes A, B, C, D, E, F; speech bubbles:] “My canonical node is supposed to be B.” / “I see. Well, I’ll take care of it for now, and let B know when B is available again” 37
  • 91. Merkle Trees (Hash Trees) Leaves: hashes of data blocks. Nodes: hashes of their children. Used to detect inconsistencies between replicas (anti-entropy) and to minimise the amount of transferred data. [Diagram] ROOT = hash(A, B); A = hash(C, D); B = hash(E, F); C = hash(001); D = hash(002); E = hash(003); F = hash(004); Data Blocks 001, 002, 003, 004 http://en.wikipedia.org/wiki/Hash_tree 38
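The anti-entropy use of Merkle trees can be illustrated with a small sketch: two replicas build a tree over the same key range and walk down only the branches whose hashes differ. The assumptions here are SHA-256 as the hash, a number of blocks that is a power of two (as in the four-block diagram), and the hypothetical helper names `build` and `diff`.

```python
import hashlib


def h(data):
    return hashlib.sha256(data).digest()


def build(blocks):
    """Return the tree as a list of levels: levels[0] holds the leaf hashes,
    levels[-1] holds the single root hash. Assumes len(blocks) is a power of two."""
    level = [h(block) for block in blocks]
    levels = [level]
    while len(level) > 1:
        # Each parent is the hash of its two children.
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def diff(a, b, depth=None, index=0):
    """Indexes of the data blocks that differ between two trees of equal shape.
    Subtrees with matching hashes are skipped without looking at their leaves."""
    if depth is None:
        depth = len(a) - 1  # start at the root
    if a[depth][index] == b[depth][index]:
        return []
    if depth == 0:
        return [index]
    return (diff(a, b, depth - 1, 2 * index)
            + diff(a, b, depth - 1, 2 * index + 1))


replica_1 = build([b"001", b"002", b"003", b"004"])
replica_2 = build([b"001", b"002", b"XXX", b"004"])
print(diff(replica_1, replica_2))  # [2]: only the third block needs transferring
```

Replicas whose roots match exchange a single hash and nothing else, which is where the saving in transferred data comes from.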
  • 92. Read Repair A F B GET(k, R=2) E D C 39
  • 94. Read Repair k=XYZ (v.2) A k=XYZ (v.2) F B GET(k, R=2) E D C k=ABC (v.1) 39
  • 95. Read Repair A F B k=XYZ (v.2) E UPDATE(k = XYZ) D C 39
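The read repair sequence above (a GET returns the newest version, then the stale replica is updated) might look like the sketch below. It simplifies heavily: replicas are plain dicts mapping key to a (version, value) pair, versions are integers like the v.1 / v.2 labels on the slide rather than vector clocks, the repair happens inline instead of in the background, and `get_with_read_repair` is a hypothetical name.

```python
def get_with_read_repair(replicas, key, r=2):
    """Read `key` from the replicas, return the newest value and push it
    to any replica that answered with an older version."""
    answers = [replica[key] for replica in replicas if key in replica]
    if len(answers) < r:
        raise RuntimeError("not enough replicas answered to satisfy R")
    newest = max(answers, key=lambda answer: answer[0])  # highest version wins
    for replica in replicas:
        if replica.get(key, (0, None))[0] < newest[0]:
            replica[key] = newest  # UPDATE(k = newest value) on the stale node
    return newest[1]


a = {"k": (2, "XYZ")}
b = {"k": (2, "XYZ")}
c = {"k": (1, "ABC")}  # stale replica
print(get_with_read_repair([a, b, c], "k", r=2))  # XYZ
print(c)                                          # {'k': (2, 'XYZ')}
```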
  • 96. NoSQL Break-down Key-value stores, Column Families, Document-oriented dbs, Graph databases 40
  • 97. Focus Of Different Data Models Key-Value Stores Size Column Families Document Databases Graph Databases Complexity http://www.slideshare.net/emileifrem/nosql-east-a-nosql-overview-and-the-benefits-of-graph-databases 41
  • 98. 1) Key-value stores Amazon Dynamo Paper Data model: collection of key-value pairs 42
  • 99. Voldemort (AP) Dynamo DHT implementation. Consistent hashing, Vector clocks. LICENSE: Apache 2. LANGUAGE: Java. API/PROTOCOL: HTTP, Java, Thrift, Avro, Protobuf. PERSISTENCE: Pluggable (BDB/MySQL). CONCURRENCY: MVCC 43
  • 100. Voldemort (AP) Dynamo DHT implementation. Consistent hashing, Vector clocks. HTTP / Sockets. LICENSE: Apache 2. LANGUAGE: Java. API/PROTOCOL: HTTP, Java, Thrift, Avro, Protobuf. PERSISTENCE: Pluggable (BDB/MySQL). CONCURRENCY: MVCC 43
  • 101. Voldemort (AP) Dynamo DHT implementation. Consistent hashing, Vector clocks. Conflicts resolved at read and write time. LICENSE: Apache 2. LANGUAGE: Java. API/PROTOCOL: HTTP, Java, Thrift, Avro, Protobuf. PERSISTENCE: Pluggable (BDB/MySQL). CONCURRENCY: MVCC 43
  • 102. Voldemort (AP) Dynamo DHT implementation. Consistent hashing, Vector clocks. Json, Java String, byte[], Thrift, Avro, ProtoBuf. LICENSE: Apache 2. LANGUAGE: Java. API/PROTOCOL: HTTP, Java, Thrift, Avro, Protobuf. PERSISTENCE: Pluggable (BDB/MySQL). CONCURRENCY: MVCC 43
  • 103. Voldemort (AP) Dynamo DHT implementation. Consistent hashing, Vector clocks. Simple optimistic locking for multi-row updates, pluggable storage engine. LICENSE: Apache 2. LANGUAGE: Java. API/PROTOCOL: HTTP, Java, Thrift, Avro, Protobuf. PERSISTENCE: Pluggable (BDB/MySQL). CONCURRENCY: MVCC 43
  • 105. Membase (CP) DHT (K-V), no SPoF. membase = memcached (distributed in-memory) + persistence, replication (fail-over HA), rebalancing. “VBuckets”: unit of consistency and replication; owner of a subset of the cluster key space. LICENSE: Apache 2. LANGUAGE: C/C++, Erlang. API/PROTOCOL: REST/JSON, memcached http://dustin.github.com/2010/06/29/memcached-vbuckets.html 44
  • 106. Membase (CP) DHT (K-V), no SPoF. membase = memcached (distributed in-memory) + persistence, replication (fail-over HA), rebalancing. “VBuckets”: unit of consistency and replication; owner of a subset of the cluster key space; hash function + table lookup. LICENSE: Apache 2. LANGUAGE: C/C++, Erlang. API/PROTOCOL: REST/JSON, memcached http://dustin.github.com/2010/06/29/memcached-vbuckets.html 44
  • 107. Membase (CP) DHT (K-V), no SPoF. membase = memcached (distributed in-memory) + persistence, replication (fail-over HA), rebalancing. “VBuckets”: unit of consistency and replication; owner of a subset of the cluster key space; hash function + table lookup. All metadata kept in memory (high throughput / low latency). Manual/Programmatic failover via the Management REST API. LICENSE: Apache 2. LANGUAGE: C/C++, Erlang. API/PROTOCOL: REST/JSON, memcached http://dustin.github.com/2010/06/29/memcached-vbuckets.html 44
  • 108. Riak AP LICENSE Apache 2 LANGUAGE C, Erlang API/PROTOCOL REST HTTP * ProtoBuf Buckets → K-V “Links” (~relations) Targeted JS Map/Reduce Tune-able consistency (one-quorum-all) 45
  • 109. Redis (CP) K-V store, “Data Structures Server”: Map, Set, Sorted Set, Linked List. Set/Queue operations, Counters, Pub-Sub, Volatile keys. 10-100K op/s (whole dataset in RAM + VM). Persistence via snapshotting (tunable fsync freq.). Distributed if client supports consistent hashing. LICENSE: BSD. LANGUAGE: ANSI C. API/PROTOCOL: * + Telnet-like. PERSISTENCE: in memory, bg snapshots. REPLICATION: master-slave http://redis.io/presentation/Redis_Cluster.pdf 46
  • 110. 2) Column Families Google BigTable paper Data model: big table, column families 47
  • 111. Google BigTable Paper Sparse, distributed, persistent multi-dimensional sorted map indexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” “com.cnn.www” <html>... “CNN” “CNN.com” column column column row_key row http://labs.google.com/papers/bigtable-osdi06.pdf 48
  • 114. Google BigTable Paper Sparse, distributed, persistent multi-dimensional sorted map indexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” <html>... t3 <html>... t5 “com.cnn.www” <html>... t6 “CNN” t9 “CNN.com” t8 column column column row_key row http://labs.google.com/papers/bigtable-osdi06.pdf 48
  • 115. Google BigTable Paper Sparse, distributed, persistent multi-dimensional sorted map indexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” <html>... t3 <html>... t5 “com.cnn.www” <html>... t6 “CNN” t9 “CNN.com” t8 column column column row_key row column family http://labs.google.com/papers/bigtable-osdi06.pdf 48
  • 116. Google BigTable Paper Sparse, distributed, persistent multi-dimensional sorted map indexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” <html>... t3 <html>... t5 “com.cnn.www” <html>... t6 “CNN” t9 “CNN.com” t8 column column column row_key row column family ACL http://labs.google.com/papers/bigtable-osdi06.pdf 48
  • 117. Google BigTable Paper Sparse, distributed, persistent multi-dimensional sorted map indexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” <html>... t3 <html>... t5 “com.cnn.www” <html>... t6 “CNN” t9 “CNN.com” t8 column column column row_key row column family Atomic updates ACL http://labs.google.com/papers/bigtable-osdi06.pdf 48
  • 118. Google BigTable Paper Sparse, distributed, persistent multi-dimensional sorted map indexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” <html>... t3 <html>... t5 “com.cnn.www” <html>... t6 “CNN” t9 “CNN.com” t8 column column column row_key row column family Atomic updates Automatic GC ACL http://labs.google.com/papers/bigtable-osdi06.pdf 48
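The "sparse, multi-dimensional sorted map indexed by (row_key, column_key, timestamp)" can be mimicked in memory to make the lookup semantics concrete. This is a toy sketch and not how BigTable is implemented: `SparseSortedMap` and its methods are names invented here, timestamps are plain integers (t3, t5, ... on the slide), and keys are kept sorted with the newest timestamp first.

```python
import bisect


class SparseSortedMap:
    def __init__(self):
        self._keys = []   # sorted (row_key, column_key, -timestamp) tuples
        self._cells = {}  # key tuple -> value; absent cells cost nothing

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)  # negated so newer versions sort first
        if key not in self._cells:
            bisect.insort(self._keys, key)
        self._cells[key] = value

    def get(self, row, column, timestamp=None):
        """Newest value at or before `timestamp` (newest overall if None)."""
        bound = float("-inf") if timestamp is None else -timestamp
        i = bisect.bisect_left(self._keys, (row, column, bound))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._cells[self._keys[i]]
        return None

    def scan_row(self, row):
        """All (column, timestamp, value) cells of a row, in sorted order."""
        i = bisect.bisect_left(self._keys, (row,))
        cells = []
        while i < len(self._keys) and self._keys[i][0] == row:
            _, column, neg_ts = self._keys[i]
            cells.append((column, -neg_ts, self._cells[self._keys[i]]))
            i += 1
        return cells


table = SparseSortedMap()
for ts in (3, 5, 6):
    table.put("com.cnn.www", "contents:html", ts, f"<html>... t{ts}")
table.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
table.put("com.cnn.www", "anchor:my.look.ca", 8, "CNN.com")
print(table.get("com.cnn.www", "contents:html"))     # <html>... t6
print(table.get("com.cnn.www", "contents:html", 5))  # <html>... t5
```

Because the keys are sorted, all cells of a row, and all columns of a family within it (they share the `family:` prefix), sit next to each other, which is what makes row and range scans cheap.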
  • 119. Google BigTable: Data Structure SSTable: Smallest building block. Persistent immutable Map[k,v]. Operations: lookup by key / key range scan. [Diagram: SSTable made of 64KB blocks plus a lookup index] 49
  • 120. Google BigTable: Data Structure SSTable: Smallest building block. Persistent immutable Map[k,v]. Operations: lookup by key / key range scan. Tablet: Dynamically partitioned range of rows. Built from multiple SSTables. Unit of distribution and load balancing. [Diagram: Tablet (range Aaa → Bar) containing two SSTables, each made of 64KB blocks plus a lookup index] 49
  • 121. Google BigTable: Data Structure SSTable: Smallest building block. Persistent immutable Map[k,v]. Operations: lookup by key / key range scan. Tablet: Dynamically partitioned range of rows. Built from multiple SSTables. Unit of distribution and load balancing. Table: Multiple Tablets (table segments) make up a table. [Diagram: Table containing a Tablet (range Aaa → Bar) with two SSTables, each made of 64KB blocks plus a lookup index] 49
  • 122. Google BigTable: I/O memtable read memory GFS tablet log SSTable SSTable SSTable write 50
  • 123. Google BigTable: I/O memtable read minor compaction memory GFS tablet log SSTable SSTable SSTable write 50
  • 124. Google BigTable: I/O memtable read minor compaction memory GFS tablet log SSTable SSTable SSTable write BMDiff Zippy 50
  • 125. Google BigTable: I/O memtable read minor compaction memory GFS tablet log SSTable SSTable SSTable write BMDiff Zippy merging / major compaction (GC) 50
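The write and read paths in the I/O diagrams above (tablet log, memtable, minor compaction into SSTables, merging / major compaction) can be approximated with a toy in-memory model. Everything here is an assumption made for illustration: the `Tablet` class, the `memtable_limit` parameter and the use of dicts in place of on-disk files. Deletion markers, compression (BMDiff, Zippy) and GFS are left out.

```python
class Tablet:
    def __init__(self, memtable_limit=2):
        self.log = []        # tablet (commit) log, appended before every write
        self.memtable = {}   # recent writes, held in memory
        self.sstables = []   # flushed tables, oldest first; treated as immutable
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def minor_compaction(self):
        # Freeze the memtable into a new key-sorted SSTable, start a fresh memtable.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # Reads see a merged view: memtable first, then SSTables newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None

    def major_compaction(self):
        # Merge all SSTables into one; newer values overwrite older ones.
        merged = {}
        for sstable in self.sstables:
            merged.update(sstable)
        self.sstables = [dict(sorted(merged.items()))]


tablet = Tablet(memtable_limit=2)
tablet.write("a", 1)
tablet.write("b", 2)   # memtable is full: flushed to the first SSTable
tablet.write("a", 3)
print(tablet.read("a"), tablet.read("b"))  # 3 2
```

Writes never modify an existing SSTable; superseded values are only discarded when a major compaction rewrites the files, which is where the "(GC)" label in the diagram comes in.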
  • 126. Google BigTable: Location Dereferencing Chubby Master File → Root Tablet → Metadata Tablets → User Tables. Chubby: replicated, persisted lock service; maintains tablet server locations; 5 replicas, one elected master (via quorum); Paxos algorithm used to keep consistency. Root Tablet: root of the metadata tree. Up to 3 levels in the metadata hierarchy 51
  • 127. Google BigTable: Architecture fs metadata, ACL, GC, load balancing BigTable metadata operations BigTable client master data R/W heartbeat operations messages, GC, chunk migration Tablet Tablet Tablet Chubby Server Server Server track master lock, log of live servers Tablet Tablet Tablet 52
  • 128. HBase (CP) OSS implementation of BigTable. LICENSE: Apache 2. LANGUAGE: Java. API/PROTOCOL: REST HTTP, Thrift. PERSISTENCE: memtable/SSTable 53
  • 129. HBase (CP) OSS implementation of BigTable. ZooKeeper as coordinator (instead of Chubby). LICENSE: Apache 2. LANGUAGE: Java. API/PROTOCOL: REST HTTP, Thrift. PERSISTENCE: memtable/SSTable 53
  • 130. HBase (CP) OSS implementation of BigTable. Support for multiple masters. LICENSE: Apache 2. LANGUAGE: Java. API/PROTOCOL: REST HTTP, Thrift. PERSISTENCE: memtable/SSTable 53
  • 131. HBase (CP) OSS implementation of BigTable. HDFS, S3, S3N, EBS (with GZip/LZO CF compression). LICENSE: Apache 2. LANGUAGE: Java. API/PROTOCOL: REST HTTP, Thrift. PERSISTENCE: memtable/SSTable 53
  • 132. HBase (CP) OSS implementation of BigTable. Data sorted by key but evenly distributed across the cluster. LICENSE: Apache 2. LANGUAGE: Java. API/PROTOCOL: REST HTTP, Thrift. PERSISTENCE: memtable/SSTable 53
  • 133. HBase (CP) OSS implementation of BigTable. Batch Streaming, Map/Reduce. LICENSE: Apache 2. LANGUAGE: Java. API/PROTOCOL: REST HTTP, Thrift. PERSISTENCE: memtable/SSTable 53
  • 135. Hypertable (CP) OSS BigTable implementation. Faster than HBase (10-30K/s). HQL (~SQL). LICENSE: GPLv2. LANGUAGE: C++. API/PROTOCOL: C++, Thrift. PERSISTENCE: memtable/SSTable. CONCURRENCY: MVCC 54
  • 136. Hypertable (CP) OSS BigTable implementation. Faster than HBase (10-30K/s). Hyperspace (paxos) used instead of ZooKeeper. HQL (~SQL). LICENSE: GPLv2. LANGUAGE: C++. API/PROTOCOL: C++, Thrift. PERSISTENCE: memtable/SSTable. CONCURRENCY: MVCC 54
  • 137. Hypertable (CP) OSS BigTable implementation. Faster than HBase (10-30K/s). Dynamically adapts to changes in workload. HQL (~SQL). LICENSE: GPLv2. LANGUAGE: C++. API/PROTOCOL: C++, Thrift. PERSISTENCE: memtable/SSTable. CONCURRENCY: MVCC 54
  • 139. Cassandra AP LICENSE Data model of BigTable, infrastructure of Dynamo Apache 2 LANGUAGE Java PROTOCOL B col_name Thrift Avro col_value PERSISTENCE timestamp memtable/ SSTable Column CONSISTENCY Tunable R/W/N x http://www.javageneration.com/?p=70 @cassandralondon http://www.meetup.com/Cassandra-London/ 55
  • 140. Cassandra AP LICENSE Data model of BigTable, infrastructure of Dynamo Apache 2 LANGUAGE super_column_name Java PROTOCOL B col_name col_name Thrift ... Avro PERSISTENCE col_value col_value timestamp timestamp memtable/ SSTable CONSISTENCY Tunable R/W/N x http://www.javageneration.com/?p=70 @cassandralondon http://www.meetup.com/Cassandra-London/ 55
  • 141. Cassandra AP LICENSE Data model of BigTable, infrastructure of Dynamo Column Family Apache 2 LANGUAGE Java PROTOCOL B col_name col_name Thrift row_key ... Avro PERSISTENCE col_value col_value timestamp timestamp memtable/ SSTable CONSISTENCY Tunable R/W/N x http://www.javageneration.com/?p=70 @cassandralondon http://www.meetup.com/Cassandra-London/ 55
  • 145. Cassandra (AP): data model of BigTable, infrastructure of Dynamo. LICENSE: Apache 2. LANGUAGE: Java. PROTOCOL: Thrift, Avro. PERSISTENCE: memtable/SSTable. CONSISTENCY: tunable R/W/N (ONE, QUORUM, ALL). Super Column Family: a row_key mapped to super columns (super_column_name), each holding columns (col_name, col_value, timestamp). Access: keyspace.get("column_family", key, ["super_column",] "column"). Cluster: P2P ring of nodes (A to F), Gossip. Partitioners: RandomPartitioner (MD5), OrderPreservingPartitioner. Range Scans, Fulltext Index (Solandra). http://www.javageneration.com/?p=70 @cassandralondon http://www.meetup.com/Cassandra-London/
  • 146. 3) Document DBs: Lotus Notes. Data model: collection of K-V collections.
  • 155. CouchDB (AP). LICENSE: Apache 2. LANGUAGE: Erlang. API/PROTOCOL: REST/JSON. PERSISTENCE: append-only B+Tree. CONCURRENCY: MVCC. CONSISTENCY: crash-only design. REPLICATION: multi-master. Features: JSON docs; map-reduce “views” (materialised resultset); Storage + View Indexes (B+Tree) [by_id_index, by_seqnum_index]; replication used as a way to scale transaction volume; conflict resolution at application level; MVCC (copy-on-modify), volatile versioning; online compaction (very primitive VACUUM); update validation / auth triggers; delayed commits (write performance). http://horicky.blogspot.com/2008/10/couchdb-implementation.html
  • 156. CouchDB (AP). MVCC consequences: compaction load, disk space. http://enda.squarespace.com/tech/2009/12/8/couchdb-compaction-big-impacts.html http://chesnok.com/talks/mvcc_couchcamp.pdf (PgSQL VACUUM) http://horicky.blogspot.com/2008/10/couchdb-implementation.html
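Why append-only MVCC costs disk space and compaction work can be seen in a toy model. The sketch below is an assumption-laden illustration, not CouchDB's storage engine: `AppendOnlyStore` is a made-up class, revisions are plain integers (CouchDB uses opaque revision strings), and a stale update raises `ValueError` where CouchDB would answer with an HTTP 409 conflict.

```python
class AppendOnlyStore:
    """Toy append-only document store with optimistic (MVCC-style) updates."""

    def __init__(self):
        self.log = []      # every revision ever written: (doc_id, rev, body)
        self.current = {}  # doc_id -> index in self.log of the live revision

    def put(self, doc_id, body, rev=None):
        """Append a new revision; `rev` must match the live revision number."""
        live = self.current.get(doc_id)
        live_rev = None if live is None else self.log[live][1]
        if rev != live_rev:
            raise ValueError("conflict: stale revision")
        new_rev = 1 if live_rev is None else live_rev + 1
        self.log.append((doc_id, new_rev, dict(body)))  # old entry stays on "disk"
        self.current[doc_id] = len(self.log) - 1
        return new_rev

    def get(self, doc_id):
        """Return (revision, body) of the live revision."""
        _, rev, body = self.log[self.current[doc_id]]
        return rev, body

    def compact(self):
        """Rewrite the log keeping only live revisions; return entries dropped."""
        before = len(self.log)
        live = [self.log[i] for i in sorted(self.current.values())]
        self.log = live
        self.current = {doc_id: i for i, (doc_id, _, _) in enumerate(live)}
        return before - len(live)
```

Every update grows the log, so a frequently updated database keeps growing until `compact()` copies the live revisions into a fresh log, which is the load the slide refers to.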
  • 165. MongoDB (CP). LICENSE: AGPLv3. LANGUAGE: C++. API/PROTOCOL: REST/BSON *. PERSISTENCE: B+Tree, Snapshots. CONCURRENCY: in-place updates. REPLICATION: master-slave, replica sets. Features: BSON serialisation (storage & transfer), bson_encode() / bson_decode(); Auto-Sharding, Master-Slave, Auto-Failover; B-Tree indexes (on different cols too); Geo-Spatial indexes; update in place (v.1, v.2; no versioning, no append-only log); persistence via Replication + Snapshotting; GROUP BY, Map/Reduce (well, aggregation); no ACK on updates (or ensure N replicas); New! Improved! --dur flag.
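To make "BSON serialisation" concrete, here is a minimal encoder for a tiny subset of BSON. It is a sketch under stated assumptions: `tiny_bson_encode` is a made-up name unrelated to the `bson_encode()` function on the slide, and it handles only flat documents whose values are UTF-8 strings or 32-bit integers.

```python
import struct


def tiny_bson_encode(doc):
    """Encode a flat dict of str -> (str | 32-bit int) into BSON bytes."""
    body = b""
    for name, value in doc.items():
        cname = name.encode("utf-8") + b"\x00"  # element names are C strings
        if isinstance(value, bool):
            raise TypeError("booleans are not handled by this sketch")
        if isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"
            # type 0x02: int32 byte length (including the NUL), then the bytes
            body += b"\x02" + cname + struct.pack("<i", len(data)) + data
        elif isinstance(value, int):
            # type 0x10: little-endian signed 32-bit integer
            # (struct raises an error if the value does not fit)
            body += b"\x10" + cname + struct.pack("<i", value)
        else:
            raise TypeError(f"unsupported type: {type(value).__name__}")
    # the total length counts its own 4 bytes and the trailing NUL
    return struct.pack("<i", len(body) + 5) + body + b"\x00"
```

The length prefixes let a reader skip over elements without parsing them, which is one reason a format like this suits both storage and transfer.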
  • 166. 4) Graph databases: Graph Theory
  • 167. Neo4j. LICENSE: AGPLv3 / Commercial. LANGUAGE: Java. API/PROTOCOL: REST, Java, SPARQL. PERSISTENCE: on-disk linked list. Graph data structure: nodes, relationships, properties on both. Vertical scalability (1000s of times faster, but not distributed). Physical structure: LinkedList stored on disk. HA cluster with ZooKeeper (nodes = exact replicas). High-performance node traversal. http://docs.neo4j.org/chunked/stable/ha-how.html
  • 168. Neo4j: creating nodes and a relationship
      NeoService neo = ... // factory
      Transaction tx = neo.beginTx();
      Node n1 = neo.createNode();
      n1.setProperty("name", "John");
      n1.setProperty("age", 35);
      Node n2 = neo.createNode();
      n2.setProperty("name", "Mary");
      n2.setProperty("age", 29);
      n2.setProperty("job", "engineer");
      n1.createRelationshipTo(n2, RelTypes.KNOWS);
      tx.commit();
  • 169. Neo4j: traversing the graph
      Traverser friendsTraverser = n1.traverse(
          Traverser.Order.BREADTH_FIRST,
          StopEvaluator.END_OF_GRAPH,
          ReturnableEvaluator.ALL_BUT_START_NODE,
          RelTypes.KNOWS,
          Direction.OUTGOING
      );
      // Traverse the node space
      System.out.println("John's friends: ");
      for (Node friend : friendsTraverser) {
          System.out.printf("At depth %d => %s%n",
              friendsTraverser.currentPosition().getDepth(),
              friend.getProperty("name")
          );
      }
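The breadth-first traversal on the Neo4j slide (follow outgoing KNOWS relationships, return every node except the start node) can be sketched without any graph database. The dictionaries and the `friends_breadth_first` function below are illustrative stand-ins, not the Neo4j API; the node data reuses the John and Mary example from the slide.

```python
from collections import deque

# node id -> properties, and node id -> outgoing KNOWS relationships
nodes = {
    "n1": {"name": "John", "age": 35},
    "n2": {"name": "Mary", "age": 29, "job": "engineer"},
}
knows = {"n1": ["n2"], "n2": []}


def friends_breadth_first(start):
    """Yield (depth, node_id) for each node reachable via KNOWS, excluding start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if node != start:
            yield depth, node
        for neighbour in knows.get(node, []):
            if neighbour not in seen:  # guard against cycles
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
```

A loop such as `for depth, node_id in friends_breadth_first("n1")` can then look up `nodes[node_id]["name"]`, mirroring the "At depth ..." output of the Java example.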
  • 170. Final Considerations: Query modes; Achievements and Problems
  • 171. Query Modes: a new “SQL”? Map/Reduce
  • 172. SQL vs. Map/Reduce
      MySQL:
          SELECT
              Dim1, Dim2,
              SUM(Measure1) AS MSum,
              COUNT(*) AS RecordCount,
              AVG(Measure2) AS MAvg,
              MIN(Measure1) AS MMin,
              MAX(CASE
                  WHEN Measure2 < 100
                  THEN Measure2
              END) AS MMax
          FROM DenormAggTable
          WHERE (Filter1 IN ('A','B'))
              AND (Filter2 = 'C')
              AND (Filter3 > 123)
          GROUP BY Dim1, Dim2
          HAVING (MMin > 0)
          ORDER BY RecordCount DESC
          LIMIT 4, 8
      MongoDB:
          db.runCommand({
              mapreduce: "DenormAggCollection",
              query: {
                  filter1: { '$in': [ 'A', 'B' ] },
                  filter2: 'C',
                  filter3: { '$gt': 123 }
              },
              map: function() { emit(
                  { d1: this.Dim1, d2: this.Dim2 },
                  { msum: this.measure1, recs: 1, mmin: this.measure1,
                    mmax: this.measure2 < 100 ? this.measure2 : 0 }
              );},
              reduce: function(key, vals) {
                  var ret = { msum: 0, recs: 0, mmin: 0, mmax: 0 };
                  for(var i = 0; i < vals.length; i++) {
                      ret.msum += vals[i].msum;
                      ret.recs += vals[i].recs;
                      if(vals[i].mmin < ret.mmin) ret.mmin = vals[i].mmin;
                      if((vals[i].mmax < 100) && (vals[i].mmax > ret.mmax))
                          ret.mmax = vals[i].mmax;
                  }
                  return ret;
              },
              finalize: function(key, val) {
                  val.mavg = val.msum / val.recs;
                  return val;
              },
              out: 'result1',
              verbose: true
          });
          db.result1.
              find({ mmin: { '$gt': 0 } }).
              sort({ recs: -1 }).
              skip(4).
              limit(8);
      Notes (numbered as on the slide):
          1. Grouped dimension columns are pulled out as keys in the map function, reducing the size of the working set.
          2. Measures must be manually aggregated.
          3. Aggregates depending on record counts must wait until finalization.
          4. Measures can use procedural logic.
          5. Filters have an ORM/ActiveRecord-looking style.
          6. Aggregate filtering must be applied to the result set, not in the map/reduce.
          7. Ascending: 1; Descending: -1
      Revision 4, Created 2010-03-06. Rick Osborne, rickosborne.org
      http://rickosborne.org/download/SQL-to-MongoDB.pdf
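The map, reduce and finalize steps of the SQL vs. Map/Reduce comparison can be tried out in plain Python. The sketch below is not MongoDB code: the helper names are invented, records are ordinary dicts, and filtering, sorting and paging are left out. It follows the slide's MongoDB version in computing the average from `msum`, but it deliberately deviates in one respect: min and max start from the first value seen (using `None` as "no value yet") instead of 0.

```python
from collections import defaultdict


def map_record(rec):
    """Emit one (key, value) pair per record, like the map function."""
    key = (rec["Dim1"], rec["Dim2"])
    m2 = rec["Measure2"]
    value = {
        "msum": rec["Measure1"],
        "recs": 1,
        "mmin": rec["Measure1"],
        "mmax": m2 if m2 < 100 else None,  # procedural logic inside a measure
    }
    return key, value


def reduce_values(vals):
    """Fold partial aggregates; the output can itself be reduced again."""
    ret = {"msum": 0, "recs": 0, "mmin": None, "mmax": None}
    for v in vals:
        ret["msum"] += v["msum"]
        ret["recs"] += v["recs"]
        if ret["mmin"] is None or v["mmin"] < ret["mmin"]:
            ret["mmin"] = v["mmin"]
        if v["mmax"] is not None and (ret["mmax"] is None or v["mmax"] > ret["mmax"]):
            ret["mmax"] = v["mmax"]
    return ret


def finalize(val):
    """Averages depend on the record count, so they are computed last."""
    val["mavg"] = val["msum"] / val["recs"]
    return val


def aggregate(records):
    """Group mapped values by key, then reduce and finalize each group."""
    groups = defaultdict(list)
    for rec in records:
        key, value = map_record(rec)
        groups[key].append(value)
    return {key: finalize(reduce_values(vals)) for key, vals in groups.items()}
```

Keeping `recs` and `msum` separate until `finalize` is what makes the reduce step safe to run on partial results.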
  • 176. Data model, Relations and Consistency: a step backwards? Scalability, availability and resilience come at a cost.
  • 178. Big Data: collect, store, organise, analyse, share. We don’t always know up-front which questions we’re going to ask. Werner Vogels, CTO, Amazon - STRATA Conf 2011
  • 180. Lorenzo Alberton @lorenzoalberton Thank you! [email protected] http://www.alberton.info/talks http://joind.in/talk/view/2517 69

Editor's Notes