Big problems, Massive data

Stratified B-trees
Versioned dictionaries

•   put(k, ver, data)

•   get(k_start, k_end, ver)

•   clone(v): create a child of v that inherits the latest version of
    its keys

(Version-tree example from the slide: Monday 12:00 → v10,
Monday 16:00 → v11, Now → v12.)

    This talk: a versioned dictionary with fast updates,
    and optimal space/query/update tradeoffs
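
To make the interface concrete, here is a minimal in-memory sketch of
the three operations in Python. The class name, the integer-free string
versions, and the walk-up-the-tree lookup are illustrative assumptions,
not the data structure this talk builds; the point is only the
visibility rule: a key's value at version v comes from the nearest
ancestor-or-self version that wrote it.

```python
class VersionedDict:
    """Hedged sketch of the versioned-dictionary API, not the real structure."""

    def __init__(self):
        self.parent = {0: None}   # version tree; 0 is the root version
        self.entries = {}         # (key, version) -> data

    def put(self, k, ver, data):
        self.entries[(k, ver)] = data

    def clone(self, v):
        """Create a child of v that inherits the latest version of its keys."""
        child = max(self.parent) + 1
        self.parent[child] = v    # no data is copied
        return child

    def get(self, k_start, k_end, ver):
        """Range query: latest visible value of each key in [k_start, k_end]."""
        out = {}
        for k in sorted({k for (k, _) in self.entries if k_start <= k <= k_end}):
            v = ver
            while v is not None:  # walk up the version tree to the nearest write
                if (k, v) in self.entries:
                    out[k] = self.entries[(k, v)]
                    break
                v = self.parent[v]
        return out

d = VersionedDict()
d.put("a", 0, 1)
v1 = d.clone(0)
d.put("a", v1, 2)
print(d.get("a", "z", 0), d.get("a", "z", v1))   # {'a': 1} {'a': 2}
```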
Why?

•   Powerful: cloning, time-travel, cache and space-efficiency, ...

•   Give developers a recent branch of the live dataset

•   Expose different views of the same base dataset

(Version-tree example: Monday 12:00 → v10, Monday 16:00 → v11; now v12
and v13 are both children of v11. Run analytics/tests/etc on a clone,
without performance impact.)
State of the art: copy-on-write

Used in ZFS, WAFL, Btrfs, ... Apply path-copying [DSST] to the B-tree.

Problems:
• Space blowup: each update may rewrite an entire path
• Slow updates: as above
• Needs random IO to scale
• Concurrency is tricky

A log file system makes updates sequential, but relies on garbage
collection (its achilles heel!)
                          CoW B-tree                 This talk
                          [ZFS,WAFL,Btrfs,..]

    Update                O(log_B Nv) random IOs     O((log Nv) / B) cache-oblivious IOs
                          ~ log(2^30)/log 10000      ~ log(2^30)/10000
                          ≈ 3 IOs/update             ≈ 0.003 IOs/update
                                                     (important for flash)

    Range query (size Z)  O(Z/B) random              O(Z/B) sequential

    Space                 O(N B log_B Nv)            O(N)

Nv = #keys live (accessible) at version v
B = “block size”, say 1MB at 100 bytes/entry = 10000 entries
Complication: B is asymmetric for flash.
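
As a sanity check on the table's headline numbers, the arithmetic works
out as below (a back-of-envelope sketch using the slide's own
assumptions of N = 2^30 keys and B = 10000 entries per block; the exact
constants depend on the log base, so "3" is a round-up):

```python
import math

N, B = 2 ** 30, 10_000    # 2^30 keys; 1MB blocks at ~100 bytes/entry
cow = math.log(N, B)      # CoW B-tree: O(log_B Nv) random IOs per update
new = math.log2(N) / B    # this talk: O((log Nv) / B) sequential IOs per update
print(f"CoW B-tree: ~{cow:.1f} random IOs/update")       # ~2.3, i.e. roughly 3
print(f"This talk:  ~{new:.3f} sequential IOs/update")   # 0.003
```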
Unversioned Case
[Doubling Array]
Doubling Array: inserts

Buffer arrays in memory until we have > B of them.

Worked sequence from the slides (each insert creates a singleton array;
two arrays of equal size are merged into one sorted array of twice the
size):

    insert 2   →  [2]
    insert 9   →  [2] [9]         →  merge  →  [2 9]
    insert 11  →  [2 9] [11]
    insert 8   →  [2 9] [11] [8]  →  merge  →  [2 9] [8 11]
               →  merge           →  [2 8 9 11]
    etc...

Similar to log-structured merge trees (LSM), cache-oblivious lookahead
array (COLA), ...

O(log N) “levels”; each element is rewritten once per level
  ⇒ O((log N) / B) amortized IOs per insert.
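
The merge cascade above fits in a few lines. This is a hedged,
in-memory sketch of an unversioned doubling array (class and method
names are mine; real arrays live on disk, with the in-memory buffering
the slide mentions), but the level structure and the equal-size merges
are exactly the mechanism the slides animate:

```python
from heapq import merge

class DoublingArray:
    """Sketch: level i holds at most one sorted array of size 2^i."""

    def __init__(self):
        self.levels = []   # levels[i] is None or a sorted list of length 2^i

    def insert(self, key):
        carry = [key]
        for i in range(len(self.levels)):
            if self.levels[i] is None:
                self.levels[i] = carry
                return
            # two arrays of size 2^i merge into one of size 2^(i+1)
            carry = list(merge(self.levels[i], carry))
            self.levels[i] = None
        self.levels.append(carry)

    def lookup(self, key):
        # Search every non-empty level independently (smallest first).
        for arr in self.levels:
            if arr and key in arr:   # binary search in a real implementation
                return True
        return False

da = DoublingArray()
for k in (2, 9, 11, 8):
    da.insert(k)
print(da.levels)   # [None, None, [2, 8, 9, 11]] -- the slide's end state
```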
Doubling Array: queries

• Add an index to each array to do lookups
• query(k) searches each array independently
• Bloom filters can help exclude arrays from a point search
  (sketch below)
• ... but don’t help with range queries
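
For illustration, a per-array Bloom filter might look like the
following sketch (pure Python; the bit count and hash count are
arbitrary choices, not tuned numbers from the talk). A point lookup
asks the filter before binary-searching an array and skips it on a
negative answer; range queries must still visit every array:

```python
import hashlib

class Bloom:
    """Tiny Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, nbits=1 << 16, nhashes=4):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, key):
        for i in range(self.nhashes):
            h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

f = Bloom()
f.add("k42")
print(f.might_contain("k42"), f.might_contain("k43"))  # True, (almost surely) False
```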
Fractional Cascading

• Fractional cascading: use information from the search at level l to
  help the search at level l+1

• From each array, sample every 4th element and put a pointer to it in
  the previous level

• These ‘forward pointers’ give bounds for the search in the next
  array: once an entry is found at level l, the bracketing forward
  pointers delimit where the key can be at level l+1, so each
  subsequent level costs O(1) work instead of a full binary search

• In case you get unlucky with the sampling...

• ... add regular ‘secondary’ pointers to the nearest forward pointer
  above and below
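
A hedged sketch of the sampling trick over two adjacent levels. In the
real structure the forward pointers are interleaved into the arrays
themselves; here they are computed on the fly just to show how the
bracketing samples narrow the search window in the next array:

```python
import bisect

SAMPLE = 4  # sample every 4th element, as on the slide

def forward_pointers(lower):
    """(key, index-in-lower) for every 4th element of the next level."""
    return [(lower[i], i) for i in range(0, len(lower), SAMPLE)]

def cascaded_search(lower, key):
    samples = forward_pointers(lower)    # stored in the previous level in practice
    keys = [k for k, _ in samples]
    j = bisect.bisect_right(keys, key)   # bracketing forward pointers
    lo = samples[j - 1][1] if j > 0 else 0
    hi = samples[j][1] + 1 if j < len(samples) else len(lower)
    # window of at most SAMPLE+1 slots, instead of the whole array
    i = bisect.bisect_left(lower, key, lo, hi)
    return i < len(lower) and lower[i] == key

arr = list(range(0, 40, 2))              # 0, 2, ..., 38
print(cascaded_search(arr, 18), cascaded_search(arr, 19))  # True False
```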
Versioned case (sketch)
Adding versions

    version 1:  k1 k2 k3 k4 k5 k6 k7 k8 k9 k10 k11 k12 k13
    version 2:  clones v1, then overwrites k6

If the layout is good for v1 ... then it’s bad for v2: v2’s copy of k6
is appended far from the rest of v2’s inherited keys, so a range query
at v2 pays an extra seek.

If you try to keep all versions of a key close...

    k1 k2 k3 k4 k5 k6 k6 k6 k6 k6 ... k7 k8 k9 k10 k11 k12 k13
                  (versions 2, 3, 4, ...)

... then it’s bad for all versions: every range query, at every
version, must scan past every version of every key.
Density

(Diagram: a version tree -- v0 is the root, with children v4 and v1; v4
has child v5; v1 has children v2 and v3. An array A tagged with
W = {v1,v2,v3} is laid out on disk as
k0,v0,x  k1,v0,x  k1,v2,x  k2,v1,x  k2,v2,x  k2,v3,x  k3,v1,x  k3,v2,x.
Here live(v1) = 4, live(v2) = 4, live(v3) = 4, so density = 4/8.)

•    Arrays are tagged with a version set W

•    f(A,v) = (#elements in A live at version v) / |A|

•    density(A,W) = min{w in W} f(A,w)

•    We say the array (A,W) is dense if density ≥ 1/5

•    Tradeoff: high density means good range queries, but many duplicates
     (imagine density 1 and density 1/N)
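
The definitions translate directly into code. Below is a hedged sketch
that recomputes the slide's example; the version-tree parent links are
read off the diagram, and `nearest_write` encodes the liveness rule
that an entry (k, u) is live at version v when u is the closest
ancestor-or-self of v that writes k:

```python
def nearest_write(key, v, parent, entries):
    """Walk up from v to the version whose write of `key` is visible at v."""
    while v is not None:
        if (key, v) in entries:
            return v
        v = parent[v]
    return None

def live_count(A, v, parent):
    entries = set(A)
    return sum(1 for (k, u) in A if nearest_write(k, v, parent, entries) == u)

def density(A, W, parent):
    return min(live_count(A, w, parent) for w in W) / len(A)

# The slide's example: parent links assumed from the diagram.
parent = {"v0": None, "v4": "v0", "v5": "v4", "v1": "v0", "v2": "v1", "v3": "v1"}
A = [("k0", "v0"), ("k1", "v0"), ("k1", "v2"), ("k2", "v1"),
     ("k2", "v2"), ("k2", "v3"), ("k3", "v1"), ("k3", "v2")]
print(density(A, ["v1", "v2", "v3"], parent))   # 0.5, i.e. 4/8
```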
Range queries

•   imagine scanning over each accessible array for (k,*)

•   density ⇒ the bound is trivially true for large (‘voluminous’)
    range queries

•   for point queries:
    •   amortize over all k for a fixed version v
    •   each query examines disjoint regions of the array
    •   density implies total size examined = O(Nv log Nv)

From the paper: large range queries achieve the optimal bound of
O(log Nv + Z/B). For much smaller range queries, the worst-case
performance may be the same as for a point query. We now prove the
amortized bound, which applies to smaller queries.

  Theorem 2. A range query at version v costs O(log Nv + Z/B)
amortized I/Os.

  Proof. We first consider just point queries, and amortize the cost
of lookup(k, v) over all keys live at v. Let l(k, v) be the cost of
lookup(k, v); then the amortized cost is given by Σ_k l(k, v) / Nv.
For an array Ai, let l(k, v, Ai) be the number of I/Os used in
examining elements in Ai for lookup(k, v). The idea is ...
Don’t worry, stay dense!

•   Version sets disjoint at each level -- lookups examine one array/level

•   merge arrays with intersecting version sets

•   the result of a merge might not be dense

•   Answer: density amplification! (a merge sketch follows; the worked
    split example comes after that)

The per-level pipeline: promote → merge → density amplification → demote.
Example from the slide: arrays tagged {1,2}, {1,3} and {4} arrive at a
level; {1,2} and {1,3} intersect, so they merge into {1,2,3}, which
density amplification splits into dense arrays {2,3} and {1}; {4} has a
disjoint version set and passes through unchanged.
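
As a small illustration of the "merge arrays with intersecting version
sets" step, here is a hedged union-find sketch (representation assumed;
density amplification of any non-dense result is a separate step,
worked through in the example below):

```python
def merge_groups(tagged_arrays):
    """Group (sorted_entries, version_set) pairs whose version sets intersect,
    then merge each group into one array with the union of its version sets."""
    parent = list(range(len(tagged_arrays)))

    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (_, Wi) in enumerate(tagged_arrays):
        for j in range(i):
            if Wi & tagged_arrays[j][1]:
                parent[find(i)] = find(j)

    groups = {}
    for i, (A, W) in enumerate(tagged_arrays):
        g = groups.setdefault(find(i), ([], set()))
        g[0].extend(A)
        g[1].update(W)
    return [(sorted(A), W) for A, W in groups.values()]

print(merge_groups([([1, 5], {1, 2}), ([2, 7], {1, 3}), ([4], {4})]))
# -> [([1, 2, 5, 7], {1, 2, 3}), ([4], {4})]
```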
“density amplification”

Worked example (figure): the merged array covers versions
{v0, v4, v5, v1, v2, v3} over keys k0..k3, with live(v0) = 2, so its
density is 2/11 -- not dense. Version-splitting it gives:

    split 1 = (A1, {v0, v5}):
        live(v0) = 2, live(v5) = 4; size 4, density = 2/4

    split 2 = (A2, {v4, v1, v2, v3}):
        live(v4) = 2, live(v1) = live(v2) = live(v3) = 3;
        size 7, density = 2/7

Both splits are dense (density ≥ 1/5), so both can stay at the current
level.
From the paper (overlaid on the figure above):

If (A, V) also satisfies (L-live) then every split of it does (since
all live elements are included), and likewise for (L-edge). It follows
that version splitting (A, V) -- which necessarily has no promotable
versions -- results in a set of arrays all of which satisfy all of the
L-* conditions necessary to stay at level l.

The main result of this process is the following.

  Lemma 3 (Promotion). The fraction of lead elements over all output
arrays after a version split is ≥ 1/39.

  Proof. First, we claim that under the same conditions as the version
split lemma, if in addition |A| < 2M and live(v) ≥ M/3 for all v, then
the number of output strata is at most 13. Consider the arrays which
obey the lead fraction constraint. Each has size at least M/3, since at
least one version is live in it, and at least half of the array is
lead, so at least M/6 lead keys. The total number of lead keys in the
array A is ≤ 2M, since the array itself is no larger than this; it
follows that there can be no more than ...
Update bound

From the paper:

On snapshot or clone of version v to a new descendant version v′, v′ is
registered for each array A which is currently registered to the parent
of v. This does not require any I/Os.

  Theorem 1. The stratified doubling array performs updates to a leaf
version v in cache-oblivious O((log Nv) / B) amortized I/Os.

  Proof. Assume we have at our disposal a memory buffer of size at
least B (recall that B is not known to the algorithm). Then each array
that is involved in a disk merge has size at least B, so a merge of
some number of arrays of total size k elements costs O(k/B) I/Os. In
the COLA [5], each element exists in exactly one array and may
participate in O(log N) merges, which immediately gives the desired
amortized bound. In the scheme described here, elements may exist in
many arrays, and elements may participate in many merges at the same
level (e.g. when an array at level l is version split and some
subarrays remain at level l after the version split). Nevertheless, we
shall prove the theorem ...
Proof idea (slide overlay):

•   Not possible to use the basic amortized method (some elements are
    in many arrays; some elements are merged many times)

•   Idea: charge the cost of merges/splits to lead elements only
    •   (k,v) appears as lead in exactly 1 array → always N total lead
        elements
    •   each lead element receives $c/B on promotion
    •   total charge for version v is O((log Nv) / B)
From the paper (continued):

9: return [split(r)]

... there is a version split of (A, V), say (Ai, Vi) for i = 1 ... n,
such that each array satisfies (L-dense) and (L-size) for level l, and
there is at most one index i for which lead(Ai) < |Ai|/2.
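
Pulling the charging argument together (a hedged summary of the slide's
bullets; c is the paper's unspecified per-merge constant):

\[
  \underbrace{\frac{c}{B}}_{\text{charge per lead element per promotion}}
  \times \underbrace{O(\log N_v)}_{\text{promotions (levels)}}
  \;=\; O\!\left(\frac{\log N_v}{B}\right)
  \text{ amortized I/Os per update.}
\]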
Does it work?
(Plot: insert rate as a function of dictionary size, 1 to 10 million
keys; inserts per second on a log scale from 100 to 1e+06. The
Stratified B-tree sustains roughly three orders of magnitude (~3 OoM)
more inserts per second than the CoW B-tree.)
(Plot: range-query read rate as a function of dictionary size, 1 to 10
million keys; reads per second on a log scale from 10000 to 1e+09. The
Stratified B-tree reads roughly one order of magnitude (~1 OoM) more
entries per second than the CoW B-tree.)
bitbucket.org/acunu
                                          www.acunu.com/download




Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and
elephant logos are trademarks of the Apache Software Foundation.



Editor's Notes

  • #12: LolCoW. If you want to do fast updates, then the CoW technique cannot help -- the CoW B-tree is built around the assumption that every update can do a lookup and update reference counts.
  • #56: The crucial notion is density. A versioned array, a version tree and its layout on disk. Versions v1, v2, v3 are tagged, so dark entries are lead entries. The entry (k0,v0,x) is written in v0, so it is not a lead entry, but it is live at v1, v2 and v3. Similarly, (k1,v0,x) is live at v1 and v3 (since it was not overwritten at v1) but not at v2. The live counts are as follows: live(v1) = 4, live(v2) = 4, live(v3) = 4; density = 4/8. In practice, the on-disk layout can be compressed by writing the key once for all the versions, and other well-known techniques.
  • #60: Example of density amplification. The merged array has density $\frac{2}{11} < \frac{1}{5}$, so it is not dense. We find a split into two parts: the first split $(A_{1},\{v_{0},v_{5}\})$ has size 4 and density $\frac{1}{2}$. The second split $(A_{2},\{v_{4},v_{1},v_{2},v_{3}\})$ has size 7 and density $\frac{2}{7}$. Both splits have size $<8$ and density $\ge \frac{1}{5}$, so they can remain at the current level. We start at the root version and greedily search for a version $v$ and some subset of its children whose split arrays can be merged into one dense array at level $l$. More precisely, letting $\mathcal{U}=\bigcup_{i} \mathcal{W'}[v_{i}]$, we search for a subset of $v$'s children $\{v_{i}\}$ such that $|\mathrm{split}(\mathcal{A'},\mathcal{U})| < 2^{l+1}$. If no such set exists at $v$, we recurse into the child $v_{i}$ maximizing $|\mathrm{split}(\mathcal{A'}, \mathcal{W'}[v_{i}])|$. It is possible to show that this always finds a dense split. Once such a set $\mathcal{U}$ is identified, the corresponding array is written out, and we recurse on the remainder $\mathrm{split}(\mathcal{A'}, \mathcal{W'} \setminus \mathcal{U})$.
  • #66: The plot shows range query performance (elements/s extracted using range queries of size 1000). The CoW B-tree is limited by random IO here ((100/s × 32KB) / (200 bytes/key) = 16384 keys/s), but the Stratified B-tree is CPU-bound (the OCaml prototype is single-threaded). Preliminary performance results from a highly-concurrent in-kernel implementation suggest that well over 500k updates/s are possible with 16 cores.