Big problems, Massive data

Stratified B-trees
Versioned dictionaries

•   put(k, ver, data)

•   get(k_start, k_end, ver)

•   clone(v): create a child of v that inherits the latest version of
    its keys

(Version-tree example from the slide: Monday 12:00 → v10,
Monday 16:00 → v11, Now → v12.)

    This talk: a versioned dictionary with fast updates,
    and optimal space/query/update tradeoffs
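
To make the interface concrete, here is a minimal in-memory sketch of
the three operations in Python. The class name, the integer-free string
versions, and the walk-up-the-tree lookup are illustrative assumptions,
not the data structure this talk builds; the point is only the
visibility rule: a key's value at version v comes from the nearest
ancestor-or-self version that wrote it.

```python
class VersionedDict:
    """Hedged sketch of the versioned-dictionary API, not the real structure."""

    def __init__(self):
        self.parent = {0: None}   # version tree; 0 is the root version
        self.entries = {}         # (key, version) -> data

    def put(self, k, ver, data):
        self.entries[(k, ver)] = data

    def clone(self, v):
        """Create a child of v that inherits the latest version of its keys."""
        child = max(self.parent) + 1
        self.parent[child] = v    # no data is copied
        return child

    def get(self, k_start, k_end, ver):
        """Range query: latest visible value of each key in [k_start, k_end]."""
        out = {}
        for k in sorted({k for (k, _) in self.entries if k_start <= k <= k_end}):
            v = ver
            while v is not None:  # walk up the version tree to the nearest write
                if (k, v) in self.entries:
                    out[k] = self.entries[(k, v)]
                    break
                v = self.parent[v]
        return out

d = VersionedDict()
d.put("a", 0, 1)
v1 = d.clone(0)
d.put("a", v1, 2)
print(d.get("a", "z", 0), d.get("a", "z", v1))   # {'a': 1} {'a': 2}
```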
Why?

•   Powerful: cloning, time-travel, cache and space-efficiency, ...

•   Give developers a recent branch of the live dataset

•   Expose different views of the same base dataset

(Version-tree example: Monday 12:00 → v10, Monday 16:00 → v11; now v12
and v13 are both children of v11. Run analytics/tests/etc on a clone,
without performance impact.)
State of the art: copy-on-write

Used in ZFS, WAFL, Btrfs, ... Apply path-copying [DSST] to the B-tree.

Problems:
• Space blowup: each update may rewrite an entire path
• Slow updates: as above
• Needs random IO to scale
• Concurrency is tricky

A log file system makes updates sequential, but relies on garbage
collection (its achilles heel!)
                          CoW B-tree                 This talk
                          [ZFS,WAFL,Btrfs,..]

    Update                O(log_B Nv) random IOs     O((log Nv) / B) cache-oblivious IOs
                          ~ log(2^30)/log 10000      ~ log(2^30)/10000
                          ≈ 3 IOs/update             ≈ 0.003 IOs/update
                                                     (important for flash)

    Range query (size Z)  O(Z/B) random              O(Z/B) sequential

    Space                 O(N B log_B Nv)            O(N)

Nv = #keys live (accessible) at version v
B = “block size”, say 1MB at 100 bytes/entry = 10000 entries
Complication: B is asymmetric for flash.
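
As a sanity check on the table's headline numbers, the arithmetic works
out as below (a back-of-envelope sketch using the slide's own
assumptions of N = 2^30 keys and B = 10000 entries per block; the exact
constants depend on the log base, so "3" is a round-up):

```python
import math

N, B = 2 ** 30, 10_000    # 2^30 keys; 1MB blocks at ~100 bytes/entry
cow = math.log(N, B)      # CoW B-tree: O(log_B Nv) random IOs per update
new = math.log2(N) / B    # this talk: O((log Nv) / B) sequential IOs per update
print(f"CoW B-tree: ~{cow:.1f} random IOs/update")       # ~2.3, i.e. roughly 3
print(f"This talk:  ~{new:.3f} sequential IOs/update")   # 0.003
```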
Unversioned Case
[Doubling Array]
Doubling Array: inserts

Buffer arrays in memory until we have > B of them.

Worked sequence from the slides (each insert creates a singleton array;
two arrays of equal size are merged into one sorted array of twice the
size):

    insert 2   →  [2]
    insert 9   →  [2] [9]         →  merge  →  [2 9]
    insert 11  →  [2 9] [11]
    insert 8   →  [2 9] [11] [8]  →  merge  →  [2 9] [8 11]
               →  merge           →  [2 8 9 11]
    etc...

Similar to log-structured merge trees (LSM), cache-oblivious lookahead
array (COLA), ...

O(log N) “levels”; each element is rewritten once per level
  ⇒ O((log N) / B) amortized IOs per insert.
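
The merge cascade above fits in a few lines. This is a hedged,
in-memory sketch of an unversioned doubling array (class and method
names are mine; real arrays live on disk, with the in-memory buffering
the slide mentions), but the level structure and the equal-size merges
are exactly the mechanism the slides animate:

```python
from heapq import merge

class DoublingArray:
    """Sketch: level i holds at most one sorted array of size 2^i."""

    def __init__(self):
        self.levels = []   # levels[i] is None or a sorted list of length 2^i

    def insert(self, key):
        carry = [key]
        for i in range(len(self.levels)):
            if self.levels[i] is None:
                self.levels[i] = carry
                return
            # two arrays of size 2^i merge into one of size 2^(i+1)
            carry = list(merge(self.levels[i], carry))
            self.levels[i] = None
        self.levels.append(carry)

    def lookup(self, key):
        # Search every non-empty level independently (smallest first).
        for arr in self.levels:
            if arr and key in arr:   # binary search in a real implementation
                return True
        return False

da = DoublingArray()
for k in (2, 9, 11, 8):
    da.insert(k)
print(da.levels)   # [None, None, [2, 8, 9, 11]] -- the slide's end state
```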
Doubling Array: queries

• Add an index to each array to do lookups
• query(k) searches each array independently
• Bloom filters can help exclude arrays from a point search
  (sketch below)
• ... but don’t help with range queries
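
For illustration, a per-array Bloom filter might look like the
following sketch (pure Python; the bit count and hash count are
arbitrary choices, not tuned numbers from the talk). A point lookup
asks the filter before binary-searching an array and skips it on a
negative answer; range queries must still visit every array:

```python
import hashlib

class Bloom:
    """Tiny Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, nbits=1 << 16, nhashes=4):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, key):
        for i in range(self.nhashes):
            h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

f = Bloom()
f.add("k42")
print(f.might_contain("k42"), f.might_contain("k43"))  # True, (almost surely) False
```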
Fractional Cascading

• Fractional cascading: use information from the search at level l to
  help the search at level l+1

• From each array, sample every 4th element and put a pointer to it in
  the previous level

• These ‘forward pointers’ give bounds for the search in the next
  array: once an entry is found at level l, the bracketing forward
  pointers delimit where the key can be at level l+1, so each
  subsequent level costs O(1) work instead of a full binary search

• In case you get unlucky with the sampling...

• ... add regular ‘secondary’ pointers to the nearest forward pointer
  above and below
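
A hedged sketch of the sampling trick over two adjacent levels. In the
real structure the forward pointers are interleaved into the arrays
themselves; here they are computed on the fly just to show how the
bracketing samples narrow the search window in the next array:

```python
import bisect

SAMPLE = 4  # sample every 4th element, as on the slide

def forward_pointers(lower):
    """(key, index-in-lower) for every 4th element of the next level."""
    return [(lower[i], i) for i in range(0, len(lower), SAMPLE)]

def cascaded_search(lower, key):
    samples = forward_pointers(lower)    # stored in the previous level in practice
    keys = [k for k, _ in samples]
    j = bisect.bisect_right(keys, key)   # bracketing forward pointers
    lo = samples[j - 1][1] if j > 0 else 0
    hi = samples[j][1] + 1 if j < len(samples) else len(lower)
    # window of at most SAMPLE+1 slots, instead of the whole array
    i = bisect.bisect_left(lower, key, lo, hi)
    return i < len(lower) and lower[i] == key

arr = list(range(0, 40, 2))              # 0, 2, ..., 38
print(cascaded_search(arr, 18), cascaded_search(arr, 19))  # True False
```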
Versioned case (sketch)
Adding versions

    version 1:  k1 k2 k3 k4 k5 k6 k7 k8 k9 k10 k11 k12 k13
    version 2:  clones v1, then overwrites k6

If the layout is good for v1 ... then it’s bad for v2: v2’s copy of k6
is appended far from the rest of v2’s inherited keys, so a range query
at v2 pays an extra seek.

If you try to keep all versions of a key close...

    k1 k2 k3 k4 k5 k6 k6 k6 k6 k6 ... k7 k8 k9 k10 k11 k12 k13
                  (versions 2, 3, 4, ...)

... then it’s bad for all versions: every range query, at every
version, must scan past every version of every key.
Density

(Diagram: a version tree -- v0 is the root, with children v4 and v1; v4
has child v5; v1 has children v2 and v3. An array A tagged with
W = {v1,v2,v3} is laid out on disk as
k0,v0,x  k1,v0,x  k1,v2,x  k2,v1,x  k2,v2,x  k2,v3,x  k3,v1,x  k3,v2,x.
Here live(v1) = 4, live(v2) = 4, live(v3) = 4, so density = 4/8.)

•    Arrays are tagged with a version set W

•    f(A,v) = (#elements in A live at version v) / |A|

•    density(A,W) = min{w in W} f(A,w)

•    We say the array (A,W) is dense if density ≥ 1/5

•    Tradeoff: high density means good range queries, but many duplicates
     (imagine density 1 and density 1/N)
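
The definitions translate directly into code. Below is a hedged sketch
that recomputes the slide's example; the version-tree parent links are
read off the diagram, and `nearest_write` encodes the liveness rule
that an entry (k, u) is live at version v when u is the closest
ancestor-or-self of v that writes k:

```python
def nearest_write(key, v, parent, entries):
    """Walk up from v to the version whose write of `key` is visible at v."""
    while v is not None:
        if (key, v) in entries:
            return v
        v = parent[v]
    return None

def live_count(A, v, parent):
    entries = set(A)
    return sum(1 for (k, u) in A if nearest_write(k, v, parent, entries) == u)

def density(A, W, parent):
    return min(live_count(A, w, parent) for w in W) / len(A)

# The slide's example: parent links assumed from the diagram.
parent = {"v0": None, "v4": "v0", "v5": "v4", "v1": "v0", "v2": "v1", "v3": "v1"}
A = [("k0", "v0"), ("k1", "v0"), ("k1", "v2"), ("k2", "v1"),
     ("k2", "v2"), ("k2", "v3"), ("k3", "v1"), ("k3", "v2")]
print(density(A, ["v1", "v2", "v3"], parent))   # 0.5, i.e. 4/8
```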
Range queries

•   imagine scanning over each accessible array for (k,*)

•   density ⇒ the bound is trivially true for large (‘voluminous’)
    range queries

•   for point queries:
    •   amortize over all k for a fixed version v
    •   each query examines disjoint regions of the array
    •   density implies total size examined = O(Nv log Nv)

From the paper: large range queries achieve the optimal bound of
O(log Nv + Z/B). For much smaller range queries, the worst-case
performance may be the same as for a point query. We now prove the
amortized bound, which applies to smaller queries.

  Theorem 2. A range query at version v costs O(log Nv + Z/B)
amortized I/Os.

  Proof. We first consider just point queries, and amortize the cost
of lookup(k, v) over all keys live at v. Let l(k, v) be the cost of
lookup(k, v); then the amortized cost is given by Σ_k l(k, v) / Nv.
For an array Ai, let l(k, v, Ai) be the number of I/Os used in
examining elements in Ai for lookup(k, v). The idea is ...
Don’t worry, stay dense!

•   Version sets disjoint at each level -- lookups examine one array/level

•   merge arrays with intersecting version sets

•   the result of a merge might not be dense

•   Answer: density amplification! (a merge sketch follows; the worked
    split example comes after that)

The per-level pipeline: promote → merge → density amplification → demote.
Example from the slide: arrays tagged {1,2}, {1,3} and {4} arrive at a
level; {1,2} and {1,3} intersect, so they merge into {1,2,3}, which
density amplification splits into dense arrays {2,3} and {1}; {4} has a
disjoint version set and passes through unchanged.
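
As a small illustration of the "merge arrays with intersecting version
sets" step, here is a hedged union-find sketch (representation assumed;
density amplification of any non-dense result is a separate step,
worked through in the example below):

```python
def merge_groups(tagged_arrays):
    """Group (sorted_entries, version_set) pairs whose version sets intersect,
    then merge each group into one array with the union of its version sets."""
    parent = list(range(len(tagged_arrays)))

    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (_, Wi) in enumerate(tagged_arrays):
        for j in range(i):
            if Wi & tagged_arrays[j][1]:
                parent[find(i)] = find(j)

    groups = {}
    for i, (A, W) in enumerate(tagged_arrays):
        g = groups.setdefault(find(i), ([], set()))
        g[0].extend(A)
        g[1].update(W)
    return [(sorted(A), W) for A, W in groups.values()]

print(merge_groups([([1, 5], {1, 2}), ([2, 7], {1, 3}), ([4], {4})]))
# -> [([1, 2, 5, 7], {1, 2, 3}), ([4], {4})]
```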
“density amplification”

Worked example (figure): the merged array covers versions
{v0, v4, v5, v1, v2, v3} over keys k0..k3, with live(v0) = 2, so its
density is 2/11 -- not dense. Version-splitting it gives:

    split 1 = (A1, {v0, v5}):
        live(v0) = 2, live(v5) = 4; size 4, density = 2/4

    split 2 = (A2, {v4, v1, v2, v3}):
        live(v4) = 2, live(v1) = live(v2) = live(v3) = 3;
        size 7, density = 2/7

Both splits are dense (density ≥ 1/5), so both can stay at the current
level.
From the paper (overlaid on the figure above):

If (A, V) also satisfies (L-live) then every split of it does (since
all live elements are included), and likewise for (L-edge). It follows
that version splitting (A, V) -- which necessarily has no promotable
versions -- results in a set of arrays all of which satisfy all of the
L-* conditions necessary to stay at level l.

The main result of this process is the following.

  Lemma 3 (Promotion). The fraction of lead elements over all output
arrays after a version split is ≥ 1/39.

  Proof. First, we claim that under the same conditions as the version
split lemma, if in addition |A| < 2M and live(v) ≥ M/3 for all v, then
the number of output strata is at most 13. Consider the arrays which
obey the lead fraction constraint. Each has size at least M/3, since at
least one version is live in it, and at least half of the array is
lead, so at least M/6 lead keys. The total number of lead keys in the
array A is ≤ 2M, since the array itself is no larger than this; it
follows that there can be no more than ...
Update bound

From the paper:

On snapshot or clone of version v to a new descendant version v′, v′ is
registered for each array A which is currently registered to the parent
of v. This does not require any I/Os.

  Theorem 1. The stratified doubling array performs updates to a leaf
version v in cache-oblivious O((log Nv) / B) amortized I/Os.

  Proof. Assume we have at our disposal a memory buffer of size at
least B (recall that B is not known to the algorithm). Then each array
that is involved in a disk merge has size at least B, so a merge of
some number of arrays of total size k elements costs O(k/B) I/Os. In
the COLA [5], each element exists in exactly one array and may
participate in O(log N) merges, which immediately gives the desired
amortized bound. In the scheme described here, elements may exist in
many arrays, and elements may participate in many merges at the same
level (e.g. when an array at level l is version split and some
subarrays remain at level l after the version split). Nevertheless, we
shall prove the theorem ...
Proof idea (slide overlay):

•   Not possible to use the basic amortized method (some elements are
    in many arrays; some elements are merged many times)

•   Idea: charge the cost of merges/splits to lead elements only
    •   (k,v) appears as lead in exactly 1 array → always N total lead
        elements
    •   each lead element receives $c/B on promotion
    •   total charge for version v is O((log Nv) / B)
From the paper (continued):

9: return [split(r)]

... there is a version split of (A, V), say (Ai, Vi) for i = 1 ... n,
such that each array satisfies (L-dense) and (L-size) for level l, and
there is at most one index i for which lead(Ai) < |Ai|/2.
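
Pulling the charging argument together (a hedged summary of the slide's
bullets; c is the paper's unspecified per-merge constant):

\[
  \underbrace{\frac{c}{B}}_{\text{charge per lead element per promotion}}
  \times \underbrace{O(\log N_v)}_{\text{promotions (levels)}}
  \;=\; O\!\left(\frac{\log N_v}{B}\right)
  \text{ amortized I/Os per update.}
\]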
Does it work?
(Plot: insert rate as a function of dictionary size, 1 to 10 million
keys; inserts per second on a log scale from 100 to 1e+06. The
Stratified B-tree sustains roughly three orders of magnitude (~3 OoM)
more inserts per second than the CoW B-tree.)
(Plot: range-query read rate as a function of dictionary size, 1 to 10
million keys; reads per second on a log scale from 10000 to 1e+09. The
Stratified B-tree reads roughly one order of magnitude (~1 OoM) more
entries per second than the CoW B-tree.)
bitbucket.org/acunu
                                          www.acunu.com/download




Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and
elephant logos are trademarks of the Apache Software Foundation.



Editor's Notes

  • #12: LolCoW. If you want to do fast updates, then the CoW technique cannot help -- the CoW B-tree is built around the assumption that every update can do a lookup and update reference counts.
  • #56: The crucial notion is density. A versioned array, a version tree and its layout on disk. Versions v1, v2, v3 are tagged, so dark entries are lead entries. The entry (k0,v0,x) is written in v0, so it is not a lead entry, but it is live at v1, v2 and v3. Similarly, (k1,v0,x) is live at v1 and v3 (since it was not overwritten at v1) but not at v2. The live counts are as follows: live(v1) = 4, live(v2) = 4, live(v3) = 4; density = 4/8. In practice, the on-disk layout can be compressed by writing the key once for all the versions, and other well-known techniques.
  • #60: Example of density amplification. The merged array has density $\frac{2}{11} < \frac{1}{5}$, so it is not dense. We find a split into two parts: the first split $(A_{1},\{v_{0},v_{5}\})$ has size 4 and density $\frac{1}{2}$. The second split $(A_{2},\{v_{4},v_{1},v_{2},v_{3}\})$ has size 7 and density $\frac{2}{7}$. Both splits have size $<8$ and density $\ge \frac{1}{5}$, so they can remain at the current level. We start at the root version and greedily search for a version $v$ and some subset of its children whose split arrays can be merged into one dense array at level $l$. More precisely, letting $\mathcal{U}=\bigcup_{i} \mathcal{W'}[v_{i}]$, we search for a subset of $v$'s children $\{v_{i}\}$ such that $|\mathrm{split}(\mathcal{A'},\mathcal{U})| < 2^{l+1}$. If no such set exists at $v$, we recurse into the child $v_{i}$ maximizing $|\mathrm{split}(\mathcal{A'}, \mathcal{W'}[v_{i}])|$. It is possible to show that this always finds a dense split. Once such a set $\mathcal{U}$ is identified, the corresponding array is written out, and we recurse on the remainder $\mathrm{split}(\mathcal{A'}, \mathcal{W'} \setminus \mathcal{U})$.
  • #66: The plot shows range query performance (elements/s extracted using range queries of size 1000). The CoW B-tree is limited by random IO here ((100/s × 32KB) / (200 bytes/key) = 16384 keys/s), but the Stratified B-tree is CPU-bound (the OCaml prototype is single-threaded). Preliminary performance results from a highly-concurrent in-kernel implementation suggest that well over 500k updates/s are possible with 16 cores.