Cassandra and materialized views

Cassandra 2.2 and 3.0
new features
DuyHai DOAN
Apache Cassandra Technical Evangelist
#VoxxedBerlin @doanduyhai

Datastax
•  Founded in April 2010
•  We contribute a lot to Apache Cassandra™
•  400+ customers (25 of the Fortune 100), 450+ employees
•  Headquarters in San Francisco Bay area
•  EU headquarters in London, offices in France and Germany
•  Datastax Enterprise = OSS Cassandra + extra features

Materialized Views (MV)
•  Why ?
•  Detailed Impl
•  Gotchas

Why Materialized Views ?
•  Relieve the pain of manual denormalization
CREATE TABLE user(
id int PRIMARY KEY,
country text,
…
);
CREATE TABLE user_by_country(
country text,
id int,
…,
PRIMARY KEY(country, id)
);

Materialized View In Action
Manually maintained table:
CREATE TABLE user_by_country (
country text,
id int,
firstname text,
lastname text,
PRIMARY KEY(country, id));
Equivalent materialized view:
CREATE MATERIALIZED VIEW user_by_country
AS SELECT country, id, firstname, lastname
FROM user
WHERE country IS NOT NULL AND id IS NOT NULL
PRIMARY KEY(country, id)

Materialized View Syntax
CREATE MATERIALIZED VIEW [IF NOT EXISTS]
keyspace_name.view_name
AS SELECT column1, column2, ...
FROM keyspace_name.table_name
WHERE column1 IS NOT NULL AND column2 IS NOT NULL ...
PRIMARY KEY(column1, column2, ...)
•  SELECT: must select all primary key columns of base table
•  WHERE: IS NOT NULL condition for now, more complex conditions in future
•  PRIMARY KEY: at least all primary key columns of base table (ordering can be different), plus at most 1 non-primary-key column from base table

Materialized View Impl
UPDATE user
SET country='FR'
WHERE id=1
① Coordinator sends the mutation to all replicas and waits for ack(s) according to the CL
On each base table replica:
② Acquire local lock on base table partition
③ Local read to fetch current values: SELECT * FROM user WHERE id=1
④ Create BatchLog with
•  DELETE FROM user_by_country WHERE country = 'old_value'
•  INSERT INTO user_by_country(country, id, …) VALUES('FR', 1, ...)
⑤ Execute async BatchLog to paired view replica with CL = ONE
⑥ Apply base table update locally: SET country='FR'
⑦ Release local lock
⑧ Return ack to coordinator
On the coordinator:
⑨ If CL ack(s) received, ack client
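The replica-side steps ② to ⑧ can be sketched as a toy model in Python. This is only an illustration under stated assumptions: `BaseReplica`, `apply_update` and the in-memory `view` set are hypothetical names, a single lock stands in for the per-partition lock, and the BatchLog is applied inline instead of asynchronously.

```python
import threading


class BaseReplica:
    """Toy model of one base-table replica maintaining a single view."""

    def __init__(self):
        self.base = {}                 # user id -> country (base table `user`)
        self.batchlog = []             # view mutations recorded before being sent
        self.lock = threading.Lock()   # assumption: one lock instead of one per partition

    def apply_update(self, view, user_id, new_country):
        with self.lock:                                   # step 2 (released at step 7)
            old_country = self.base.get(user_id)          # step 3: local read
            mutations = []                                # step 4: build the batch
            if old_country is not None and old_country != new_country:
                mutations.append(("DELETE", (old_country, user_id)))
            mutations.append(("INSERT", (new_country, user_id)))
            self.batchlog.append(mutations)
            for op, key in mutations:                     # step 5: inline here, async in the slides
                if op == "DELETE":
                    view.discard(key)
                else:
                    view.add(key)
            self.base[user_id] = new_country              # step 6: apply base update
        return "ack"                                      # step 8: ack to coordinator


view = set()                      # rows of user_by_country as (country, id) pairs
replica = BaseReplica()
replica.apply_update(view, 1, "UK")
replica.apply_update(view, 1, "FR")
print(sorted(view))               # [('FR', 1)]
```

The read in step ③ is what lets the replica know which old view row to delete, which is why it has to happen under the same lock as the base table write.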

MV Failure Cases: concurrent updates
Two concurrent updates on the same user, whose country is 'UK' (timeline t0, t1, t2):
1) UPDATE … SET country='US'
2) UPDATE … SET country='FR'
Without local lock
•  Both updates read base row (country='UK')
•  1) DELETE FROM mv WHERE country='UK' + INSERT INTO mv …(country) VALUES('US')
•  2) DELETE FROM mv WHERE country='UK' + INSERT INTO mv …(country) VALUES('FR')
•  Each update sends its async BatchLog and applies its base table update (country='US', country='FR')
•  Result: both INSERTs survive, so the view holds a 'US' row and a 'FR' row for the same user
With local lock
•  🔒 1) Read base row (country='UK'), DELETE FROM mv WHERE country='UK', INSERT INTO mv …(country) VALUES('US') 🔓
•  🔒 2) Read base row (country='US'), DELETE FROM mv WHERE country='US', INSERT INTO mv …(country) VALUES('FR') 🔓
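The interleaving above can be replayed deterministically with a small Python sketch. The helper names (`read_phase`, `write_phase`, `without_lock`, `with_lock`) are hypothetical and the tables are plain in-memory structures, so this only models the ordering problem, not Cassandra itself.

```python
def read_phase(base, user_id):
    """Read the current country from the base table."""
    return base[user_id]


def write_phase(base, view, user_id, old_country, new_country):
    """Delete the view row found by the earlier read, insert the new one, update the base."""
    view.discard((old_country, user_id))
    view.add((new_country, user_id))
    base[user_id] = new_country


def without_lock():
    base, view = {1: "UK"}, {("UK", 1)}
    old_a = read_phase(base, 1)               # update 1 reads 'UK'
    old_b = read_phase(base, 1)               # update 2 also reads 'UK'
    write_phase(base, view, 1, old_a, "US")   # deletes 'UK', inserts 'US'
    write_phase(base, view, 1, old_b, "FR")   # deletes 'UK' again, never 'US'
    return view


def with_lock():
    base, view = {1: "UK"}, {("UK", 1)}
    for new_country in ("US", "FR"):          # the lock serializes read + write
        old = read_phase(base, 1)
        write_phase(base, view, 1, old, new_country)
    return view


print(sorted(without_lock()))   # [('FR', 1), ('US', 1)]
print(sorted(with_lock()))      # [('FR', 1)]
```

Without the lock the stale read makes the second update delete a row that is already gone, which leaves the 'US' row orphaned in the view.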

MV Failure Cases: failed updates to MV
⑤ Execute async BatchLog to paired view replica with CL = ONE
✘ MV replica down: the view update fails
When the MV replica is up again: BatchLog replay
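A minimal sketch of the replay idea, assuming a hypothetical in-memory `BatchLog` and `ViewReplica` (neither is a Cassandra API): mutations are stored before being sent, and anything that could not be delivered stays pending until a later replay succeeds.

```python
from collections import deque


class ViewReplica:
    """Toy view replica that can be switched down and up."""

    def __init__(self):
        self.up = True
        self.rows = set()

    def apply(self, mutations):
        if not self.up:
            raise ConnectionError("view replica down")
        for op, key in mutations:
            if op == "DELETE":
                self.rows.discard(key)
            else:
                self.rows.add(key)


class BatchLog:
    """Keeps view mutations until the view replica has applied them."""

    def __init__(self):
        self.pending = deque()

    def send(self, replica, mutations):
        self.pending.append(mutations)    # stored first, so a failure loses nothing
        return self.replay(replica)

    def replay(self, replica):
        while self.pending:
            try:
                replica.apply(self.pending[0])
            except ConnectionError:
                return False              # still down, keep the batch for later
            self.pending.popleft()
        return True


replica = ViewReplica()
log = BatchLog()
replica.up = False
log.send(replica, [("DELETE", ("UK", 1)), ("INSERT", ("FR", 1))])   # fails, stays pending
replica.up = True
log.replay(replica)                                                  # delivered on replay
```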

Materialized View Performance
•  Write performance
•  local lock
•  local read-before-write for MV → update contention on partition (most of perf hits)
•  local batchlog for MV
•  ☞ you pay this price only once, whatever the number of MVs
•  for each base table update: mv_count x 2 (DELETE + INSERT) extra mutations
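The last bullet is simple arithmetic. A small sketch, with a hypothetical helper name, that treats the slide's figure as the number of extra view mutations per base table update:

```python
def extra_view_mutations(base_updates, mv_count):
    """Extra mutations generated for the views: one DELETE plus one INSERT per view and per update."""
    return base_updates * mv_count * 2


# 1000 base table updates with 3 materialized views on the table
print(extra_view_mutations(1000, 3))   # 6000
```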

•  Write performance vs manual denormalization
•  MV better because no client-server network traffic for read-before-write
•  MV better because less network traffic for multiple views (client-side BATCH)
•  Makes developer life easier → priceless

•  Read performance vs secondary index
•  MV better because single node read (secondary index can hit many nodes)
•  MV better because single read path (secondary index = read index + read data)

Materialized Views Consistency
•  Consistency level
•  CL honoured for base table, ONE for MV + local batchlog
•  Weaker consistency guarantees for MV than for base table.
•  Example, write at QUORUM
•  guarantee that QUORUM replicas of base table have received the write
•  guarantee that QUORUM of MV replicas will eventually receive DELETE + INSERT

Materialized Views Gotchas
•  Beware of hot spots !!!
•  MV user_by_gender 😱 (only a handful of partition key values, so all rows land in a few partitions)
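The hot spot effect can be illustrated with a rough Python sketch. Assumptions: `owner` maps a partition key to a node with an MD5 hash modulo the node count, which is only a stand-in for Cassandra's token ring, and the data set is synthetic.

```python
import hashlib
from collections import Counter


def owner(partition_key, node_count):
    """Stand-in for token ownership: hash the partition key onto one of node_count nodes."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def rows_per_node(partition_keys, node_count):
    """Count how many rows each node stores for the given partition keys."""
    return Counter(owner(key, node_count) for key in partition_keys)


# 10 000 users on a 10-node cluster
by_gender = rows_per_node(["M", "F"] * 5000, node_count=10)            # view keyed by gender
by_id = rows_per_node([str(i) for i in range(10000)], node_count=10)   # table keyed by id

print(len(by_gender), "node(s) hold the user_by_gender rows")
print(len(by_id), "node(s) hold the rows keyed by id")
```

With only two partition key values, at most two nodes can ever own the view's data, however large the cluster is.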

User Defined Functions (UDF)
•  Why ?
•  Detailed Impl
•  UDAs
•  Gotchas

Rationale
•  Push computation server-side
•  save network bandwidth (1000 nodes!)
•  simplify client-side code
•  provide standard & useful function (sum, avg …)
•  accelerate analytics use-case (pre-aggregation for Spark)

How to create an UDF ?
CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS]
[keyspace.]functionName (param1 type1, param2 type2, …)
CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT
RETURNS returnType
LANGUAGE language
AS $$
// source code here
$$;

•  [keyspace.]functionName: an UDF is keyspace-wide
•  (param1 type1, param2 type2, …): param name to refer to in the code, type = CQL3 type
•  CALLED ON NULL INPUT: always called, null-check mandatory in code
•  RETURNS NULL ON NULL INPUT: if any input is null, code block is skipped and null is returned
•  RETURNS returnType: CQL types, i.e. primitives (boolean, int, …), collections (list, set, map), tuples, UDT
•  LANGUAGE language: JVM supported languages, i.e. Java, Scala, Javascript (slow), Groovy, Jython, JRuby, Clojure (JSR 223 impl issue)
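The two null-handling modes can be mimicked in Python to make the difference concrete. This is only an analogy, not how Cassandra implements UDFs: the decorator names and the `max_of` functions are made up for the example, and `None` plays the role of CQL null.

```python
def called_on_null_input(body):
    """Always invoke the body; the body itself must check for None."""
    return body


def returns_null_on_null_input(body):
    """Skip the body and return None as soon as any argument is None."""
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None
        return body(*args)
    return wrapper


@returns_null_on_null_input
def max_of(a, b):
    return max(a, b)          # no null-check needed, the wrapper already did it


@called_on_null_input
def max_of_checked(a, b):
    if a is None:             # null-check is mandatory in this mode
        return b
    if b is None:
        return a
    return max(a, b)


print(max_of(1, None))           # None
print(max_of_checked(1, None))   # 1
```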

UDA
•  Real use-case for UDF
•  Aggregation server-side → huge network bandwidth saving
•  Provide similar behavior for Group By, Sum, Avg etc …

How to create an UDA ?
CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS]
[keyspace.]aggregateName(type1, type2, …)
SFUNC accumulatorFunction
STYPE stateType
[FINALFUNC finalFunction]
INITCOND initCond;
•  (type1, type2, …): only type, no param name
•  SFUNC: accumulator function. Signature: accumulatorFunction(stateType, type1, type2, …) RETURNS stateType
•  STYPE: state type
•  FINALFUNC: optional final function. Signature: finalFunction(stateType)
•  INITCOND: initial state value
•  UDA return type ? If finalFunction is defined, the return type of finalFunction. Else stateType
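The way SFUNC, STYPE, INITCOND and FINALFUNC fit together is a fold over the rows. A minimal Python sketch, assuming a hypothetical `run_aggregate` helper and an average aggregate whose state is a `(count, total)` pair:

```python
def run_aggregate(rows, sfunc, initcond, finalfunc=None):
    """Fold the rows into a state, then optionally transform the final state."""
    state = initcond                       # INITCOND
    for row in rows:
        state = sfunc(state, *row)         # SFUNC(state, col1, col2, ...) returns the new state
    return finalfunc(state) if finalfunc else state   # FINALFUNC is optional


def avg_state(state, value):
    """Accumulator: keep a running (count, total) pair."""
    count, total = state
    return (count + 1, total + value)


def avg_final(state):
    """Final function: turn (count, total) into the average."""
    count, total = state
    return total / count if count else None


rows = [(10,), (20,), (30,)]
print(run_aggregate(rows, avg_state, (0, 0), avg_final))   # 20.0
print(run_aggregate(rows, avg_state, (0, 0)))              # (3, 60)
```

The second call shows the return type rule: without a final function the aggregate returns the state itself.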

Gotchas
[Diagram: UDA query sent to a coordinator and three replicas, steps ① to ⑤; steps ② & ③ happen on each replica]
④ On the coordinator
•  apply accumulatorFunction
•  apply finalFunction
Why not apply the UDF/UDA on the replica nodes ?
1.  Because of eventual consistency
2.  UDF/UDA applied AFTER last-write-win logic
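A small Python sketch of why the aggregation has to run after reconciliation. The replica contents and the `reconcile` helper are invented for the illustration; each replica response maps a row key to a `(value, write_timestamp)` pair.

```python
def reconcile(replica_responses):
    """Last-write-wins merge: for each row keep the value with the highest timestamp."""
    merged = {}
    for response in replica_responses:
        for key, (value, timestamp) in response.items():
            if key not in merged or timestamp > merged[key][1]:
                merged[key] = (value, timestamp)
    return merged


# replica_a is stale: it has not yet received the newer write on row1
replica_a = {"row1": (10, 1), "row2": (5, 1)}
replica_b = {"row1": (12, 2), "row2": (5, 1)}

# Aggregating on the coordinator, after last-write-wins
correct_sum = sum(value for value, _ in reconcile([replica_a, replica_b]).values())

# Aggregating on each replica and adding the partial results counts rows twice
# and mixes stale and fresh values
naive_sum = sum(v for v, _ in replica_a.values()) + sum(v for v, _ in replica_b.values())

print(correct_sum)   # 17
print(naive_sum)     # 32
```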

Gotchas
•  UDA in Cassandra is not distributed !
•  Execute UDA on a large number of rows (10^6 for ex.)
•  single fat partition
•  multiple partitions
•  full table scan
•  → Increase client-side timeout
•  default Java driver timeout = 12 secs
•  JAVA-1033 JIRA for per-request timeout setting

Cassandra UDA or Apache Spark ?
Consistency Level | Single/Multiple Partition(s) | Recommended Approach
ONE               | Single partition             | UDA with token-aware driver because node local
ONE               | Multiple partitions          | Apache Spark because distributed reads
> ONE             | Single partition             | UDA because data-locality lost with Spark
> ONE             | Multiple partitions          | Apache Spark definitely
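The decision table can be written as a tiny lookup function. The function name and the string arguments are hypothetical; the recommendations and reasons are taken from the table as is.

```python
def recommended_approach(consistency_level, single_partition):
    """Return (approach, reason) for a given consistency level and partition scope."""
    if single_partition:
        if consistency_level == "ONE":
            return ("UDA", "token-aware driver, node local")
        return ("UDA", "data-locality lost with Spark")
    if consistency_level == "ONE":
        return ("Apache Spark", "distributed reads")
    return ("Apache Spark", "definitely")


print(recommended_approach("ONE", single_partition=True))
print(recommended_approach("QUORUM", single_partition=False))
```

Written this way it is easy to see that the partition scope decides the tool, while the consistency level only changes the reason.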


New Storage Engine
•  Data structure
•  Disk space usage

Pre 3.0 data structure
Map<byte[ ], SortedMap<byte[ ], Cell>>
CREATE TABLE sensor_data(
sensor_id uuid,
date timestamp,
sensor_type text,
sensor_value double,
PRIMARY KEY(sensor_id, date)
);

Pre 3.0 on disk layout
RowKey: de305d54-75b4-431b-adb2-eb6b9e546014
=> (column=2015-04-27 10:00:00+0100:, value=, timestamp=1430128800)
=> (column=2015-04-27 10:00:00+0100:sensor_type, value=‘Temperature’, timestamp=1430128800)
=> (column=2015-04-27 10:00:00+0100:sensor_value, value=23.48, timestamp=1430128800)
=> (column=2015-04-27 10:01:00+0100:, value=, timestamp=1430128860)
=> (column=2015-04-27 10:01:00+0100:sensor_type, value=‘Temperature’, timestamp=1430128860)
=> (column=2015-04-27 10:01:00+0100:sensor_value, value=24.08, timestamp=1430128860)
•  Clustering values are repeated for each normal column
•  Full timestamp storage

Cassandra 3.0 data structure
Map<byte[ ], SortedMap<ClusteringColumn, Row>>
CREATE TABLE sensor_data(
sensor_id uuid,
date timestamp,
sensor_type text,
sensor_value double,
PRIMARY KEY(sensor_id, date)
);

Cassandra 3.0 on disk layout
PartitionKey: de305d54-75b4-431b-adb2-eb6b9e546014
=> clusteringColumn:2015-04-27 10:00:00+0100
=> row_timestamp=1430128800
=> (column_value=‘Temperature’, delta_encoded_timestamp=+0)
=> (column_value=23.48, delta_encoded_timestamp=+0)
=> clusteringColumn:2015-04-27 10:01:00+0100
=> row_timestamp=1430128860
=> (column_value=‘Temperature’, delta_encoded_timestamp=+0)
=> (column_value=24.08, delta_encoded_timestamp=+0)
•  Delta-encoded timestamp vs row timestamp

Gains
•  No clustering value repetition
•  Column labels are stored only once in metadata
•  Delta encoding of timestamp, 8 bytes saved each time
•  Less disk space used
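Two of these gains can be made concrete with a short Python sketch. It follows the simplified picture on the slides (cell timestamps stored as deltas against the row timestamp, one clustering value per row instead of one per cell) and does not reproduce the real SSTable format; all function names are hypothetical.

```python
def delta_encode(row_timestamp, cell_timestamps):
    """Store each cell timestamp as an offset from the row timestamp."""
    return [ts - row_timestamp for ts in cell_timestamps]


def delta_decode(row_timestamp, deltas):
    """Rebuild the full cell timestamps from the row timestamp and the offsets."""
    return [row_timestamp + delta for delta in deltas]


def clustering_copies_pre_3_0(row_count, normal_columns):
    """Pre 3.0: the clustering value is repeated in the row marker and in every normal column cell."""
    return row_count * (normal_columns + 1)


def clustering_copies_3_0(row_count):
    """3.0: the clustering value is stored once per row."""
    return row_count


# Values from the sensor_data example: 2 rows, 2 normal columns (sensor_type, sensor_value)
deltas = delta_encode(1430128800, [1430128800, 1430128800])
print(deltas)                               # [0, 0]
print(clustering_copies_pre_3_0(2, 2))      # 6
print(clustering_copies_3_0(2))             # 2
```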

Benchmarks
CREATE TABLE events (
id uuid,
date timeuuid,
prop1 int,
prop2 text,
prop3 float,
PRIMARY KEY(id, date));
•  10^6 rows
•  prop2: small string

Benchmarks
CREATE TABLE largetext(
key int,
prop1 int,
prop2 text,
PRIMARY KEY(key));
•  10^6 rows
•  prop2: large string (1000)

Benchmarks
CREATE TABLE largeclustering(
key int,
clust text,
prop1 int,
prop2 set<float>,
PRIMARY KEY(key, clust));
•  10^6 rows
•  clust: medium string (100)
•  prop2: 50 items

Benchmarks
CREATE TABLE events (
id uuid,
date timeuuid,
prop1 int,
prop2 text,
prop3 float,
PRIMARY KEY(id, date))
WITH COMPACT STORAGE ;

@doanduyhai
duy_hai.doan@datastax.com
https://academy.datastax.com/
Thank You