Amazon Redshift

What is Redshift?
“Redshift is a fast, fully managed, petabyte-scale
data warehouse service”
-Amazon
With Redshift, Monetate generates all of our
analytics data for a day in about 2 hours,
a process that consumes billions of rows and yields millions.

What isn’t Redshift?
warehouse=# insert into fact_page_view values
warehouse-# ('2014-10-02', 1, '2014-10-02 18:30', 2, 3, 4);
INSERT 0 1
Time: 4600.094 ms
warehouse=# select fact_time from fact_page_view
warehouse-# where fact_date = '2014-10-02';
fact_time
---------------------
2014-10-02 18:30:00
(1 row)
Time: 618.303 ms

Who am I?
Jeff Patti
jeffpatti@gmail.com
Backend Engineer at Monetate
Monetate was in Redshift's beta in late 2012
and has been actively developing on it since.
We’re hiring - monetate.com/jobs/

Leaving Hive For Redshift
Hive
● Unusual failure modes
● Slower and pricier than Redshift, at least in our configuration
● Custom query language
○ Didn’t play nicely with our SQL libraries
Redshift
● Fully Managed
● Performant & Scalable
● Excellent integration with other AWS offerings
● PostgreSQL interface
○ command line interface
○ libraries for PostgreSQL work against Redshift

Fully Managed
● Easy to deploy
● Easy to scale out
● Software updates - handled
● Hardware failures - taken care of
● Automatic backups - baked in

Automatic Backups
● Periodically taken as delta from prior backup
● Easy to create new cluster from backup, or
overwrite existing cluster
● Queryable during recovery, after short delay
○ Preferentially recovers needed blocks to perform
commands
● This is how Monetate keeps our
development cluster in sync with production

Maintenance Window
● Required half hour window once a week for
routine maintenance, such as software
updates
● During this time the cluster is unresponsive
● You pick when it happens

Scaling Out
You: Change cluster size through AWS console
AWS:
1. Existing cluster put into read only state
2. New cluster caught up with existing cluster
3. Swapped during maintenance window,
unless specified as immediate
a. Immediate swap causes temporary unavailability
during the canonical name record (CNAME) swap (a few minutes)

Monetate
● Core products are merchandising, web &
email personalization, testing
● A/B & Multivariate testing to determine
impact of experiments
● Involved with >20% of US ecommerce spend
each holiday season for the past 3 years
running

Monetate Data Collection
To compute analytics and reports on our clients'
experiments, we collect a lot of data
● Billions of page views a week
● Billions of experiment views a week
● Millions of purchases a week
● etc.
This is where Redshift comes in handy

Redshift In Monetate
[Architecture diagram: multiple App instances (Monetate is multi-region and multi-AZ in AWS), Amazon S3, Amazon Redshift, and our clients; labeled "Data Warehousing" and "Analytics & Reporting"]

Under The Covers
● Fork of PostgreSQL 8.0.2, so we get nice things like
○ Common Table Expressions
○ Window Functions
● Column oriented database
● Clusters can have many machines
○ Each machine has many slices
○ Queries run in parallel on all slices
● Concurrent query support & memory limiting

Example Redshift Table
CREATE TABLE fact_url (
    fact_date   DATE NOT NULL ENCODE lzo,
    account_id  INT NOT NULL ENCODE lzo,
    fact_time   TIMESTAMP NOT NULL ENCODE lzo,
    mid         BIGINT NOT NULL ENCODE lzo,
    uri         VARCHAR(2048) ENCODE lzo,
    referer_uri VARCHAR(2048) ENCODE lzo,
    PRIMARY KEY (account_id, fact_time, mid)
)
DISTKEY (mid)
SORTKEY (fact_date, account_id, fact_time, mid);

Per Column Compression
● Used to fit more rows in each 1MB block
● Trade off between CPU and IO
● Allows Redshift to read rows from disk faster
● Has to use more CPU to decompress data
● Our Redshift queries are IO bound
○ We use compression extensively
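
A back-of-the-envelope sketch of the trade-off above, in Python. Only the 1MB block size comes from the slide; the row count, value width and 4x compression ratio are made-up illustrative numbers, not measurements from any real cluster.

```python
import math

BLOCK_BYTES = 1024 * 1024  # column data is stored in 1MB blocks


def blocks_needed(num_values: int, bytes_per_value: int) -> int:
    """Number of 1MB blocks needed to hold one column's values."""
    values_per_block = BLOCK_BYTES // bytes_per_value
    return math.ceil(num_values / values_per_block)


# Hypothetical column: 100 million 8-byte values (e.g. a BIGINT id).
raw_blocks = blocks_needed(100_000_000, 8)
# Assumption: compression shrinks each value to 2 bytes on average.
compressed_blocks = blocks_needed(100_000_000, 2)

print(raw_blocks, compressed_blocks)
```

Under these assumed numbers a full scan of the column reads about a quarter as many blocks, paid for with CPU time spent decompressing, which is a good trade when queries are IO bound.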

Constraints
“Uniqueness, primary key, and foreign key
constraints are informational only; they are not
enforced by Amazon Redshift.”
However, “If your application allows invalid
foreign keys or primary keys, some queries
could return incorrect results.” [emphasis added]

Distribution Style
Controls how Redshift distributes rows
● Styles
○ Even - round robin rows (default)
○ Key - data with the same key goes to same slice
■ Based on a single column from the table
○ All - data is copied to all slices
■ Good for small tables
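
The three styles can be sketched with a toy Python model. This is only an illustration of the idea as the slide describes it: `distribute` is a hypothetical helper, and `crc32` stands in for whatever hash Redshift really uses.

```python
from collections import defaultdict
from zlib import crc32


def distribute(rows, num_slices, style, key=None):
    """Toy model: return {slice_id: [rows]} for a distribution style."""
    slices = defaultdict(list)
    for i, row in enumerate(rows):
        if style == "even":
            # Round robin: row i goes to slice i mod N.
            slices[i % num_slices].append(row)
        elif style == "key":
            # The same key value always hashes to the same slice.
            # crc32 is a stand-in, not Redshift's real hash function.
            slices[crc32(str(row[key]).encode()) % num_slices].append(row)
        elif style == "all":
            # Every slice gets a full copy of the table.
            for s in range(num_slices):
                slices[s].append(row)
        else:
            raise ValueError(f"unknown style: {style}")
    return dict(slices)


# Hypothetical rows keyed by "mid", as in the fact_url example table.
rows = [{"mid": m, "uri": f"/page/{n}"} for n, m in enumerate([7, 7, 9, 7, 9, 3])]
even = distribute(rows, num_slices=4, style="even")
by_key = distribute(rows, num_slices=4, style="key", key="mid")
everywhere = distribute(rows, num_slices=4, style="all")
```

With `style="key"`, every row sharing a `mid` lands on one slice, which is what lets joins on that column avoid moving data.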

DISTKEY impacts Joins
DS_DIST_NONE
No redistribution is required, because
corresponding slices are collocated on the
compute nodes. You will typically have only one
DS_DIST_NONE step, the join between the fact
table and one dimension table.
DS_DIST_ALL_NONE
No redistribution is required, because the inner
join table used DISTSTYLE ALL. The entire
table is located on every node.
DS_DIST_NONE and DS_DIST_ALL_NONE are very performant
DS_DIST_INNER
The inner table is redistributed.
DS_BCAST_INNER
A copy of the entire inner table is broadcast to all
the compute nodes.
DS_DIST_ALL_INNER
The entire inner table is redistributed to a single
slice because the outer table uses DISTSTYLE
ALL.
DS_DIST_BOTH
Both tables are redistributed.
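
Why the no-redistribution case is cheap can be sketched in Python. This is a toy model under assumptions: both hypothetical tables are distributed on the join column `mid`, `crc32` stands in for the real hash, and the join is a plain nested loop rather than anything Redshift actually runs.

```python
from zlib import crc32

NUM_SLICES = 4


def slice_for(key_value) -> int:
    # Stand-in hash; Redshift's actual hash function is not modeled here.
    return crc32(str(key_value).encode()) % NUM_SLICES


def place(rows, key):
    """Distribute rows to slices by hashing the given key column."""
    slices = [[] for _ in range(NUM_SLICES)]
    for row in rows:
        slices[slice_for(row[key])].append(row)
    return slices


def join(left, right, key):
    """Plain nested-loop equi-join on one column."""
    return [(a, b) for a in left for b in right if a[key] == b[key]]


# Hypothetical fact and dimension rows, both distributed on "mid".
views = [{"mid": m, "uri": f"/p/{i}"} for i, m in enumerate([1, 2, 3, 1, 2, 5])]
visitors = [{"mid": m, "segment": f"s{m}"} for m in [1, 2, 3, 4]]

view_slices = place(views, "mid")
visitor_slices = place(visitors, "mid")

# Each slice joins only its own rows; nothing moves between slices.
local_result = [pair
                for s in range(NUM_SLICES)
                for pair in join(view_slices[s], visitor_slices[s], "mid")]

# Reference answer: join the full tables in one place.
global_result = join(views, visitors, "mid")
```

The slice-local joins produce the same pairs as the global join because matching keys were hashed to the same slice on both sides. When the tables are not distributed on the join column, rows have to be moved first, which is what the redistribute and broadcast steps represent.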

Query Plan From Explain
-> XN Hash Join DS_DIST_ALL_NONE (cost=112.50..14142.59 rows=170771 width=84)
     Hash Cond: ("outer".venueid = "inner".venueid)
     -> XN Hash Join DS_DIST_ALL_NONE (cost=109.98..10276.71 rows=172456 width=47)
          Hash Cond: ("outer".eventid = "inner".eventid)
          -> XN Merge Join DS_DIST_NONE (cost=0.00..6286.47 rows=172456 width=30)
               Merge Cond: ("outer".listid = "inner".listid)
               -> XN Seq Scan on listing (cost=0.00..1924.97 rows=192497 width=14)
               -> XN Seq Scan on sales (cost=0.00..1724.56 rows=172456 width=24)

Sort Key
● Data is stored on disk in sorted order
○ After being inserted into an empty table, or vacuumed
● Sort Key impacts vacuum performance
● Columnar data stored in 1MB blocks
○ min/max data stored as metadata
● Metadata used to improve query performance
○ Allows Redshift to skip unnecessary blocks
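
A minimal Python sketch of how per-block min/max metadata lets a scan skip blocks. The blocks here hold 100 values each as a tiny stand-in for 1MB blocks, and `build_blocks` and `scan` are hypothetical helpers, not Redshift internals.

```python
def build_blocks(sorted_values, values_per_block):
    """Split a sorted column into blocks and record min/max per block."""
    blocks = []
    for i in range(0, len(sorted_values), values_per_block):
        chunk = sorted_values[i:i + values_per_block]
        blocks.append({"min": chunk[0], "max": chunk[-1], "values": chunk})
    return blocks


def scan(blocks, lo, hi):
    """Return matching values and how many blocks had to be read."""
    matches, blocks_read = [], 0
    for block in blocks:
        # Skip blocks whose [min, max] range cannot contain a match.
        if block["max"] < lo or block["min"] > hi:
            continue
        blocks_read += 1
        matches.extend(v for v in block["values"] if lo <= v <= hi)
    return matches, blocks_read


# 1,000 sorted values in blocks of 100.
blocks = build_blocks(list(range(1000)), values_per_block=100)
matches, blocks_read = scan(blocks, lo=250, hi=349)
print(len(matches), blocks_read, len(blocks))
```

The range query reads 2 of the 10 blocks. The skipping only works this well because the data is sorted; on unsorted data most blocks would have wide min/max ranges and few could be skipped.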

Sort Key Take 1
SORTKEY (account_id, fact_time, mid)
● As we added new facts, bad things started happening
[Diagram: table laid out as account 1 (time ordered), account 2 (time ordered), ... account n (time ordered), with a region of "new facts for all accounts" added alongside]
● Re-sorting rows for vacuuming had to reorder almost all the rows :(
● This made vacuuming unreasonably slow, affecting how often we could
vacuum and therefore query performance

Sort Key Take 2
SORTKEY (fact_time, account_id, mid)
● Now our table was like an append-only log, but query performance was poor
[Diagram: table laid out as 00:00 (account ordered), 00:01 (account ordered), ... Now (account ordered)]
● For many of our queries, we only look at one account at a time
● Redshift blocks are 1MB each, and each block spanned many accounts
● When querying a single account, Redshift had to read from disk and ignore
many rows from other accounts

Sort Key Take 3
SORTKEY (fact_date, account_id, fact_time, mid)
[Diagram: table laid out as Jan 1st (account ordered), Jan 2nd (account ordered), ... Today (account ordered)]
● Append only log ✓
○ Cheap vacuuming ✓
● Single or few accounts per block ✓
○ Significantly improved query performance ✓
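
The difference between Take 2 and Take 3 for a single-account query can be sketched with a toy Python model. The data (2 days, 60 minutes, 10 accounts) and the 20-row blocks are invented for illustration, and whole rows are grouped into blocks here, whereas Redshift stores each column in its own blocks.

```python
def blocks_touched(rows, sort_key, account, rows_per_block):
    """Count blocks containing at least one row for the given account."""
    ordered = sorted(rows, key=sort_key)
    touched = 0
    for i in range(0, len(ordered), rows_per_block):
        block = ordered[i:i + rows_per_block]
        if any(r["account_id"] == account for r in block):
            touched += 1
    return touched


# Hypothetical data: 2 days x 60 minutes x 10 accounts, one fact each.
rows = [{"fact_date": d, "fact_time": d * 60 + m, "account_id": a}
        for d in range(2) for m in range(60) for a in range(10)]


def take2_key(r):
    # SORTKEY (fact_time, account_id, ...)
    return (r["fact_time"], r["account_id"])


def take3_key(r):
    # SORTKEY (fact_date, account_id, fact_time, ...)
    return (r["fact_date"], r["account_id"], r["fact_time"])


take2 = blocks_touched(rows, take2_key, account=3, rows_per_block=20)
take3 = blocks_touched(rows, take3_key, account=3, rows_per_block=20)
print(take2, take3)
```

In this toy setup the Take 2 ordering puts account 3 in all 60 blocks, while the Take 3 ordering confines it to 6. Because the leading column is still the date, new facts continue to land at the end of the table, so the layout stays close to an append-only log.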

Redshift ⇔ S3
Redshift & S3 have excellent integration
● Unload from Redshift to S3 via UNLOAD
○ Each slice unloads separately to S3
○ We unload into a CSV format
● Load into Redshift from S3 via COPY
○ Applies all as inserts
○ Primary keys aren’t enforced by Redshift
■ Use staging table to detect duplicate keys
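
One way the staging-table idea might look, sketched in Python with an in-memory SQLite database as a local stand-in (this is not a Redshift connection, and SQLite's SQL dialect differs from Redshift's). The `fact_url_staging` table name and the sample rows are hypothetical, and the key columns follow the PRIMARY KEY of the example `fact_url` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # local stand-in, not Redshift
conn.executescript("""
    CREATE TABLE fact_url (account_id INT, fact_time TEXT, mid INT, uri TEXT);
    CREATE TABLE fact_url_staging (account_id INT, fact_time TEXT, mid INT, uri TEXT);
    INSERT INTO fact_url VALUES (1, '2014-10-02 18:30', 42, '/a');
    -- Staged batch: one row repeats an existing key, one is new.
    INSERT INTO fact_url_staging VALUES (1, '2014-10-02 18:30', 42, '/a');
    INSERT INTO fact_url_staging VALUES (1, '2014-10-02 18:31', 42, '/b');
""")

# Keys in the staging table that already exist in the target table.
duplicates = conn.execute("""
    SELECT s.account_id, s.fact_time, s.mid
    FROM fact_url_staging s
    JOIN fact_url f
      ON f.account_id = s.account_id
     AND f.fact_time = s.fact_time
     AND f.mid = s.mid
""").fetchall()

# Insert only the staged rows whose key is not already present.
conn.execute("""
    INSERT INTO fact_url
    SELECT s.* FROM fact_url_staging s
    WHERE NOT EXISTS (
        SELECT 1 FROM fact_url f
        WHERE f.account_id = s.account_id
          AND f.fact_time = s.fact_time
          AND f.mid = s.mid)
""")
total = conn.execute("SELECT COUNT(*) FROM fact_url").fetchone()[0]
print(duplicates, total)
```

This sketch only catches keys that collide with rows already in the target table; duplicates within the staged batch itself would need a separate check, such as a GROUP BY on the key columns with HAVING COUNT(*) > 1.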

Redshift UNLOAD
unload ('select * from venue order by venueid')
to 's3://mybucket/tickit/venue/reload_'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
manifest
delimiter ',';

Redshift UNLOAD Tip
unload ('select * from venue order by venueid')
● The query passed to unload is itself a quoted string, which wreaks havoc
with quotes inside it, such as those around dates: fact_time <= '2014-10-02'
● Instead of escaping the quotes around the datetimes, use dollar quoting
○ unload ($$ select * from venue order by venueid $$)
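
If the UNLOAD statement is assembled in application code, the dollar-quoting tip might be applied as in this Python sketch. `build_unload` is a hypothetical helper; the bucket path and credential placeholders are the ones from the slides, and the example query reuses the `fact_page_view` table shown earlier.

```python
def build_unload(query: str, s3_prefix: str, credentials: str) -> str:
    """Build an UNLOAD statement with the query wrapped in $$ ... $$."""
    if "$$" in query:
        # An embedded $$ would end the dollar-quoted body early.
        raise ValueError("query must not contain '$$'")
    return (
        f"unload ($$ {query} $$)\n"
        f"to '{s3_prefix}'\n"
        f"credentials '{credentials}'\n"
        "manifest\n"
        "delimiter ',';"
    )


sql = build_unload(
    "select * from fact_page_view where fact_time <= '2014-10-02'",
    "s3://mybucket/tickit/venue/reload_",
    "aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>",
)
print(sql)
```

The single quotes around the date pass through untouched, so the inner query can be written exactly as it would be run on its own.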

Redshift COPY
copy venue
from 's3://mybucket/tickit/venue/reload_manifest'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
manifest
delimiter ',';

Try it Yourself! For Free!!!
Amazon Redshift documentation is well written
It contains great tutorials with pricing estimates
Amazon offers a 750-hour free trial of Redshift
DW2.Large nodes
