Amazon Redshift 
Jeff Patti
What is Redshift? 
“Redshift is a fast, fully managed, petabyte-scale 
data warehouse service” 
-Amazon 
With Redshift, Monetate is able to generate all of its analytics data for a day in about 2 hours,
a process that consumes billions of rows and yields millions
What isn’t Redshift? 
warehouse=# insert into fact_page_view values 
warehouse-# ('2014-10-02', 1, '2014-10-02 18:30', 2, 3, 4); 
INSERT 0 1 
Time: 4600.094 ms 
warehouse=# select fact_time from fact_page_view 
warehouse-# where fact_date = '2014-10-02'; 
fact_time 
--------------------- 
2014-10-02 18:30:00 
(1 row) 
Time: 618.303 ms
Who am I? 
Jeff Patti 
jeffpatti@gmail.com 
Backend Engineer at Monetate 
Monetate was in the Redshift beta in late 2012
and has been actively developing on it since.
We’re hiring - monetate.com/jobs/
Leaving Hive For Redshift
Hive
● Unusual failure modes
● Slower and pricier than Redshift, at least in our configuration
● Custom query language
○ Didn’t play nicely with our SQL libraries
Redshift
● Fully Managed
● Performant & Scalable
● Excellent integration with other AWS offerings
● PostgreSQL interface
○ command line interface
○ libraries for PostgreSQL work against Redshift
Fully Managed 
● Easy to deploy 
● Easy to scale out 
● Software updates - handled 
● Hardware failures - taken care of 
● Automatic backups - baked in
Automatic Backups 
● Periodically taken as delta from prior backup 
● Easy to create new cluster from backup, or 
overwrite existing cluster 
● Queryable during recovery, after short delay 
○ Preferentially recovers needed blocks to perform 
commands 
● This is how Monetate keeps our 
development cluster in sync with production
Maintenance Window 
● Required half hour window once a week for 
routine maintenance, such as software 
updates 
● During this time the cluster is unresponsive 
● You pick when it happens
Scaling Out 
You: Change cluster size through AWS console 
AWS: 
1. Existing cluster put into read only state 
2. New cluster caught up with existing cluster 
3. Swapped during maintenance window, 
unless specified as immediate 
a. Immediate swap causes temporary unavailability
during the canonical name (CNAME) record swap (a few minutes)
Monetate 
● Core products are merchandising, web & 
email personalization, testing 
● A/B & Multivariate testing to determine 
impact of experiments 
● Involved with >20% of US ecommerce spend 
each holiday season for the past 3 years 
running
Monetate Data Collection 
To compute analytics and reports on our clients’
experiments, we collect a lot of data
● Billions of page views a week 
● Billions of experiment views a week 
● Millions of purchases a week 
● etc. 
This is where Redshift comes in handy
Redshift In Monetate
Monetate is Multi-region & Multi-AZ in AWS
[Diagram: App servers, Amazon S3, Amazon Redshift, Our Clients; labeled Data Warehousing and Analytics & Reporting]
Under The Covers 
● Fork of PostgreSQL 8.0.2, get nice things like 
○ Common Table Expressions 
○ Window Functions 
● Column oriented database 
● Clusters can have many machines 
○ Each machine has many slices 
○ Queries run in parallel on all slices 
● Concurrent query support & memory limiting
Instance Types
Query Concurrency
Example Redshift Table 
CREATE TABLE fact_url ( 
fact_date DATE NOT NULL ENCODE lzo, 
account_id INT NOT NULL ENCODE lzo, 
fact_time TIMESTAMP NOT NULL ENCODE lzo, 
mid BIGINT NOT NULL ENCODE lzo, 
uri VARCHAR(2048) ENCODE lzo, 
referer_uri VARCHAR(2048) ENCODE lzo, 
PRIMARY KEY (account_id, fact_time, mid) 
) 
DISTKEY (mid) 
SORTKEY (fact_date, account_id, fact_time, mid);
Per Column Compression 
● Used to fit more rows in each 1MB block 
● Trade off between CPU and IO 
● Allows Redshift to read rows from disk faster 
● Has to use more CPU to decompress data 
● Our Redshift queries are IO bound 
○ We use compression extensively
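The trade described above can be illustrated with a small sketch. It assumes a made-up column of account IDs and uses Python's stdlib zlib purely as a stand-in compressor (it is not Redshift's LZO encoding), so only the direction of the effect is meaningful, not the numbers.

```python
import zlib

# Hypothetical column of account IDs as it might sit in a sorted fact table:
# long runs of the same value, which is what makes column compression effective.
account_ids = [account for account in range(1, 51) for _ in range(2000)]

# Serialize the column as fixed-width 4-byte integers (a stand-in for raw storage).
raw = b"".join(value.to_bytes(4, "little") for value in account_ids)

# zlib is only a convenient stdlib compressor here; it is not Redshift's LZO.
compressed = zlib.compress(raw)

BLOCK_SIZE = 1024 * 1024  # column data is stored in 1MB blocks


def rows_per_block(total_rows, total_bytes):
    """Average number of rows that fit in one block at this byte density."""
    return total_rows * BLOCK_SIZE // total_bytes


raw_rows = rows_per_block(len(account_ids), len(raw))
compressed_rows = rows_per_block(len(account_ids), len(compressed))

print(f"raw: {len(raw)} bytes, ~{raw_rows} rows per 1MB block")
print(f"compressed: {len(compressed)} bytes, ~{compressed_rows} rows per 1MB block")
```

More rows per block means fewer blocks to read from disk for the same query, at the price of the CPU time spent in the decompression step.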
Constraints 
“Uniqueness, primary key, and foreign key 
constraints are informational only; they are not 
enforced by Amazon Redshift.” 
However, “If your application allows invalid 
foreign keys or primary keys, some queries 
could return incorrect results.” [emphasis added]
Distribution Style 
Controls how Redshift distributes rows 
● Styles 
○ Even - round robin rows (default) 
○ Key - data with the same key goes to same slice 
■ Based on a single column from the table 
○ All - a full copy of the table is stored on every node 
■ Good for small tables
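A rough way to picture the three styles is the sketch below. It is a simplified model, not Redshift's implementation: the bucket count is hypothetical, the modulo stands in for Redshift's internal hash, and the ALL case simply places a full copy in every bucket of the model.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical number of storage locations in this toy model


def distribute_even(rows, num_buckets=NUM_BUCKETS):
    """EVEN: deal rows out round robin, ignoring their contents."""
    buckets = defaultdict(list)
    for i, row in enumerate(rows):
        buckets[i % num_buckets].append(row)
    return buckets


def distribute_key(rows, key, num_buckets=NUM_BUCKETS):
    """KEY: rows with the same key value always land in the same bucket.

    The modulo is a stand-in for Redshift's internal hash function.
    """
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key] % num_buckets].append(row)
    return buckets


def distribute_all(rows, num_buckets=NUM_BUCKETS):
    """ALL: every bucket in this simplified model holds a full copy."""
    return {b: list(rows) for b in range(num_buckets)}


rows = [{"mid": mid, "account_id": mid % 3} for mid in range(12)]
even = distribute_even(rows)
by_mid = distribute_key(rows, "mid")
everywhere = distribute_all(rows)

# Show which mid values ended up together under KEY distribution.
print({b: [r["mid"] for r in part] for b, part in sorted(by_mid.items())})
```

The ALL style multiplies storage by the number of copies, which is why it is suggested only for small tables.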
DISTKEY impacts Joins 
DS_DIST_NONE 
No redistribution is required, because 
corresponding slices are collocated on the 
compute nodes. You will typically have only one 
DS_DIST_NONE step, the join between the fact 
table and one dimension table. 
DS_DIST_ALL_NONE 
No redistribution is required, because the inner 
join table used DISTSTYLE ALL. The entire 
table is located on every node. 
DS_DIST_NONE and DS_DIST_ALL_NONE are both very performant 
DS_DIST_INNER 
The inner table is redistributed. 
DS_BCAST_INNER 
A copy of the entire inner table is broadcast to all 
the compute nodes. 
DS_DIST_ALL_INNER 
The entire inner table is redistributed to a single 
slice because the outer table uses DISTSTYLE 
ALL. 
DS_DIST_BOTH 
Both tables are redistributed.
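Why the distribution key matters for joins can be sketched by counting how many rows of the inner table would have to move before the join can run locally. This is a toy model under stated assumptions: the `purchases` table, its `order_id` column, the slice count, and the modulo hash are all hypothetical.

```python
NUM_SLICES = 4  # hypothetical slice count


def slice_for(value, num_slices=NUM_SLICES):
    """Stand-in for Redshift's hash: which slice holds a given key value."""
    return value % num_slices


def rows_to_move(inner_rows, inner_distkey, join_column):
    """Count inner-table rows stored on a different slice than the one the
    join column maps to, i.e. rows a redistribution step would have to move."""
    return sum(
        1
        for row in inner_rows
        if slice_for(row[inner_distkey]) != slice_for(row[join_column])
    )


# Hypothetical inner table joined on mid to a fact table distributed on mid.
purchases = [{"mid": mid, "order_id": 1000 + 3 * mid + 1} for mid in range(100)]

# Distributed on the join column: nothing moves (the DS_DIST_NONE situation).
colocated = rows_to_move(purchases, inner_distkey="mid", join_column="mid")

# Distributed on an unrelated column: rows must be shuffled before joining.
mismatched = rows_to_move(purchases, inner_distkey="order_id", join_column="mid")

print(colocated, mismatched)
```

In this model a join on the shared distribution key needs no network traffic at all, while the mismatched layout has to move rows between slices first.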
Query Plan From Explain 
-> XN Hash Join DS_DIST_ALL_NONE (cost=112.50..14142.59 rows=170771 width=84)
   Hash Cond: ("outer".venueid = "inner".venueid)
   -> XN Hash Join DS_DIST_ALL_NONE (cost=109.98..10276.71 rows=172456 width=47)
      Hash Cond: ("outer".eventid = "inner".eventid)
      -> XN Merge Join DS_DIST_NONE (cost=0.00..6286.47 rows=172456 width=30)
         Merge Cond: ("outer".listid = "inner".listid)
         -> XN Seq Scan on listing (cost=0.00..1924.97 rows=192497 width=14)
         -> XN Seq Scan on sales (cost=0.00..1724.56 rows=172456 width=24)
Sort Key 
● Data is stored on disk in sorted order 
○ After being inserted into an empty table, or vacuumed 
● Sort Key impacts vacuum performance 
● Columnar data stored in 1MB blocks 
○ min/max data stored as metadata 
● Metadata used to improve query performance 
○ Allows Redshift to skip unnecessary blocks
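The block skipping described above can be sketched in a few lines. This is an illustrative model only: the `Block` class, the tiny block size, and the day-number encoding of `fact_date` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One column block plus the min/max metadata kept alongside it."""
    values: list
    min_value: int
    max_value: int


def build_blocks(sorted_values, rows_per_block):
    """Split an already sorted column into fixed-size blocks with min/max metadata."""
    blocks = []
    for start in range(0, len(sorted_values), rows_per_block):
        chunk = sorted_values[start:start + rows_per_block]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_range(blocks, low, high):
    """Return matching values and how many blocks actually had to be read."""
    matches, blocks_read = [], 0
    for block in blocks:
        if block.max_value < low or block.min_value > high:
            continue  # metadata proves no row can match, so skip the block
        blocks_read += 1
        matches.extend(v for v in block.values if low <= v <= high)
    return matches, blocks_read


# Hypothetical column: fact_date encoded as day numbers, 100 rows per day, sorted.
fact_dates = [day for day in range(1, 31) for _ in range(100)]
blocks = build_blocks(fact_dates, rows_per_block=250)
matches, blocks_read = scan_range(blocks, low=10, high=10)
print(f"{len(matches)} rows matched, {blocks_read} of {len(blocks)} blocks read")
```

Because the column is sorted, each value range is confined to a few adjacent blocks, so the min/max check rules out almost everything else without touching disk.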
Sort Key Take 1
SORTKEY (account_id, fact_time, mid)
● As we added new facts, bad things started happening
● Re-sorting rows during vacuum had to reorder almost all the rows :(
● This made vacuuming unreasonably slow, which limited how often we could
vacuum and therefore hurt query performance
[Diagram: table sorted as account 1 (time ordered), account 2 (time ordered), ... account n (time ordered), with new facts for all accounts appended at the end; vacuuming has to merge them back into account 1 (time ordered) ... account n (time ordered)]
Sort Key Take 2
SORTKEY (fact_time, account_id, mid)
● Now our table is like an append only log, but it had poor query performance
● For many of our queries, we only look at one account at a time
● Redshift blocks are 1MB each, and each block spanned many accounts
● When querying a single account, Redshift had to read from disk and ignore many
rows from other accounts
[Diagram: table laid out as 00:00 (account ordered), 00:01 (account ordered), ... Now (account ordered)]
Sort Key Take 3
SORTKEY (fact_date, account_id, fact_time, mid)
● Append only log ✓
○ Cheap vacuuming ✓
● Single or few accounts per block ✓
○ Significantly improved query performance ✓
[Diagram: table laid out as Jan 1st (account ordered), Jan 2nd (account ordered), ... Today (account ordered)]
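The three sort key attempts can be compared with a small simulation. Everything in it is an assumption made for illustration: the data volumes, the 40-row blocks, and the simplification that a block counts as "read" whenever it contains at least one row for the queried account and day.

```python
import itertools

DAYS, ACCOUNTS, EVENTS_PER_HOUR, ROWS_PER_BLOCK = 3, 20, 2, 40

# Hypothetical facts: (fact_date, account_id, fact_time), fact_time in hours since day 0.
facts = [
    (day, account, day * 24 + hour)
    for day, account, hour in itertools.product(range(DAYS), range(ACCOUNTS), range(24))
    for _ in range(EVENTS_PER_HOUR)
]

SORT_KEYS = {
    "take 1 (account_id, fact_time)": lambda f: (f[1], f[2]),
    "take 2 (fact_time, account_id)": lambda f: (f[2], f[1]),
    "take 3 (fact_date, account_id, fact_time)": lambda f: (f[0], f[1], f[2]),
}


def blocks_read(sorted_facts, day, account):
    """Number of blocks holding at least one row for this account on this day."""
    blocks = [
        sorted_facts[i:i + ROWS_PER_BLOCK]
        for i in range(0, len(sorted_facts), ROWS_PER_BLOCK)
    ]
    return sum(1 for b in blocks if any(f[0] == day and f[1] == account for f in b))


def is_append_only(key):
    """True if the newest day's facts sort after every existing row."""
    old = [f for f in facts if f[0] < DAYS - 1]
    new = [f for f in facts if f[0] == DAYS - 1]
    return max(map(key, old)) <= min(map(key, new))


results = {
    name: (blocks_read(sorted(facts, key=key), day=1, account=7), is_append_only(key))
    for name, key in SORT_KEYS.items()
}
for name, (blocks, append_only) in results.items():
    print(f"{name}: {blocks} blocks read, append only: {append_only}")
```

In this toy data set the first key keeps single-account reads cheap but new rows land in the middle of the sort order, the second is append only but touches a block for every hour, and the third gets both properties.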
Redshift ⇔ S3 
Redshift & S3 have excellent integration 
● Unload from Redshift to S3 via UNLOAD 
○ Each slice unloads separately to S3 
○ We unload into a CSV format 
● Load into Redshift from S3 via COPY 
○ Applies all as inserts 
○ Primary keys aren’t enforced by Redshift 
■ Use staging table to detect duplicate keys
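The staging table idea can be sketched as follows. SQLite (from Python's stdlib) stands in for Redshift so the example runs anywhere, the `fact_url_staging` table name is hypothetical, and the plain INSERT into staging takes the place of a COPY from S3. The SQL shown is a generic pattern, not necessarily the exact statements Monetate runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Redshift here
conn.executescript("""
    CREATE TABLE fact_url (account_id INT, fact_time TEXT, mid INT, uri TEXT);
    CREATE TABLE fact_url_staging (account_id INT, fact_time TEXT, mid INT, uri TEXT);
""")

conn.execute("INSERT INTO fact_url VALUES (1, '2014-10-02 18:30:00', 2, '/home')")

# In Redshift this batch would arrive via COPY into the staging table.
conn.executemany("INSERT INTO fact_url_staging VALUES (?, ?, ?, ?)", [
    (1, '2014-10-02 18:30:00', 2, '/home'),  # key already present in fact_url
    (1, '2014-10-02 18:31:00', 2, '/cart'),  # new key
])

# Report staged keys that already exist in the target table.
duplicates = conn.execute("""
    SELECT s.account_id, s.fact_time, s.mid
    FROM fact_url_staging s
    JOIN fact_url f
      ON f.account_id = s.account_id
     AND f.fact_time = s.fact_time
     AND f.mid = s.mid
""").fetchall()

# Move over only the rows whose key is not present yet.
conn.execute("""
    INSERT INTO fact_url
    SELECT s.* FROM fact_url_staging s
    WHERE NOT EXISTS (
        SELECT 1 FROM fact_url f
        WHERE f.account_id = s.account_id
          AND f.fact_time = s.fact_time
          AND f.mid = s.mid
    )
""")
row_count = conn.execute("SELECT COUNT(*) FROM fact_url").fetchone()[0]
print(duplicates, row_count)
```

Since the database will not reject a duplicate primary key, the check has to happen in the load process itself, before rows reach the real table.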
Redshift UNLOAD 
unload ('select * from venue order by venueid') 
to 's3://mybucket/tickit/venue/reload_' 
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
manifest 
delimiter ',';
Redshift UNLOAD Tip
unload ('select * from venue order by venueid')
● The query passed to unload is itself a quoted string, which wreaks havoc with
quotes inside it, such as those around dates: fact_time <= '2014-10-02'
● Instead of escaping the quotes around the date times, wrap the query in dollar quotes
○ unload ($$ select * from venue order by
venueid $$)
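When UNLOAD statements are generated from application code, the dollar-quoting tip can be wrapped in a small helper. The function below is a hypothetical sketch, not part of any Redshift client library; the bucket path and credential placeholders are the ones used on the slides.

```python
def build_unload(query, s3_prefix, credentials):
    """Wrap a query in $$ ... $$ so its own single quotes need no escaping."""
    if "$$" in query:
        # A literal $$ inside the query would end the dollar-quoted string early.
        raise ValueError("query already contains $$ and cannot be dollar-quoted as is")
    return (
        f"unload ($$ {query} $$)\n"
        f"to '{s3_prefix}'\n"
        f"credentials '{credentials}'\n"
        "manifest\n"
        "delimiter ',';"
    )


statement = build_unload(
    "select * from fact_page_view where fact_time <= '2014-10-02'",
    "s3://mybucket/tickit/venue/reload_",
    "aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>",
)
print(statement)
```

The date literal passes through untouched, so the same query text can be run directly or handed to unload.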
Redshift COPY 
copy venue 
from 's3://mybucket/tickit/venue/reload_manifest' 
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
manifest 
delimiter ',';
Try it Yourself! For Free!!! 
Amazon Redshift documentation is well written 
It contains great tutorials with pricing estimates 
Amazon offers a 750 hour free trial of Redshift
on DW2.Large nodes
Questions?
