© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Steffen Grunwald, AWS Solutions Architect
Analytics Web Day, 8 November 2018
Query your data in S3 with SQL and optimize for cost and performance
What you will learn from this session
• Benefits of raw data in Amazon Simple Storage Service (S3)
• Querying data on S3 with Amazon Athena
• Optimizing your data structure
  • Compression
  • Partitioning
  • Columnar formats
• Deriving views from raw data for frequent queries
Example Application Architecture
(Diagram components: Instance, Amazon Kinesis Streams, Amazon Kinesis Analytics, Amazon Kinesis Streams, AWS Lambda, Amazon CloudWatch, Amazon Kinesis Firehose, Amazon S3, AWS Glue, Amazon Athena, Amazon QuickSight)
Benefits of raw data in Amazon Simple Storage Service (S3)
• Highly durable and cost-effective object store
• Virtually unlimited scalability
• Pay for what you use, in GB per month
• Decouples storage from compute
• API widely supported by many consumers
• Well integrated with other AWS services
Use S3 as long-term storage to answer the still unknown questions of tomorrow.
Ingest Data with Amazon Kinesis Firehose
• Stores a stream of records as files in a bucket
• Path: <Optional Prefix> + "YYYY/MM/DD/HH" (ingestion time, UTC)
• Optionally compress (GZIP, ZIP, Snappy)
• Optionally store in a columnar format (ORC, Parquet)
• Optionally transform records with AWS Lambda
(Diagram: Amazon Kinesis Firehose, Amazon S3 bucket)
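The path layout above can be sketched in Python. This is only an illustration of the stated pattern (optional prefix followed by the UTC ingestion time as YYYY/MM/DD/HH); the function name firehose_key_prefix and the sample prefix "taxi/" are hypothetical and not part of any AWS API.

```python
from datetime import datetime, timezone


def firehose_key_prefix(ingestion_time: datetime, prefix: str = "") -> str:
    """Return '<prefix>YYYY/MM/DD/HH/' for a timezone-aware ingestion time."""
    # Convert to UTC first: the slide states that the path uses UTC.
    utc_time = ingestion_time.astimezone(timezone.utc)
    return prefix + utc_time.strftime("%Y/%m/%d/%H/")


if __name__ == "__main__":
    arrived = datetime(2018, 8, 30, 0, 13, 48, tzinfo=timezone.utc)
    print(firehose_key_prefix(arrived, prefix="taxi/"))
```

Because the path is derived from ingestion time rather than event time, late-arriving records end up under the hour in which they were delivered.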
Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL.
Query Data Directly from Amazon S3
• No loading of data
• Query data in its raw format
• No ETL required
• Stream data directly from Amazon S3
Presto SQL
• ANSI SQL compliant
• Complex joins, nested queries & window functions
• Complex data types (arrays, structs, maps)
• Partitioning of data by any key
  • Date, time, custom keys
• Presto built-in functions
Amazon Athena Supports Multiple Data Formats
• Text files, e.g., CSV, raw logs
• Apache Web Logs, TSV files
• JSON (simple, nested)
• Compressed files
• Columnar formats such as Parquet & ORC
• Apache Avro
Amazon Athena is Cost Effective
• Pay per query
• $5 per TB scanned from S3
• DDL Queries and failed queries are free
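As a rough worked example of the pay-per-scan model, the sketch below turns bytes scanned into a query price. It assumes the $5 per TB figure from the slide and, as a labeled assumption, a decimal terabyte (10^12 bytes); it ignores any rounding or minimum-charge rules. The name query_cost_usd is hypothetical.

```python
PRICE_PER_TB_USD = 5.0   # price quoted on the slide
BYTES_PER_TB = 10 ** 12  # assumption: decimal terabyte


def query_cost_usd(bytes_scanned: int) -> float:
    """Estimate the price of one query from the bytes it scans in S3."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


if __name__ == "__main__":
    # 487 GB is the total size of the example data set used in this deck.
    full_scan = 487 * 10 ** 9
    print(f"full scan: ${query_cost_usd(full_scan):.2f}")
    # Scanning a tenth of the data costs a tenth of the price.
    print(f"10% scan:  ${query_cost_usd(full_scan // 10):.2f}")
```

Since the price is linear in the bytes scanned, anything that reduces the scan (compression, partitioning, columnar formats) reduces the cost by the same ratio.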
Demo: Query files from Amazon Kinesis Firehose with Amazon Athena and AWS Glue
The Example Data
• NYC Taxi & Limousine Commission rides
• Data is generated by kinesis-taxi-stream-producer, available at [1]:
    java -jar kinesis-taxi-stream-producer.jar \
      -speedup 400 -statisticsFrequency 10000 \
      -stream nyctlc-ingestion -noWatermark \
      -region eu-central-1 -adaptTime ingestion
• ~2 GB/h of raw data, 11 days, 487 GB total
[1] https://github.com/aws-samples/flink-stream-processing-refarch
Test Setup: Ingesting Data with Different Settings
(Diagram components: Instance, Amazon Kinesis Streams, Firehose (raw), Firehose (gzip), Firehose (orc), Firehose (parquet), Amazon S3)
Maximum Amazon Kinesis Firehose buffering hints: 128 MB and 900 s.
Partitions to the Rescue
The AWS Glue crawler adds partitions based on file prefixes (directories).
Crawl Partitions with AWS Glue
(Diagram components: Log, S3, AWS Glue, Data Catalog, Athena. Labels: crawl data, create table partitions, schema lookup, query data.)
Why? Just schedule the crawler, no need to code! Deals with schema evolution.
Use Hive-style Path Format in S3
Move/copy:
  from: YYYY/MM/DD/HH/file
  to:   year=YYYY/month=MM/day=DD/hours=HH/file
Make Athena reload partitions with: MSCK REPAIR TABLE <table>
Why? The format is easy to create on write and easy to move existing files into.
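The key rewrite from the Firehose layout to the Hive-style layout can be sketched as a pure string transformation. This is a minimal illustration: the function name to_hive_style_key and the sample key are hypothetical, and the actual S3 copy and delete calls are deliberately left out.

```python
import re

# Matches '<prefix>YYYY/MM/DD/HH/<file>' as written by Firehose.
_FIREHOSE_KEY = re.compile(
    r"^(?P<prefix>.*?)(?P<year>\d{4})/(?P<month>\d{2})/"
    r"(?P<day>\d{2})/(?P<hour>\d{2})/(?P<file>[^/]+)$"
)


def to_hive_style_key(key: str) -> str:
    """Rewrite 'YYYY/MM/DD/HH/file' to 'year=YYYY/month=MM/day=DD/hours=HH/file'."""
    match = _FIREHOSE_KEY.match(key)
    if match is None:
        raise ValueError(f"not a YYYY/MM/DD/HH/file key: {key!r}")
    return (
        f"{match['prefix']}year={match['year']}/month={match['month']}/"
        f"day={match['day']}/hours={match['hour']}/{match['file']}"
    )


if __name__ == "__main__":
    print(to_hive_style_key("taxi/2018/08/30/00/part-0001.gz"))
```

With keys in this form, the partition columns (year, month, day, hours) can be derived from the path itself, which is what lets MSCK REPAIR TABLE discover them.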
Creating Partitions with AWS Lambda
(Diagram components: Log, S3, AWS Lambda, Data Catalog, Athena. Labels: new file trigger, add table partition, schema lookup, query data.)
Why? Add partitions instantly, for just the cost of AWS Lambda.
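One possible shape for such a function is sketched below: it reads the bucket and object key from an S3 event notification and builds the partition DDL for Hive-style keys. The function name add_partition_statements and the table and bucket names are hypothetical. Submitting the statement (through Athena or the Glue Data Catalog) is left out, and IF NOT EXISTS is added here on the assumption that the trigger fires once per file, so the same partition is seen repeatedly.

```python
import re
from urllib.parse import unquote_plus

# Captures the partition directory of a Hive-style key as 'location'.
_PARTITION_KEY = re.compile(
    r"^(?P<location>.*?year=(?P<year>\d{4})/month=(?P<month>\d{2})/"
    r"day=(?P<day>\d{2})/hours=(?P<hours>\d{2})/)[^/]+$"
)


def add_partition_statements(event: dict, table: str) -> list:
    """Build one ALTER TABLE statement per matching object in an S3 event."""
    statements = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        match = _PARTITION_KEY.match(key)
        if match is None:
            continue  # not a partitioned data file, ignore it
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION "
            f"(year='{match['year']}',month='{match['month']}',"
            f"day='{match['day']}',hours='{match['hours']}') "
            f"LOCATION 's3://{bucket}/{match['location']}'"
        )
    return statements
```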
Populate Partitions if paths are known
Issue Statements with Amazon Athena:
ALTER TABLE mytable
ADD PARTITION
(year='2015',month='01',day='01')
LOCATION 's3://[...]/2015/01/01/'
Why? Easy for predictable paths. Can be prepopulated.
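Because the paths are predictable, the statements can be generated ahead of time. The sketch below produces one ALTER TABLE statement per day in the same form as the statement above; the function name daily_partition_statements and the bucket name are hypothetical, and running the statements in Athena is not shown.

```python
from datetime import date, timedelta


def daily_partition_statements(table: str, base_location: str,
                               first_day: date, days: int) -> list:
    """Generate ADD PARTITION statements for `days` consecutive days."""
    statements = []
    for offset in range(days):
        day = first_day + timedelta(days=offset)
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION "
            f"(year='{day:%Y}',month='{day:%m}',day='{day:%d}') "
            f"LOCATION '{base_location}{day:%Y/%m/%d}/'"
        )
    return statements


if __name__ == "__main__":
    for statement in daily_partition_statements(
        "mytable", "s3://example-bucket/", date(2015, 1, 1), 3
    ):
        print(statement)
```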
Columnar Formats
Flat File Sample Layout
First_Name  Last_Name   Age  Gender
Tootsie     Label       34   Fem
Miriam      Le Fleming  25   Fem
Blakeley    Lisciandro  45   Fem
Ernst       Minghi      63   Mal
Brew        Jime        22   Mal
Columnar Formats Layout (Parquet & ORC)
Column      Values                                       MIN       MAX
First_Name  Tootsie, Miriam, Blakeley, Ernst, Brew       Blakeley  Tootsie
Last_Name   Label, Le Fleming, Lisciandro, Minghi, Jime  Jime      Minghi
Age         34, 25, 45, 63, 22                           22        63
Gender      Fem, Fem, Fem, Mal, Mal                      Fem       Mal
Benefit 1: Predicate Pushdown
SELECT * FROM ... WHERE Age > 30
(Figure: the columnar layout with per-column MIN/MAX statistics, as on the previous slide.)
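The idea behind predicate pushdown can be sketched as follows: each block of a column stores MIN/MAX statistics, and a block whose MAX cannot satisfy the predicate is skipped without reading its values. This is a conceptual model only, not how Parquet or ORC readers are implemented; make_chunk, select_greater_than and the second block of ages are hypothetical.

```python
def make_chunk(values):
    """Store a block of column values together with its MIN/MAX statistics."""
    return {"min": min(values), "max": max(values), "values": list(values)}


def select_greater_than(chunks, threshold):
    """Return the matching values and the number of chunks actually read."""
    matches = []
    chunks_read = 0
    for chunk in chunks:
        if chunk["max"] <= threshold:
            continue  # statistics rule this chunk out; its values are never read
        chunks_read += 1
        matches.extend(value for value in chunk["values"] if value > threshold)
    return matches, chunks_read


if __name__ == "__main__":
    age_chunks = [
        make_chunk([34, 25, 45, 63, 22]),  # Age column from the slide
        make_chunk([18, 29, 30]),          # hypothetical second block
    ]
    # WHERE Age > 30: the second block has MAX 30 and is skipped entirely.
    print(select_greater_than(age_chunks, 30))
```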
Benefit 2: Projection Pushdown / Column Pruning
SELECT First_Name FROM ... WHERE Age > 30
(Figure: the columnar layout with per-column MIN/MAX statistics, as on the previous slides.)
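Column pruning can be illustrated with a toy column store that records which columns a query touches. The sketch assumes the rows line up by position across the columns of the sample data; the ColumnStore class and the query function are hypothetical and only model the idea that the query above needs the Age and First_Name columns and nothing else.

```python
class ColumnStore:
    """Toy columnar table that remembers which columns were read."""

    def __init__(self, columns):
        self._columns = columns
        self.columns_read = []

    def read(self, name):
        self.columns_read.append(name)
        return self._columns[name]


def first_names_older_than(store, min_age):
    """SELECT First_Name FROM ... WHERE Age > min_age."""
    ages = store.read("Age")          # needed for the predicate
    names = store.read("First_Name")  # needed for the projection
    return [name for name, age in zip(names, ages) if age > min_age]


if __name__ == "__main__":
    store = ColumnStore({
        "First_Name": ["Tootsie", "Miriam", "Blakeley", "Ernst", "Brew"],
        "Last_Name": ["Label", "Le Fleming", "Lisciandro", "Minghi", "Jime"],
        "Age": [34, 25, 45, 63, 22],
        "Gender": ["Fem", "Fem", "Fem", "Mal", "Mal"],
    })
    print(first_names_older_than(store, 30))
    print(store.columns_read)  # Last_Name and Gender were never touched
```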
Benefit 3: Compression & Encoding
• RLE (& bit packing) for numbers
• Dictionary for repeated strings (+ RLE)
• Delta encoding for increasing numbers
• Delta strings (for strings with an identical prefix)
• Plain encoding for varied strings
https://github.com/apache/parquet-format/blob/master/Encodings.md
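Two of the encodings listed above can be illustrated at the value level. The sketch shows run-length encoding and delta encoding as plain Python lists; it is a conceptual model and does not reproduce the bit-level layouts defined in the Parquet specification. All function names are hypothetical.

```python
from itertools import accumulate, groupby


def rle_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def delta_encode(values):
    """Keep the first value, then store each difference to the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values as running sums of the deltas."""
    return list(accumulate(deltas))


if __name__ == "__main__":
    print(rle_encode(["Fem", "Fem", "Fem", "Mal", "Mal"]))
    increasing = [1535588028, 1535588029, 1535588031, 1535588040]
    print(delta_encode(increasing))
```

Delta encoding pays off for increasing numbers because the differences are small and need far fewer bits than the original values.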
More on Dictionary Encoding
• Builds a list of unique strings and assigns a numeric ID to each
• If the dictionary grows beyond 1 MB (configurable) or the number of distinct values is too high, the writer falls back to Plain encoding
• The data itself is then represented as numbers (the IDs) and further encoded using RLE
https://github.com/apache/parquet-format/blob/master/Encodings.md
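The dictionary step and its fallback can be sketched as below. This is a simplification under stated assumptions: the size limit is passed as the hypothetical parameter max_dictionary_bytes, only the dictionary size (not the distinct-value count) is checked, and the whole column falls back to plain values at once, which real writers need not do.

```python
def dictionary_encode(values, max_dictionary_bytes=1024 * 1024):
    """Return (encoding, data, dictionary) for a column of strings."""
    ids_by_value = {}
    ids = []
    dictionary_bytes = 0
    for value in values:
        if value not in ids_by_value:
            dictionary_bytes += len(value.encode("utf-8"))
            if dictionary_bytes > max_dictionary_bytes:
                # Too many or too large distinct values: keep the raw strings.
                return "plain", list(values), None
            ids_by_value[value] = len(ids_by_value)
        ids.append(ids_by_value[value])
    # Dicts keep insertion order, so list position equals the assigned ID.
    return "dictionary", ids, list(ids_by_value)


if __name__ == "__main__":
    print(dictionary_encode(["Fem", "Fem", "Fem", "Mal", "Mal"]))
    print(dictionary_encode(["Fem", "Mal"], max_dictionary_bytes=4))
```

The resulting ID list contains long runs for low-cardinality columns, which is why it combines well with RLE.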
Demo: Parquet/ORC with Amazon Kinesis Firehose (new!)
Analyzing a Parquet File
• parquet-tools
  • head: view data in the file
  • meta: get a metadata summary
  • dump -d -n: get detailed metadata down to page level, statistics included
Download and build parquet-tools [1], then run:
$ java -jar parquet-tools.jar meta <parquetfile>
(Screenshot annotations on the output: schema information, row count, total byte size, size in bytes, value count, encoding)
[1] https://github.com/apache/parquet-mr/
parquet-tools dump: Encoding & Statistics
total_amount:
- DOUBLE SNAPPY DO:0 FPO:4155231 SZ:329324/338501/1.03
[more]... ST:[min: -76.8, max: 1121.3, num_nulls: 0]
dropoff_datetime:
- BINARY SNAPPY DO:0 FPO:3315979 SZ:839131/5540639/6.60
[more]... ST:[no stats for this column]
The BINARY dropoff_datetime column has no statistics. For time series data, store timestamps as numbers (Unix epoch) or partition by timestamp.
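Converting a textual timestamp to a Unix epoch number before writing could look like the sketch below. It assumes timestamps in the form 2018-08-30T00:13:48.573Z (UTC with milliseconds); the actual format of the raw records is not stated in the slides, and the function name to_epoch_millis is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(timestamp: str) -> int:
    """Convert 'YYYY-MM-DDTHH:MM:SS.mmmZ' (UTC) to milliseconds since the epoch."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    parsed = parsed.replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


if __name__ == "__main__":
    print(to_epoch_millis("2018-08-30T00:13:48.573Z"))
```

Stored as a number, the column gets meaningful MIN/MAX statistics, so range filters on time can skip blocks.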
Analyzing ORC: orcfiledump
Spin up a single-node (master only) EMR cluster and use the hive command:
hive --orcfiledump file://<absolutepath>/file.orc
[…]
Column 7: count: 210141 hasNull: false min: -76.96324157714844 max: 0.0 sum: -1.5329986951126099E7
Column 8: count: 210141 hasNull: false min: 2018-08-30T00:13:48.573Z max: 2018-08-30T00:28:49.564Z sum: 5043384
[…]
ETL with AWS Glue for Frequent Queries
(Diagram components: Log, S3, AWS Glue, Data Catalog, Athena. Labels: read/write, write table partitions, schema lookup, query data.)
Demo: ETL with AWS Glue
Example Zeppelin / AWS Glue Notebook
https://gist.github.com/steffeng/5b841a99230ba8377f161f55453d49d0
I applied these simple tricks when storing data for Amazon Athena and you won't believe what happened next...
Measure. Then optimize.
There's no silver bullet.
Optimize for Cost and Performance 1/2
• Use Athena in the region of your buckets.
• Compress your data for less storage & query cost.
• Use LIMIT in queries for faster results.
• Partition your data based on data access patterns.
• Filter on partition columns in your queries.
• Add partitions by crawling or S3 triggers.
Optimize for Cost and Performance 2/2
• Columnar formats such as ORC and Parquet reduce the data scanned: faster queries at lower cost.
• Pick the format depending on data, access patterns and clients.
• Inspect and verify the resulting files.
• Create aggregates for frequent queries.
• Shorten turnaround times for Glue job development:
  • Use a provisioned development endpoint.
  • Use a small subset of your data (think KB!).
The AWS Free Tier allows you to get hands-on experience with AWS Glue and S3. Try it today!
Questions?
Ad

Recommended

Query your data in S3 with SQL and optimize for cost and performance
Query your data in S3 with SQL and optimize for cost and performance
AWS Germany
 
BigDL Deep Learning in Apache Spark - AWS re:invent 2017
BigDL Deep Learning in Apache Spark - AWS re:invent 2017
Dave Nielsen
 
Introduction to Amazon S3
Introduction to Amazon S3
Ashay Shirwadkar
 
AWS におけるエッジでの機械学習
AWS におけるエッジでの機械学習
Amazon Web Services Japan
 
Introduction to Amazon Athena
Introduction to Amazon Athena
Sungmin Kim
 
Amazon Athena Hands-On Workshop
Amazon Athena Hands-On Workshop
DoiT International
 
Los Angeles AWS Users Group - Athena Deep Dive
Los Angeles AWS Users Group - Athena Deep Dive
Kevin Epstein
 
Denver AWS Users' Group meeting - September 2017
Denver AWS Users' Group meeting - September 2017
David McDaniel
 
Construindo data lakes e analytics com AWS
Construindo data lakes e analytics com AWS
Amazon Web Services LATAM
 
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Summits
 
AWS Floor 28 - Building Data lake on AWS
AWS Floor 28 - Building Data lake on AWS
Adir Sharabi
 
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
AWS Riyadh User Group
 
Big Data answers in seconds with Amazon Athena
Big Data answers in seconds with Amazon Athena
Julien SIMON
 
2017 AWS DB Day | Amazon Athena 서비스 최신 기능 소개
2017 AWS DB Day | Amazon Athena 서비스 최신 기능 소개
Amazon Web Services Korea
 
Querying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS Athena
Yaroslav Tkachenko
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
Cobus Bernard
 
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...
Amazon Web Services LATAM
 
Amazon Athena (April 2017)
Amazon Athena (April 2017)
Julien SIMON
 
Building Data Lakes & Analytics on AWS
Building Data Lakes & Analytics on AWS
AWS Summits
 
What is Amazon Athena
What is Amazon Athena
jeetendra mandal
 
AWS 2019 Taipei Summit - Building Serverless Analytics Platform on AWS
AWS 2019 Taipei Summit - Building Serverless Analytics Platform on AWS
Steven Hsieh
 
Amazon Athena (March 2017)
Amazon Athena (March 2017)
Julien SIMON
 
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
saidbilgen
 
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
SasikumarPalanivel3
 
Athena & AWS Glue for AWS Data analytics.pptx
Athena & AWS Glue for AWS Data analytics.pptx
krnaween
 
Aws Atlanta meetup Amazon Athena
Aws Atlanta meetup Amazon Athena
Adam Book
 
Your First Data Lake on AWS_Simon Elisha
Your First Data Lake on AWS_Simon Elisha
Helen Rogers
 
Choosing the Right Database for My Workload: Purpose-Built Databases
Choosing the Right Database for My Workload: Purpose-Built Databases
AWS Germany
 
Analytics Web Day | From Theory to Practice: Big Data Stories from the Field
Analytics Web Day | From Theory to Practice: Big Data Stories from the Field
AWS Germany
 
Modern Applications Web Day | Impress Your Friends with Your First Serverless...
Modern Applications Web Day | Impress Your Friends with Your First Serverless...
AWS Germany
 

More Related Content

Similar to Analytics Web Day | Query your Data in S3 with SQL and optimize for Cost and Performance (20)

Construindo data lakes e analytics com AWS
Construindo data lakes e analytics com AWS
Amazon Web Services LATAM
 
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Summits
 
AWS Floor 28 - Building Data lake on AWS
AWS Floor 28 - Building Data lake on AWS
Adir Sharabi
 
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
AWS Riyadh User Group
 
Big Data answers in seconds with Amazon Athena
Big Data answers in seconds with Amazon Athena
Julien SIMON
 
2017 AWS DB Day | Amazon Athena 서비스 최신 기능 소개
2017 AWS DB Day | Amazon Athena 서비스 최신 기능 소개
Amazon Web Services Korea
 
Querying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS Athena
Yaroslav Tkachenko
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
Cobus Bernard
 
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...
Amazon Web Services LATAM
 
Amazon Athena (April 2017)
Amazon Athena (April 2017)
Julien SIMON
 
Building Data Lakes & Analytics on AWS
Building Data Lakes & Analytics on AWS
AWS Summits
 
What is Amazon Athena
What is Amazon Athena
jeetendra mandal
 
AWS 2019 Taipei Summit - Building Serverless Analytics Platform on AWS
AWS 2019 Taipei Summit - Building Serverless Analytics Platform on AWS
Steven Hsieh
 
Amazon Athena (March 2017)
Amazon Athena (March 2017)
Julien SIMON
 
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
saidbilgen
 
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
SasikumarPalanivel3
 
Athena & AWS Glue for AWS Data analytics.pptx
Athena & AWS Glue for AWS Data analytics.pptx
krnaween
 
Aws Atlanta meetup Amazon Athena
Aws Atlanta meetup Amazon Athena
Adam Book
 
Your First Data Lake on AWS_Simon Elisha
Your First Data Lake on AWS_Simon Elisha
Helen Rogers
 
Choosing the Right Database for My Workload: Purpose-Built Databases
Choosing the Right Database for My Workload: Purpose-Built Databases
AWS Germany
 
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Summits
 
AWS Floor 28 - Building Data lake on AWS
AWS Floor 28 - Building Data lake on AWS
Adir Sharabi
 
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
AWS Riyadh User Group
 
Big Data answers in seconds with Amazon Athena
Big Data answers in seconds with Amazon Athena
Julien SIMON
 
2017 AWS DB Day | Amazon Athena 서비스 최신 기능 소개
2017 AWS DB Day | Amazon Athena 서비스 최신 기능 소개
Amazon Web Services Korea
 
Querying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS Athena
Yaroslav Tkachenko
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
Cobus Bernard
 
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...
Amazon Web Services LATAM
 
Amazon Athena (April 2017)
Amazon Athena (April 2017)
Julien SIMON
 
Building Data Lakes & Analytics on AWS
Building Data Lakes & Analytics on AWS
AWS Summits
 
AWS 2019 Taipei Summit - Building Serverless Analytics Platform on AWS
AWS 2019 Taipei Summit - Building Serverless Analytics Platform on AWS
Steven Hsieh
 
Amazon Athena (March 2017)
Amazon Athena (March 2017)
Julien SIMON
 
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
saidbilgen
 
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
SasikumarPalanivel3
 
Athena & AWS Glue for AWS Data analytics.pptx
Athena & AWS Glue for AWS Data analytics.pptx
krnaween
 
Aws Atlanta meetup Amazon Athena
Aws Atlanta meetup Amazon Athena
Adam Book
 
Your First Data Lake on AWS_Simon Elisha
Your First Data Lake on AWS_Simon Elisha
Helen Rogers
 
Choosing the Right Database for My Workload: Purpose-Built Databases
Choosing the Right Database for My Workload: Purpose-Built Databases
AWS Germany
 

More from AWS Germany (20)

Analytics Web Day | From Theory to Practice: Big Data Stories from the Field
Analytics Web Day | From Theory to Practice: Big Data Stories from the Field
AWS Germany
 
Modern Applications Web Day | Impress Your Friends with Your First Serverless...
Modern Applications Web Day | Impress Your Friends with Your First Serverless...
AWS Germany
 
Modern Applications Web Day | Manage Your Infrastructure and Configuration on...
Modern Applications Web Day | Manage Your Infrastructure and Configuration on...
AWS Germany
 
Modern Applications Web Day | Container Workloads on AWS
Modern Applications Web Day | Container Workloads on AWS
AWS Germany
 
Modern Applications Web Day | Continuous Delivery to Amazon EKS with Spinnaker
Modern Applications Web Day | Continuous Delivery to Amazon EKS with Spinnaker
AWS Germany
 
Building Smart Home skills for Alexa
Building Smart Home skills for Alexa
AWS Germany
 
Hotel or Taxi? "Sorting hat" for travel expenses with AWS ML infrastructure
Hotel or Taxi? "Sorting hat" for travel expenses with AWS ML infrastructure
AWS Germany
 
Wild Rydes with Big Data/Kinesis focus: AWS Serverless Workshop
Wild Rydes with Big Data/Kinesis focus: AWS Serverless Workshop
AWS Germany
 
Log Analytics with AWS
Log Analytics with AWS
AWS Germany
 
Deep Dive into Concepts and Tools for Analyzing Streaming Data on AWS
Deep Dive into Concepts and Tools for Analyzing Streaming Data on AWS
AWS Germany
 
AWS Programme für Nonprofits
AWS Programme für Nonprofits
AWS Germany
 
Microservices and Data Design
Microservices and Data Design
AWS Germany
 
Serverless vs. Developers – the real crash
Serverless vs. Developers – the real crash
AWS Germany
 
Secret Management with Hashicorp’s Vault
Secret Management with Hashicorp’s Vault
AWS Germany
 
EKS Workshop
EKS Workshop
AWS Germany
 
Scale to Infinity with ECS
Scale to Infinity with ECS
AWS Germany
 
Containers on AWS - State of the Union
Containers on AWS - State of the Union
AWS Germany
 
Deploying and Scaling Your First Cloud Application with Amazon Lightsail
Deploying and Scaling Your First Cloud Application with Amazon Lightsail
AWS Germany
 
Building Personalized Data Products - From Idea to Product
Building Personalized Data Products - From Idea to Product
AWS Germany
 
Introduction to AWS Amplify and the Amplify CLI Toolchain
Introduction to AWS Amplify and the Amplify CLI Toolchain
AWS Germany
 
Analytics Web Day | From Theory to Practice: Big Data Stories from the Field
Analytics Web Day | From Theory to Practice: Big Data Stories from the Field
AWS Germany
 
Modern Applications Web Day | Impress Your Friends with Your First Serverless...
Modern Applications Web Day | Impress Your Friends with Your First Serverless...
AWS Germany
 
Modern Applications Web Day | Manage Your Infrastructure and Configuration on...
Modern Applications Web Day | Manage Your Infrastructure and Configuration on...
AWS Germany
 
Modern Applications Web Day | Container Workloads on AWS
Modern Applications Web Day | Container Workloads on AWS
AWS Germany
 
Modern Applications Web Day | Continuous Delivery to Amazon EKS with Spinnaker
Modern Applications Web Day | Continuous Delivery to Amazon EKS with Spinnaker
AWS Germany
 
Building Smart Home skills for Alexa
Building Smart Home skills for Alexa
AWS Germany
 
Hotel or Taxi? "Sorting hat" for travel expenses with AWS ML infrastructure
Hotel or Taxi? "Sorting hat" for travel expenses with AWS ML infrastructure
AWS Germany
 
Wild Rydes with Big Data/Kinesis focus: AWS Serverless Workshop
Wild Rydes with Big Data/Kinesis focus: AWS Serverless Workshop
AWS Germany
 
Log Analytics with AWS
Log Analytics with AWS
AWS Germany
 
Deep Dive into Concepts and Tools for Analyzing Streaming Data on AWS
Deep Dive into Concepts and Tools for Analyzing Streaming Data on AWS
AWS Germany
 
AWS Programme für Nonprofits
AWS Programme für Nonprofits
AWS Germany
 
Microservices and Data Design
Microservices and Data Design
AWS Germany
 
Serverless vs. Developers – the real crash
Serverless vs. Developers – the real crash
AWS Germany
 
Secret Management with Hashicorp’s Vault
Secret Management with Hashicorp’s Vault
AWS Germany
 
Scale to Infinity with ECS
Scale to Infinity with ECS
AWS Germany
 
Containers on AWS - State of the Union
Containers on AWS - State of the Union
AWS Germany
 
Deploying and Scaling Your First Cloud Application with Amazon Lightsail
Deploying and Scaling Your First Cloud Application with Amazon Lightsail
AWS Germany
 
Building Personalized Data Products - From Idea to Product
Building Personalized Data Products - From Idea to Product
AWS Germany
 
Introduction to AWS Amplify and the Amplify CLI Toolchain
Introduction to AWS Amplify and the Amplify CLI Toolchain
AWS Germany
 
Ad

Recently uploaded (20)

Tech-ASan: Two-stage check for Address Sanitizer - Yixuan Cao.pdf
Tech-ASan: Two-stage check for Address Sanitizer - Yixuan Cao.pdf
caoyixuan2019
 
Connecting Data and Intelligence: The Role of FME in Machine Learning
Connecting Data and Intelligence: The Role of FME in Machine Learning
Safe Software
 
Using the SQLExecutor for Data Quality Management: aka One man's love for the...
Using the SQLExecutor for Data Quality Management: aka One man's love for the...
Safe Software
 
Oh, the Possibilities - Balancing Innovation and Risk with Generative AI.pdf
Oh, the Possibilities - Balancing Innovation and Risk with Generative AI.pdf
Priyanka Aash
 
A Constitutional Quagmire - Ethical Minefields of AI, Cyber, and Privacy.pdf
A Constitutional Quagmire - Ethical Minefields of AI, Cyber, and Privacy.pdf
Priyanka Aash
 
AI Agents and FME: A How-to Guide on Generating Synthetic Metadata
AI Agents and FME: A How-to Guide on Generating Synthetic Metadata
Safe Software
 
Smarter Aviation Data Management: Lessons from Swedavia Airports and Sweco
Smarter Aviation Data Management: Lessons from Swedavia Airports and Sweco
Safe Software
 
Agentic AI for Developers and Data Scientists Build an AI Agent in 10 Lines o...
Agentic AI for Developers and Data Scientists Build an AI Agent in 10 Lines o...
All Things Open
 
The Growing Value and Application of FME & GenAI
The Growing Value and Application of FME & GenAI
Safe Software
 
Python Conference Singapore - 19 Jun 2025
Python Conference Singapore - 19 Jun 2025
ninefyi
 
Salesforce Summer '25 Release Frenchgathering.pptx.pdf
Salesforce Summer '25 Release Frenchgathering.pptx.pdf
yosra Saidani
 
OpenACC and Open Hackathons Monthly Highlights June 2025
OpenACC and Open Hackathons Monthly Highlights June 2025
OpenACC
 
Techniques for Automatic Device Identification and Network Assignment.pdf
Techniques for Automatic Device Identification and Network Assignment.pdf
Priyanka Aash
 
Securing Account Lifecycles in the Age of Deepfakes.pptx
Securing Account Lifecycles in the Age of Deepfakes.pptx
FIDO Alliance
 
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
Edge AI and Vision Alliance
 
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Priyanka Aash
 
"How to survive Black Friday: preparing e-commerce for a peak season", Yurii ...
"How to survive Black Friday: preparing e-commerce for a peak season", Yurii ...
Fwdays
 
"Scaling in space and time with Temporal", Andriy Lupa.pdf
"Scaling in space and time with Temporal", Andriy Lupa.pdf
Fwdays
 
Mastering AI Workflows with FME by Mark Döring
Mastering AI Workflows with FME by Mark Döring
Safe Software
 
Curietech AI in action - Accelerate MuleSoft development
Curietech AI in action - Accelerate MuleSoft development
shyamraj55
 
Tech-ASan: Two-stage check for Address Sanitizer - Yixuan Cao.pdf
Tech-ASan: Two-stage check for Address Sanitizer - Yixuan Cao.pdf
caoyixuan2019
 
Connecting Data and Intelligence: The Role of FME in Machine Learning
Connecting Data and Intelligence: The Role of FME in Machine Learning
Safe Software
 
Using the SQLExecutor for Data Quality Management: aka One man's love for the...
Using the SQLExecutor for Data Quality Management: aka One man's love for the...
Safe Software
 
Oh, the Possibilities - Balancing Innovation and Risk with Generative AI.pdf
Oh, the Possibilities - Balancing Innovation and Risk with Generative AI.pdf
Priyanka Aash
 
A Constitutional Quagmire - Ethical Minefields of AI, Cyber, and Privacy.pdf
A Constitutional Quagmire - Ethical Minefields of AI, Cyber, and Privacy.pdf
Priyanka Aash
 
AI Agents and FME: A How-to Guide on Generating Synthetic Metadata
AI Agents and FME: A How-to Guide on Generating Synthetic Metadata
Safe Software
 
Smarter Aviation Data Management: Lessons from Swedavia Airports and Sweco
Smarter Aviation Data Management: Lessons from Swedavia Airports and Sweco
Safe Software
 
Agentic AI for Developers and Data Scientists Build an AI Agent in 10 Lines o...
Agentic AI for Developers and Data Scientists Build an AI Agent in 10 Lines o...
All Things Open
 
The Growing Value and Application of FME & GenAI
The Growing Value and Application of FME & GenAI
Safe Software
 
Python Conference Singapore - 19 Jun 2025
Python Conference Singapore - 19 Jun 2025
ninefyi
 
Salesforce Summer '25 Release Frenchgathering.pptx.pdf
Salesforce Summer '25 Release Frenchgathering.pptx.pdf
yosra Saidani
 
OpenACC and Open Hackathons Monthly Highlights June 2025
OpenACC and Open Hackathons Monthly Highlights June 2025
OpenACC
 
Techniques for Automatic Device Identification and Network Assignment.pdf
Techniques for Automatic Device Identification and Network Assignment.pdf
Priyanka Aash
 
Securing Account Lifecycles in the Age of Deepfakes.pptx
Securing Account Lifecycles in the Age of Deepfakes.pptx
FIDO Alliance
 
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
Edge AI and Vision Alliance
 
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Priyanka Aash
 
"How to survive Black Friday: preparing e-commerce for a peak season", Yurii ...
"How to survive Black Friday: preparing e-commerce for a peak season", Yurii ...
Fwdays
 
"Scaling in space and time with Temporal", Andriy Lupa.pdf
"Scaling in space and time with Temporal", Andriy Lupa.pdf
Fwdays
 
Mastering AI Workflows with FME by Mark Döring
Mastering AI Workflows with FME by Mark Döring
Safe Software
 
Curietech AI in action - Accelerate MuleSoft development
Curietech AI in action - Accelerate MuleSoft development
shyamraj55
 
Ad

Analytics Web Day | Query your Data in S3 with SQL and optimize for Cost and Performance

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Steffen Grunwald, AWS Solutions Architect Analytics Web Day, 8. November 2018 Query your data in S3 with SQL and optimize for cost and performance
  • 2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What you will learn from this Session • Benefits of raw Data in Amazon Simple Storage Service • Query on S3 with Amazon Athena • Optimize your Data Structure • Compression • Partitioning • Columnar Formats • Derive Views from raw Data for frequent Queries
  • 3. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Example Application Architecture Amazon Kinesis Streams Amazon Kinesis Analytics Amazon Kinesis Streams AWS Lambda Amazon CloudWatch Amazon Kinesis Firehose Amazon QuickSight AWS Glue Amazon S3 Amazon Athena Instance
  • 4. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Benefits of raw Data in Amazon Simple Storage Service (S3) • Highly durable and cost-effective object store • Limitlessly scalable • Pay for what you use - in GB per month • Decouple storage from compute • Widely supported API by many consumers • Well integrated into other AWS systems Use S3 as long term storage to answer yet unknown questions of tomorrow.
  • 5. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Ingest Data with Amazon Kinesis Firehose • Stores stream of records as files in a bucket • Path: <Optional Prefix> + "YYYY/MM/DD/HH“ (Ingestion Time, UTC) • Optionally compress (GZIP, ZIP, Snappy) • Optionally store as columnar format (ORC, Parquet) • Optionally transform records with AWS Lambda Amazon Kinesis Firehose Amazon S3 Bucket
  • 6. Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using Standard SQL.
  • 7. Query Data Directly from Amazon S3
      • No loading of data
      • Query data in its raw format
      • No ETL required
      • Stream data directly from Amazon S3
  • 8. Presto SQL
      • ANSI SQL compliant
      • Complex joins, nested queries & window functions
      • Complex data types (arrays, structs, maps)
      • Partitioning of data by any key (date, time, custom keys)
      • Presto built-in functions
  • 9. Amazon Athena Supports Multiple Data Formats
      • Text files, e.g., CSV, raw logs
      • Apache Web Logs, TSV files
      • JSON (simple, nested)
      • Compressed files
      • Columnar formats such as Parquet & ORC
      • AVRO support
  • 10. Amazon Athena is Cost Effective
      • Pay per query
      • $5 per TB scanned from S3
      • DDL queries and failed queries are free
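To make the pricing bullet concrete, here is a rough cost estimate in Python. It is a sketch under stated assumptions: 1 TB is taken as 1024^4 bytes, and any per-query minimums or rounding rules are ignored. The function name `scan_cost_usd` is made up for this example.

```python
TB = 1024 ** 4  # assumption: binary terabyte; check the current pricing page


def scan_cost_usd(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Estimate the query cost from the number of bytes scanned in S3."""
    return bytes_scanned / TB * usd_per_tb
```

Under these assumptions, a query that scans a full 487 GB data set costs about `scan_cost_usd(487 * 1024**3)`, roughly 2.38 USD, every time it runs. Anything that reduces scanned bytes (compression, partitions, columnar formats) reduces that figure proportionally.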
  • 11. Demo: Query files from Amazon Kinesis Firehose with Amazon Athena and AWS Glue
  • 12. The Example Data
      • NYC Taxi & Limousine Commission rides
      • Data is generated by kinesis-taxi-stream-producer, available at [1]: java -jar kinesis-taxi-stream-producer.jar -speedup 400 -statisticsFrequency 10000 -stream nyctlc-ingestion -noWatermark -region eu-central-1 -adaptTime ingestion
      • ~2 GB/h of raw data, 11 days, 487 GB total
      [1] https://github.com/aws-samples/flink-stream-processing-refarch
  • 13. Test Setup: Ingesting Data with different Settings (diagram components: Instance, Amazon Kinesis Streams, four Firehose delivery streams (gzip, raw, orc, parquet), Amazon S3; max Amazon Kinesis Firehose buffering hints: 128 MB & 900 s)
  • 14. Photo by Glen Noble on Unsplash
  • 15. Photo by Tang Junwen on Unsplash
  • 16. Partitions to the Rescue: the AWS Glue crawler adds partitions based on file prefixes/directories.
  • 17. Crawl Partitions with AWS Glue (diagram components: Log, S3, Glue, Data Catalog, Athena; labelled steps: Crawl data, Create table partitions, Schema Lookup, Query data)
      Why? Just schedule the crawler, no need to code! Deals with schema evolution.
  • 18. Use Hive-style File Format in S3
      • Move/copy: YYYY/MM/DD/HH/file to year=YYYY/month=MM/day=DD/hours=HH/file
      • Make Athena reload partitions by: MSCK REPAIR TABLE
      Why? Format is easy to create on write and easy to move to.
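The move/copy step amounts to rewriting each object key. A minimal sketch of that rewrite in Python follows; the helper name `to_hive_style` and the example key are hypothetical, and the actual S3 copy is left out.

```python
def to_hive_style(key: str) -> str:
    """Rewrite '<prefix>/YYYY/MM/DD/HH/file' as
    '<prefix>/year=YYYY/month=MM/day=DD/hours=HH/file'.

    Raises ValueError if the key has fewer than five path segments.
    """
    # The last five segments are year, month, day, hour and the file name;
    # everything before them is an optional prefix that is kept unchanged.
    *prefix, year, month, day, hour, filename = key.split("/")
    hive_parts = [f"year={year}", f"month={month}", f"day={day}", f"hours={hour}"]
    return "/".join(prefix + hive_parts + [filename])
```

For example, `to_hive_style("raw/2018/08/30/00/part-0001.gz")` yields `raw/year=2018/month=08/day=30/hours=00/part-0001.gz`. The partition key names (`year`, `month`, `day`, `hours`) follow the slide and must match the partition columns of the table.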
  • 19. Creating Partitions with AWS Lambda (diagram components: Log, S3, Lambda, Data Catalog, Athena; labelled steps: New File Trigger, Add table partition, Schema Lookup, Query data)
      Why? Add partitions instantly, just AWS Lambda cost.
  • 20. Populate Partitions if paths are known
      Issue statements with Amazon Athena: ALTER TABLE mytable ADD PARTITION (year='2015',month='01',day='01') LOCATION 's3://[...]/2015/01/01/'
      Why? Easy for predictable paths. Can be prepopulated.
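Because the paths are predictable, the statements can be generated ahead of time. The sketch below only builds the SQL strings in the form shown on the slide; the function name `add_partition_statements` and the `base_location` value are hypothetical, and submitting the statements to Athena is left out.

```python
from datetime import date, timedelta


def add_partition_statements(table, base_location, first_day, last_day):
    """Return one ALTER TABLE ... ADD PARTITION statement per day (inclusive)."""
    statements = []
    day = first_day
    while day <= last_day:
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION "
            f"(year='{day:%Y}',month='{day:%m}',day='{day:%d}') "
            f"LOCATION '{base_location}/{day:%Y/%m/%d}/'"
        )
        day += timedelta(days=1)
    return statements
```

Calling `add_partition_statements("mytable", "s3://example-bucket/data", date(2015, 1, 1), date(2015, 1, 31))` produces 31 statements, so partitions for a whole month can be prepopulated before any data arrives.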
  • 21. Columnar Formats
  • 22. Flat File Sample Layout
      Last_Name  | Age | Gender | First_Name
      Label      | 34  | Fem    | Tootsie
      Le Fleming | 25  | Fem    | Miriam
      Lisciandro | 45  | Fem    | Blakeley
      Minghi     | 63  | Mal    | Ernst
      Jime       | 22  | Mal    | Brew
  • 23. Columnar Formats Layout (Parquet & ORC): each column is stored together with its MIN/MAX statistics
      Last_Name:  Label, Le Fleming, Lisciandro, Minghi, Jime (MIN: Jime, MAX: Minghi)
      Age:        34, 25, 45, 63, 22 (MIN: 22, MAX: 63)
      Gender:     Fem, Fem, Fem, Mal, Mal (MIN: Fem, MAX: Mal)
      First_Name: Tootsie, Miriam, Blakeley, Ernst, Brew (MIN: Blakeley, MAX: Tootsie)
  • 24. Benefit 1: Predicate Pushdown
      SELECT * FROM ... WHERE Age > 30
      (Shown on the sample columns and MIN/MAX statistics of slide 23; Age has MIN: 22, MAX: 63.)
  • 25. Benefit 2: Projection Pushdown/Column Pruning
      SELECT First_Name FROM ... WHERE Age > 30
      (Shown on the sample columns and MIN/MAX statistics of slide 23.)
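The two benefits can be illustrated with a toy model in Python. This is not how Parquet or ORC readers are implemented: real files keep the MIN/MAX statistics in metadata, while this sketch recomputes them. The first row group is the sample data from the slides, the second row group is invented for illustration, and all function names are hypothetical.

```python
# Each row group maps column name -> values.
ROW_GROUPS = [
    {   # sample data from the slides
        "First_Name": ["Tootsie", "Miriam", "Blakeley", "Ernst", "Brew"],
        "Last_Name": ["Label", "Le Fleming", "Lisciandro", "Minghi", "Jime"],
        "Age": [34, 25, 45, 63, 22],
        "Gender": ["Fem", "Fem", "Fem", "Mal", "Mal"],
    },
    {   # invented second row group in which nobody is older than 30
        "First_Name": ["Ada", "Bo"],
        "Last_Name": ["Example", "Sample"],
        "Age": [18, 29],
        "Gender": ["Fem", "Mal"],
    },
]


def column_stats(group):
    """Compute (MIN, MAX) per column, standing in for stored file metadata."""
    return {name: (min(values), max(values)) for name, values in group.items()}


def first_names_older_than(row_groups, threshold):
    """Model SELECT First_Name FROM ... WHERE Age > threshold."""
    names, groups_scanned, columns_read = [], 0, set()
    for group in row_groups:
        _, age_max = column_stats(group)["Age"]
        if age_max <= threshold:
            continue  # predicate pushdown: no row in this group can match
        groups_scanned += 1
        # Column pruning: only the two columns the query needs are touched.
        columns_read.update(("First_Name", "Age"))
        names.extend(
            name
            for name, age in zip(group["First_Name"], group["Age"])
            if age > threshold
        )
    return names, groups_scanned, sorted(columns_read)
```

With a threshold of 30, the second row group is skipped entirely because its Age maximum is 29, and Last_Name and Gender are never read. In a pay-per-scanned-byte service, both effects lower cost as well as latency.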
  • 26. Benefit 3: Compression & Encoding
      • RLE (& Bit Packing) for numbers
      • Dictionary for string repetitions (+RLE)
      • Delta encoding for increasing numbers
      • Delta strings (for strings with an identical prefix)
      • Plain encoding for varied strings
      https://github.com/apache/parquet-format/blob/master/Encodings.md
  • 27. More on Dictionary Encoding
      • Builds a list of unique strings and assigns a numeric ID to each
      • If the dictionary size exceeds 1 MB (configurable) or the number of distinct values is too high, the writer falls back to Plain encoding
      • The data itself is then represented as numbers and further encoded using RLE
      https://github.com/apache/parquet-format/blob/master/Encodings.md
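The idea behind dictionary encoding followed by RLE can be sketched as below. This is a simplified illustration only: it does not reproduce Parquet's actual byte layout (which uses an RLE/bit-packing hybrid) or the fallback to Plain encoding, and the function names are made up for the example.

```python
def dictionary_encode(values):
    """Return (dictionary, ids): unique values in first-seen order and one ID per value."""
    dictionary, index, ids = [], {}, []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        ids.append(index[value])
    return dictionary, ids


def run_length_encode(ids):
    """Collapse consecutive repeats into (id, run_length) pairs."""
    runs = []
    for current in ids:
        if runs and runs[-1][0] == current:
            runs[-1] = (current, runs[-1][1] + 1)
        else:
            runs.append((current, 1))
    return runs
```

For the Gender column of the sample data, `dictionary_encode(["Fem", "Fem", "Fem", "Mal", "Mal"])` gives the dictionary `["Fem", "Mal"]` with IDs `[0, 0, 0, 1, 1]`, and `run_length_encode` reduces those IDs to `[(0, 3), (1, 2)]`. Columns with few distinct values and long runs therefore compress very well, which is one reason sorting or partitioning the data before writing can pay off.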
  • 28. Demo: Parquet/ORC with Amazon Kinesis Firehose (new!)
  • 29. Analyzing Parquet File with parquet-tools
      • head: view data in file
      • meta: get metadata summary
      • dump -d -n: get detailed metadata down to page level, statistics included
  • 30. Download and build [1], then run: $ java -jar parquet-tools.jar meta <parquetfile>
      (Annotations on the output screenshot: Schema Information, Row Count, Total Byte Size, Size in Bytes, Value Count, Encoding)
      [1] https://github.com/apache/parquet-mr/
  • 31. parquet-tools dump: Encoding & Statistics
      total_amount: - DOUBLE SNAPPY DO:0 FPO:4155231 SZ:329324/338501/1.03 [more]... ST:[min: -76.8, max: 1121.3, num_nulls: 0]
      dropoff_datetime: - BINARY SNAPPY DO:0 FPO:3315979 SZ:839131/5540639/6.60 [more]... ST:[no stats for this column]
      Use numeric values (unix epoch) or partition by timestamp for time series data, because the string-typed dropoff_datetime column carries no statistics.
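One way to follow the unix epoch advice is to convert timestamp strings to integers before writing the columnar file, so that the column gets MIN/MAX statistics. The sketch below assumes UTC timestamps in the form `2018-08-30T00:13:48.573Z`; that format and the function name `to_epoch_millis` are assumptions for this example, not something the dump output confirms.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(timestamp: str) -> int:
    """Convert 'YYYY-MM-DDTHH:MM:SS.fffZ' (UTC) to milliseconds since the unix epoch."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    parsed = parsed.replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)
```

For example, `to_epoch_millis("2018-08-30T00:13:48.573Z")` returns `1535588028573`. Stored as a 64-bit integer, such a column can carry statistics that let a reader skip data for range predicates on time.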
  • 32. Analyzing ORC: orcfiledump
      Spin up a single-node (master-only) EMR cluster and use the hive command: hive --orcfiledump file://<absolutepath>/file.orc
      […]
      Column 7: count: 210141 hasNull: false min: -76.96324157714844 max: 0.0 sum: -1.5329986951126099E7
      Column 8: count: 210141 hasNull: false min: 2018-08-30T00:13:48.573Z max: 2018-08-30T00:28:49.564Z sum: 5043384
      […]
  • 33. ETL with AWS Glue For Frequent Queries (diagram components: Log, S3, Glue, Data Catalog, Athena; labelled steps: Read/Write, Write table partitions, Schema Lookup, Query data)
  • 34. Demo: ETL with AWS Glue
  • 35. Example Zeppelin/AWS Glue Notebook: https://gist.github.com/steffeng/5b841a99230ba8377f161f55453d49d0
  • 36. I applied these simple tricks when storing data for Amazon Athena and you won't believe what happened next... (Photo by Benjamin Davies on Unsplash)
  • 37. Measure. Then optimize. There's no silver bullet. (Photo by Cesar Carlevarino Aragon on Unsplash)
  • 38. Optimize for Cost and Performance 1/2
      • Use Athena in the region of your buckets.
      • Compress your data for less storage & query cost.
      • Use LIMIT in queries for faster results.
      • Partition your data based on data access patterns.
      • Use partitions in your queries.
      • Add partitions by crawling or S3 triggers.
  • 39. Optimize for Cost and Performance 2/2
      • Columnar formats such as ORC & Parquet reduce scanned data: faster queries at lower cost
      • Pick the format depending on data, access patterns, and clients
      • Inspect/verify the resulting files
      • Create aggregates for frequent queries
      • Shorten turnaround times for Glue job development: use a provisioned development endpoint, and use a small subset of your data (think KB!)
  • 40. The AWS Free Tier allows you to get hands-on experience with AWS Glue and S3. Try it today!
  • 41. Questions?