© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
SSID: Guest
Password: Cube@11999
Building Data Lake on AWS
Adir Sharabi
Solutions Architect, Amazon Web Services
Floor28 Agenda
14 Oct: Big Data Day (Technical Sessions, Serverless Data Workshop, Big Data UG Meetup)
15 Oct: ML & DL Day (Technical Sessions, SageMaker Workshop, ML&DL Meetup)
16 Oct: DevOps Day (Technical Sessions, K8s Workshop, DevOps Meetup)
17 Oct: DevOps Day (Technical Sessions, Spot Workshop)
18 Oct: Databases Day
21 Oct: Builders Day, Serverless backend
22 Oct: Builders Day, AppSync, Alexa & IoT
23 Oct: Enterprise IT Day
24 Oct: GameDay
Further sessions across 18-23 Oct: Technical Sessions; Serverless Workshop; Virtual assistants UG Meetup; PyTorch Meetup; CDK Workshop; AWS IL UG Meetup
Big Data Day Agenda
# | Time | Title | Speaker
1 | 9:30 - 10:15 | Building Data Lake on AWS | Adir Sharabi
2 | 10:30 - 11:15 | Store once, query thrice: Introduction to query engines on AWS | Daniel Haviv
3 | 11:30 - 12:15 | Introduction to Real-Time Streaming Analytics - Amazon Kinesis State Of Union | Roy Ben Alta
4 | 12:30 - 13:15 | From data to insights | Orit Alul
5 | 15:00 - 18:00 | Serverless Data Processing Workshop | Adir Sharabi
6 | 18:00 - 20:00 | Big Data User Group Meetup |
Your Data Sources
Multiple sources and formats, and growing every day
Categories: Records; Documents and files; Streams
Sources: Amazon RDS; Amazon DynamoDB; AWS IoT; on-premises databases; spreadsheets; infrastructure logs; clickstream data; mobile app data; social media data; Amazon Redshift; device data; sensor data; ERP; web; mobile apps
Data Challenges
Data Visibility: chart of Generated Data versus data Available for Analysis (axis: 1990, 2000, 2010, 2020)
Multiple consumers and requirements: Analysts; Data Scientists; Business Users; Applications
Multiple Access Mechanisms: API Access; BI Tools; Notebooks
Traditionally, Analytics Used to Look Like This
OLTP ERP CRM LOB
Data Warehouse
Business Intelligence
Relational data
Schema defined prior to data load
TBs-PBs Scale
Operational reporting and ad hoc
Large initial capex + $10K–$50K/TB/Year
Data Lakes Extend the Traditional Approach
Sources: OLTP; ERP; CRM; LOB; Devices; Web; Sensors; Social
Stores: Data Warehouse; Data Lake (with Catalog)
Analytics: Business Intelligence; Machine Learning; DW Queries; Big data processing; Interactive; Real-time
Relational and non-relational data
Schema defined during analysis
Scale storage and compute independently
Diverse analytical engines to gain insights
Designed for low-cost storage and analytics
Amazon S3 as Data Lakes Storage Layer
Data movement into Amazon S3: Snowball; Snowmobile; Kinesis Data Firehose; Kinesis Data Streams; Kinesis Video Streams
Analytics on Amazon S3: Redshift; EMR; Athena; Kinesis; Elasticsearch Service; AI Services
Many ways to bring all kinds of data
Unmatched durability and availability at EB scale
Best security, compliance, and audit capabilities
Integration with Big Data Tools
Run any analytics on the same data without movement
Cost effective - Store data at $0.023 / GB / Month
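To put the storage price in perspective, the arithmetic can be sketched in a few lines. This is a rough illustration only: it uses the $0.023 / GB / month figure above and the $10K to $50K per TB per year range quoted earlier for traditional warehouses, assumes a binary terabyte (1024 GB), and compares raw storage against a full warehouse, so the ratio is indicative rather than like-for-like.

```python
# Prices as quoted on the slides; real prices vary by region, tier, and date.
S3_USD_PER_GB_MONTH = 0.023
TRADITIONAL_DW_USD_PER_TB_YEAR = (10_000, 50_000)

GB_PER_TB = 1024  # assumption: binary terabyte


def s3_cost_per_tb_year(price_per_gb_month: float = S3_USD_PER_GB_MONTH) -> float:
    """Annual cost in USD of keeping 1 TB stored at a flat per-GB monthly price."""
    return price_per_gb_month * GB_PER_TB * 12


s3_yearly = s3_cost_per_tb_year()
low, high = TRADITIONAL_DW_USD_PER_TB_YEAR
print(f"S3 storage: ${s3_yearly:,.2f} per TB per year")
print(f"Traditional warehouse: ${low:,} to ${high:,} per TB per year")
print(f"Ratio: roughly {low / s3_yearly:.0f}x to {high / s3_yearly:.0f}x")
```

Request, retrieval, and transfer charges are not modeled here.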
Simplified Big Data Pipeline
Data sources: Transactions; Web logs / cookies; ERP; Connected devices
Stages: Ingest; Store (Amazon S3); Process & Analyze; Consume
Lots of ingestion tools
Data sources: Transactions; Web logs / cookies; ERP; Connected devices
Ingest: Amazon S3 API; Amazon Kinesis Firehose; Direct Connect; Snowball; Database Migration Service
Store: Amazon S3
Process & Analyze
Consume
Real-time data movement and Data Lakes on AWS
Producers: Kinesis Agent; Apache Kafka; AWS SDK; LOG4J; Flume; Fluentd; AWS Mobile SDK; Kinesis Producer Library
Streaming services: Amazon Kinesis Data Firehose; Amazon Kinesis Data Streams
Destination: Amazon S3
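Producers usually group events into batches before handing them to a delivery stream. The sketch below shows only that batching step in plain Python and makes no AWS calls. The function name `batch_records` is hypothetical, and the default limits (500 records, 4 MiB per batch) are assumptions modeled on typical per-call limits for batch put operations, so check the service documentation before relying on them.

```python
import json
from typing import Iterable, Iterator, List


def batch_records(
    events: Iterable[dict],
    max_records: int = 500,            # assumed per-call record limit
    max_bytes: int = 4 * 1024 * 1024,  # assumed per-call size limit
) -> Iterator[List[bytes]]:
    """Group events into batches of newline-delimited JSON records."""
    batch: List[bytes] = []
    size = 0
    for event in events:
        record = (json.dumps(event, separators=(",", ":")) + "\n").encode("utf-8")
        # Close the current batch if adding this record would exceed a limit.
        if batch and (len(batch) >= max_records or size + len(record) > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(record)
        size += len(record)
    if batch:
        yield batch


batches = list(batch_records({"id": i} for i in range(1200)))
print([len(b) for b in batches])
```

Newline-delimited JSON is used so that records concatenated into one S3 object can still be split line by line by downstream query engines.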
Variety of data processing tools
Data sources: Transactions; Web logs / cookies; ERP; Connected devices
Ingest: Amazon S3 API; Amazon Kinesis Firehose; Direct Connect; Snowball; Database Migration Service
Store: Amazon S3
Process & Analyze: Amazon Athena; Amazon EMR; Amazon Redshift & Spectrum; Amazon Elasticsearch; Amazon AI/ML/DL Services
Consume
Amazon Athena: interactive analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Ability to run SQL queries on data archived in Amazon Glacier (coming soon)
Query instantly: Zero setup cost; just point to Amazon S3 and start querying
Pay per query: Pay only for queries run; save 30%-90% on per-query costs through compression
Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
Easy: Serverless: zero infrastructure, zero administration; integrated with Amazon QuickSight
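The link between compression and per-query cost can be made concrete with a small calculation. This sketch assumes pricing is proportional to bytes scanned and uses an illustrative rate of $5 per TB, which is not stated on the slide; the helper names and the 4:1 compression ratio are likewise made up for the example.

```python
USD_PER_TB_SCANNED = 5.00  # assumption: illustrative rate, not from the slide
BYTES_PER_TB = 1024 ** 4


def query_cost(bytes_scanned: int, usd_per_tb: float = USD_PER_TB_SCANNED) -> float:
    """Cost in USD of a query billed purely on the bytes it scans."""
    return bytes_scanned / BYTES_PER_TB * usd_per_tb


def compressed_size(raw_bytes: int, compression_ratio: float) -> int:
    """Bytes left after compression; a ratio of 4.0 shrinks data to one quarter."""
    return int(raw_bytes / compression_ratio)


raw = BYTES_PER_TB  # a query that scans 1 TB of uncompressed data
cost_raw = query_cost(raw)
cost_compressed = query_cost(compressed_size(raw, 4.0))
saving = 1 - cost_compressed / cost_raw
print(f"uncompressed: ${cost_raw:.2f}, compressed: ${cost_compressed:.2f}, saving: {saving:.0%}")
```

With a 4:1 ratio the saving lands inside the 30%-90% range quoted above; columnar formats can reduce scanned bytes further because only the referenced columns are read.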
Amazon EMR: big data processing
Analytics and ML at scale
19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more
Enterprise-grade security
Latest versions: Updated with the latest open source frameworks within 30 days of release
Low cost: Flexible billing with per-second billing, Amazon EC2 Spot, Reserved Instances, and Auto Scaling to reduce costs 50%-80%
Use Amazon S3 storage: Process data directly in the Amazon S3 data lake securely with high performance using the EMRFS connector
Easy: Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, cluster tuning
Amazon Redshift: data warehousing
Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost
Massively parallel, scale from gigabytes to petabytes
Fast at scale: Columnar storage technology to improve I/O efficiency and scale query performance
Inexpensive: As low as $1,000 per terabyte per year, 1/10 the cost of traditional data warehouse solutions; start at $0.25 per hour
Open file formats: Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3
Secure: Audit everything; encrypt data end-to-end; extensive certification and compliance
Amazon Redshift Spectrum
Extend the data warehouse to exabytes of data in the Amazon S3 data lake
Diagram: the Amazon Redshift Spectrum query engine sits between Amazon Redshift data and the Amazon S3 data lake
Exabyte-scale Redshift SQL queries against Amazon S3
Join data across Redshift and Amazon S3
Scale compute and storage separately
Stable query performance and unlimited concurrency
CSV, ORC, Grok, Avro, & Parquet data formats
Pay only for the amount of data scanned
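Because billing follows the amount of data scanned, laying data out in partitions and filtering on the partition key is a common way to keep queries cheap. The toy model below illustrates the effect with plain Python. The partition prefixes and sizes are invented for the example, and the function is a simulation of partition pruning, not a call into any query engine.

```python
from typing import Callable, Dict

GIB = 1024 ** 3

# Hypothetical partition sizes in bytes, keyed by Hive-style partition prefix.
PARTITIONS: Dict[str, int] = {
    "dt=2018-10-12/": 40 * GIB,
    "dt=2018-10-13/": 35 * GIB,
    "dt=2018-10-14/": 50 * GIB,
}


def bytes_scanned(partitions: Dict[str, int], keep: Callable[[str], bool]) -> int:
    """Sum the sizes of the partitions that survive the partition filter."""
    return sum(size for prefix, size in partitions.items() if keep(prefix))


full_scan = bytes_scanned(PARTITIONS, lambda prefix: True)
one_day = bytes_scanned(PARTITIONS, lambda prefix: prefix == "dt=2018-10-14/")
print(f"full scan: {full_scan // GIB} GiB, single day: {one_day // GIB} GiB")
```

In this made-up layout a single-day filter reads 50 GiB instead of 125 GiB, so the scan-based charge drops in the same proportion.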
Amazon Redshift: accessed over JDBC / ODBC; hot data loaded with COPY commands
Redshift Spectrum: scale-out serverless compute (nodes 1, 2, 3, 4 ... N); query directly on data lake
Multiple ways to consume the data
Data sources: Transactions; Web logs / cookies; ERP; Connected devices
Ingest: Amazon S3 API; Amazon Kinesis Firehose; Direct Connect; Snowball; Database Migration Service
Store: Amazon S3
Process & Analyze: Amazon Athena; Amazon EMR; Amazon Redshift & Spectrum; Amazon Elasticsearch; Amazon AI/ML/DL Services
Consume: Amazon QuickSight; Jupyter, Zeppelin, HUE; Amazon API Gateway; BI Tools
Because data is NEVER perfect
What you need to do: Clean; Transform; Concatenate; Convert to better formats; Schedule transformations; Event-driven transformations; Transformations expressed as code
Amazon EMR: Spark and Hive running on EMR
AWS Glue: Event-based serverless ETL engine
AWS Lambda: Trigger-based code execution
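The clean, transform, and convert steps can be pictured as a small transformation expressed as code. The sketch below uses only the Python standard library on an invented CSV sample. The column names are hypothetical, and JSON Lines stands in for a "better format" because writing a columnar format such as Parquet would need a third-party library.

```python
import csv
import io
import json
from typing import Dict, Iterable, Iterator

RAW_CSV = "id, amount ,country\n1, 10.5 ,IL\n2,,US\n3,7,\n"  # made-up sample


def clean_rows(csv_text: str) -> Iterator[Dict[str, object]]:
    """Trim names and values, drop empty fields, and require a numeric amount."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        cleaned: Dict[str, object] = {
            key.strip(): value.strip()
            for key, value in row.items()
            if key is not None and value and value.strip()
        }
        if "amount" not in cleaned:
            continue  # skip rows that lack the required field
        cleaned["amount"] = float(cleaned["amount"])
        yield cleaned


def to_json_lines(rows: Iterable[Dict[str, object]]) -> str:
    """Serialize rows as newline-delimited JSON with stable key order."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


print(to_json_lines(clean_rows(RAW_CSV)))
```

The same function body could run on a schedule or in response to an object-created event; which service hosts it is a deployment choice and not something this sketch decides.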
AWS Glue
Data Catalog
Job Authoring (Develop): Auto-generates ETL code; Python/Scala and Apache Spark; Edit, debug, and share
Job Execution (Deploy): Serverless execution; Flexible scheduling; Monitoring and alerting
ETL when you need it
Data sources: Transactions; Web logs / cookies; ERP; Connected devices
Ingest: Amazon S3 API; Amazon Kinesis Firehose; Direct Connect; Snowball; Database Migration Service
Store: Amazon S3
Process & Analyze: Amazon Athena; Amazon EMR; Amazon Redshift & Spectrum; Amazon Elasticsearch; Amazon AI/ML/DL Services
Consume: Amazon QuickSight; Jupyter, Zeppelin, HUE; Amazon API Gateway; BI Tools
Real-time: in-stream processing
Data sources: Transactions; Web logs / cookies; ERP; Connected devices
Ingest: Amazon S3 API; Amazon Kinesis Firehose; Direct Connect; Snowball; Database Migration Service
In-stream processing: Spark Streaming & Flink; Amazon Kinesis Analytics
Store: Amazon S3
Process & Analyze: Amazon Athena; Amazon EMR; Amazon Redshift & Spectrum; Amazon Elasticsearch; Amazon AI/ML/DL Services
Consume: Amazon QuickSight; Jupyter, Zeppelin, HUE; Amazon API Gateway; BI Tools
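In-stream processing typically means aggregating events over time windows before they land in storage. As a rough illustration, here is a tumbling-window count in plain Python. It does not use the Spark Streaming, Flink, or Kinesis Analytics APIs, and the event tuples and the 60-second window are assumptions chosen for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[float, str]],
    window_seconds: int = 60,
) -> Dict[int, Counter]:
    """Count events per key in fixed, non-overlapping windows.

    Each event is (timestamp_in_seconds, key); the result maps the window
    start time to a Counter of keys seen in that window.
    """
    windows: Dict[int, Counter] = {}
    for timestamp, key in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        windows.setdefault(window_start, Counter())[key] += 1
    return windows


# Hypothetical clickstream: (seconds since start, page)
EVENTS = [(0.5, "/home"), (10.0, "/cart"), (59.9, "/home"), (60.0, "/home"), (125.0, "/cart")]
counts = tumbling_window_counts(EVENTS)
for start in sorted(counts):
    print(start, dict(counts[start]))
```

A real stream processor would also handle late and out-of-order events and emit each window as it closes; this sketch processes a finished list only.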
AWS Glue Data Catalog
Central Metadata Catalog for the data lake
One per account
Allows you to share metadata between Amazon Athena, Amazon Redshift Spectrum, EMR & JDBC sources
Serverless
We added a few extensions:
• Search over metadata for data discovery
• Manage Connections: JDBC URLs, credentials
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve and other metadata are updated
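The idea of versioning table metadata as schemas evolve can be sketched with a tiny in-memory model. Everything here is hypothetical: the `CatalogTable` class, its fields, and the bucket path are invented to illustrate the concept and do not reflect how the Glue Data Catalog is implemented or accessed.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Schema = Tuple[Tuple[str, str], ...]  # ((column name, column type), ...)


@dataclass
class CatalogTable:
    """Toy catalog entry that keeps every schema version it has seen."""
    name: str
    location: str
    versions: List[Schema] = field(default_factory=list)

    def update_schema(self, schema: Schema) -> int:
        """Store a new version only if the schema changed; return the version count."""
        if not self.versions or self.versions[-1] != schema:
            self.versions.append(schema)
        return len(self.versions)

    @property
    def current(self) -> Schema:
        return self.versions[-1]


table = CatalogTable("clicks", "s3://example-bucket/clicks/")  # hypothetical location
v1 = table.update_schema((("user_id", "string"), ("url", "string")))
unchanged = table.update_schema((("user_id", "string"), ("url", "string")))
v2 = table.update_schema((("user_id", "string"), ("url", "string"), ("ts", "timestamp")))
print(v1, unchanged, v2, table.current)
```

Keeping old versions around is what lets consumers compare or audit how a table's schema changed over time.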
AWS Glue Data Catalog Crawlers
Catalogs Your Data
Crawlers automatically build your Data Catalog and keep it in sync
• Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive-style partitions on Amazon S3
Built-in classifiers for popular types; custom classifiers using Grok expressions
Run ad hoc or on a schedule; serverless, so you pay only when the crawler runs
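Hive-style partitions encode column values in the object path as `name=value` segments. A minimal sketch of how such keys can be recognized is shown below. The function name and the sample key are made up, and this is only an illustration of the naming convention, not the crawler's actual detection logic.

```python
from typing import Dict


def parse_hive_partitions(key: str) -> Dict[str, str]:
    """Extract name=value path segments from an object key, ignoring the file name."""
    partitions: Dict[str, str] = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, separator, value = segment.partition("=")
        if separator and name and value:
            partitions[name] = value
    return partitions


print(parse_hive_partitions("logs/year=2018/month=10/day=14/part-0000.json"))
```

Segments without an `=` sign, such as the leading `logs` prefix, are treated as plain path components and skipped.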
AWS Glue Crawler
What can crawlers discover?
Databases
• JDBC Connection: MySQL; MariaDB; PostgreSQL; Amazon Redshift; Aurora; Oracle
• NoSQL Connection: Amazon DynamoDB
• Object Connection: Amazon S3
Built-in classifiers
• Avro
• Parquet
• ORC
• XML
• JSON & JSONPaths
• AWS CloudTrail
• BSON
• Logs (Apache (Grok), Linux (Grok), MS (Grok), Ruby, Redis, and many others)
• Delimited (comma, pipe, tab, semicolon)
• Always growing
Create additional custom classifiers
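Grok patterns are, in essence, named regular expressions, so the idea behind a log classifier can be sketched with Python's `re` module and named groups. The pattern below targets the Apache common log format. The group names echo common Grok field naming but are chosen for illustration, and this is a stand-in for the concept rather than a Glue classifier definition.

```python
import re
from typing import Dict, Optional

# Named groups play the role of Grok fields for an Apache common log line.
COMMON_LOG = re.compile(
    r'(?P<clientip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-)'
)


def classify(line: str) -> Optional[Dict[str, str]]:
    """Return the extracted fields if the line matches, otherwise None."""
    match = COMMON_LOG.match(line)
    return match.groupdict() if match else None


SAMPLE = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
print(classify(SAMPLE))
print(classify("not a log line"))
```

A classifier that returns no match lets the next classifier in line try the data, which is roughly how a chain of built-in and custom classifiers can coexist.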
Other ways of populating the catalog
Call the AWS Glue CreateTable API
Create a table manually with a DDL statement (in Amazon Athena or Amazon EMR)
Import from Apache Hive Metastore (Apache Hive Metastore, through AWS Glue ETL, into the AWS Glue Data Catalog)
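For the API route, the main work is describing the table. The sketch below only builds a table description as a Python dictionary, shaped the way I understand the `TableInput` argument of the Glue CreateTable request to look, and makes no AWS call. The helper name, table name, bucket path, and columns are all hypothetical, and the JSON SerDe choice is an assumption for newline-delimited JSON data.

```python
from typing import Dict, Iterable, Tuple

Column = Tuple[str, str]  # (name, Hive type)


def build_table_input(
    name: str,
    location: str,
    columns: Iterable[Column],
    partition_keys: Iterable[Column] = (),
) -> Dict[str, object]:
    """Describe an external JSON table stored under an S3 prefix."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "json"},
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"},
        },
    }


table_input = build_table_input(
    name="clicks",                            # hypothetical table
    location="s3://example-bucket/clicks/",   # hypothetical bucket
    columns=[("user_id", "string"), ("url", "string")],
    partition_keys=[("dt", "string")],
)
# With boto3 installed and credentials configured, this dictionary could be
# passed as TableInput to the Glue client's create_table call; not done here.
print(table_input["StorageDescriptor"]["Location"])
```

Partition columns are listed separately from data columns, mirroring how Hive-style tables keep partition keys out of the stored files.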
Write once, catalog once, read multiple, ETL anywhere
Data sources: Transactions; Web logs / cookies; ERP; Connected devices
Ingest: Amazon S3 API; Amazon Kinesis Firehose; Direct Connect; Snowball; Database Migration Service
In-stream processing: Spark Streaming & Flink; Amazon Kinesis Analytics
Store: Amazon S3
Data Catalog
Process & Analyze: Amazon Athena; Amazon EMR; Amazon Redshift & Spectrum; Amazon Elasticsearch; Amazon AI/ML/DL Services
Consume: Amazon QuickSight; Jupyter, Zeppelin, HUE; Amazon API Gateway; BI Tools
Core Tenets
• Data lakes and data warehouses complement each other
• Loose coupling, but highly performant
• Storage, analytics, metadata management, etc.
• Choosing the best tool for the job
• Future-proof your analytics
• Elasticity and multiple clusters for dedicated purposes
• Replace capacity planning with a consumption model
• Don’t forget metadata management
Thank You!
Adir Sharabi
GAME DAY
PUT YOUR SKILLS TO THE TEST
OCT 24
Register now: bit.ly/Floor28GameDay

AWS Floor 28 - Building Data lake on AWS

  • 1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. SSID: Guest Password: Cube@11999 Building Data Lake on AWS Adir Sharabi Solutions Architect, Amazon Web Services
  • 2. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Floor28 Agenda GameDay 24 Oct Enterprise IT Day 23 Oct Builders Day AppSync, Alexa & IoT 22 Oct Big Data Day 14 Oct ML & DL Day 15 Oct DevOps Day 16 Oct DevOps Day 17 Oct Technical Sessions Serverless Data Workshop Big Data UG Meetup Technical Sessions SageMaker Workshop ML&DL Meetup Technical Sessions K8s Workshop DevOps Meetup Technical Sessions Spot Workshop Databases Day 18 Oct Technical Sessions Serverless Workshop Virtual assistants UG Meetup Technical Sessions PyTorch Meetup Technical Sessions CDK Workshop AWS IL UG Meetup Builders Day Serverless backend 21 Oct Technical Sessions
  • 3. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Big Data Day Agenda # Time Title Speaker 1 9:30 - 10:15 Building Data Lake on AWS Adir Sharabi 2 10:30 - 11:15 Store once, query thrice: Introduction to query engines on AWS Daniel Haviv 3 11:30 – 12:15 Introduction to Real-Time Streaming Analytics - Amazon Kinesis State Of Union Roy Ben Alta 4 12:30 - 13:15 From data to insights Orit Alul 5 15:00 – 18:00 Serverless Data Processing Workshop Adir Sharabi 6 18:00 – 20:00 Big Data User Group Meetup
  • 4. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Documents and files Streams Multiple sources and formats… and growing everyday Your Data Sources Records Amazon RDS Amazon DynamoDB AWS IoT On Premises databases Spreadsheets Infrastructure logs Clickstream data Mobile app data Social media data Amazon Redshift Device data Sensor data ERP WEB Clickstream Mobile Apps
  • 5. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Challenges Analysts Applications Data Scientists Business Users API Access BI Tools Notebooks Multiple consumers and requirements 1990 2000 2010 2020 Generated Data Available for Analysis Data Visibility Multiple Access Mechanisms
  • 6. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Traditionally, Analytics Used to Look Like This OLTP ERP CRM LOB Data Warehouse Business Intelligence Relational data Schema defined prior to data load TBs-PBs Scale Operational reporting and ad hoc Large initial capex + $10K–$50K/TB/Year
  • 7. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Lakes Extend the Traditional Approach OLTP ERP CRM LOB Data Warehouse Business Intelligence Data Lake 1001100001001010111001 0101011100101010000101 1111011010 0011110010110010110 0100011000010 Devices Web Sensors Social Catalog Machine Learning DW Queries Big data processing Interactive Real-time Relational and non-relational data Schema defined during analysis Scale storage and compute independently Diverse analytical engines to gain insights Designed for low-cost storage and analytics
  • 8. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Snowball Snowmobile Kinesis Data Firehose Kinesis Data Streams Amazon S3 Redshift EMR Athena Kinesis Elasticsearch Service Amazon S3 as Data Lakes Storage Layer Kinesis Video Streams AI Services Many ways to bring all kinds of data Unmatched durability and availability at EB scale Best security, compliance, and audit capabilities Integration with Big Data Tools Run any analytics on the same data without movement Cost effective - Store data at $0.023 / GB / Month
  • 9. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Store Simplified Big Data Pipeline Amazon S3 Ingest Process & Analyze Consume Data sources Transactions Web logs / cookies ERP Connected devices
  • 10. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Lots of ingestion tools IngestData sources Transactions Web logs / cookies ERP Connected devices Process & Analyze Consume Store Amazon S3 Amazon S3 API Amazon Kinesis Firehose Direct Connect Snowball Database Migration Service
  • 11. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Real-time data movement and Data Lakes on AWS Amazon Kinesis Data Firehose Amazon Kinesis Data Streams Kinesis Agent Apache Kafka AWS SDK LOG4J Flume Fluentd AWS Mobile SDK Kinesis Producer Library Amazon S3
  • 12. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Lots of ingestion tools IngestData sources Transactions Web logs / cookies ERP Connected devices Process & Analyze Consume Store Amazon S3 Amazon S3 API Amazon Kinesis Firehose Direct Connect Snowball Database Migration Service
  • 13. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Variety of data processing tools IngestData sources Transactions Web logs / cookies ERP Connected devices Consume Amazon Athena Amazon EMR Amazon Redshift & Spectrum Amazon Elasticsearch Amazon AI/ML/DL Services Store Amazon S3 Process & Analyze Amazon S3 API Amazon Kinesis Firehose Direct Connect Snowball Database Migration Service
  • 14. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Interactive query service to analyze data in Amazon S3 using standard SQL No infrastructure to set up or manage and no data to load Ability to run SQL queries on data archived in Amazon Glacier (coming soon) Amazon Athena – interactive analysis $ SQL Query instantly Zero setup cost; just point to Amazon S3 and start querying Pay per query Pay only for queries run; save 30%–90% on per- query costs through compression Open ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types Easy Serverless: zero infrastructure, zero administration Integrated with Amazon QuickSight
  • 15. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Analytics and ML at scale 19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more Enterprise-grade security Amazon EMR – big data processing Latest versions Updated with the latest open source frameworks within 30 days of release Low cost Flexible billing with per- second billing, Amazon EC2 Spot, Reserved Instances, and Auto Scaling to reduce costs 50%-80% Use Amazon S3 storage Process data directly in the Amazon S3 data lake securely with high performance using the EMRFS connector Easy Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, cluster tuning Data Lake 100110000100101011100 1010101110010101000 00111100101100101 010001100001 $
  • 16. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost Massively parallel, scale from gigabytes to petabytes Amazon Redshift – data warehousing Fast at scale Columnar storage technology to improve I/O efficiency and scale query performance Inexpensive As low as $1,000 per terabyte per year, 1/10 the cost of traditional data warehouse solutions; start at $0.25 per hour Open file formats Secure Audit everything; encrypt data end-to- end; extensive certification and compliance Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3 $
  • 17. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Extend the data warehouse to exabytes of data in Amazon S3 data lake Amazon Redshift Spectrum Amazon S3 Data Lake Amazon Redshift data Amazon Redshift Spectrum query engine Exabyte Redshift SQL queries against Amazon S3 Join data across Redshift and Amazon S3 Scale compute and storage separately Stable query performance and unlimited concurrency CSV, ORC, Grok, Avro, & Parquet data formats Pay only for the amount of data scanned
  • 18. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Redshift JDBC / ODBC ... 1 2 3 4 N Redshift Spectrum Scale-out serverless compute COPY commands Hot data Query directly on data lake
  • 19. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Multiple ways to consume the data IngestData sources Transactions Web logs / cookies ERP Connected devices Consume Amazon Athena Amazon EMR Amazon Redshift & Spectrum Amazon Elasticsearch Amazon AI/ML/DL Services Store Amazon S3 Process & Analyze Amazon S3 API Amazon Kinesis Firehose Direct Connect Snowball Database Migration Service Amazon QuickSight Jupyter, Zeppelin, HUE Amazon API Gateway BI Tools
  • 20. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Because data is NEVER perfect Amazon EMR Spark and Hive running on EMR Clean Transform Concatenate Convert to better formats Schedule transformations Event-driven transformations Transformations expressed as code AWS Glue Event based Server-less ETL engine AWS Lambda Trigger-based Code Execution
  • 21. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Glue Job Authoring Job Execution Auto-generates ETL code Python/Scala and Apache Spark Edit, debug, and share Develop Serverless execution Flexible scheduling Monitoring and alerting Deploy Data Catalog
  • 22. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. ETL when you need it IngestData sources Transactions Web logs / cookies ERP Connected devices Consume Amazon Athena Amazon EMR Amazon Redshift & Spectrum Amazon Elasticsearch Amazon AI/ML/DL Services Store Amazon S3 Process & Analyze Amazon S3 API Amazon Kinesis Firehose Direct Connect Snowball Database Migration Service Amazon QuickSight Jupyter, Zeppelin, HUE Amazon API Gateway BI Tools
  • 23. Realtime - in-stream processing. Data sources: Transactions, Web logs / cookies, ERP, Connected devices. Ingest: Amazon S3 API, Amazon Kinesis Firehose, Direct Connect, Snowball, Database Migration Service. In stream process: Spark Streaming & Flink, Amazon Kinesis Analytics. Store: Amazon S3. Process & Analyze: Amazon Athena, Amazon EMR, Amazon Redshift & Spectrum, Amazon Elasticsearch, Amazon AI/ML/DL Services. Consume: Amazon QuickSight, Jupyter, Zeppelin, HUE, Amazon API Gateway, BI Tools.
  • 24. AWS Glue Data Catalog: central metadata catalog for the data lake. One per account. Serverless. Allows you to share metadata between Amazon Athena, Amazon Redshift Spectrum, EMR & JDBC sources. We added a few extensions: search over metadata for data discovery; manage connections (JDBC URLs, credentials); classification for identifying and parsing files; versioning of table metadata as schemas evolve and other metadata are updated.
  • 25. AWS Glue Data Catalog Crawlers: Catalogs Your Data. Crawlers automatically build your Data Catalog and keep it in sync. They automatically discover new data and extract schema definitions: detect schema changes and version tables; detect Hive-style partitions on Amazon S3. Built-in classifiers for popular types; custom classifiers using Grok expressions. Run ad hoc or on a schedule; serverless, so you pay only when the crawler runs.
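To make "Hive-style partitions on Amazon S3" concrete: the convention encodes partition columns as `name=value` path segments in the object key. The sketch below shows how such a key can be read back into partition values. It is a plain-Python illustration of the naming convention only, not how Glue crawlers are implemented; the function name and the sample keys are hypothetical.

```python
def hive_partitions(key):
    """Return the name=value path segments of an object key as a dict."""
    parts = {}
    # The last path segment is the file name, so it is skipped.
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name:
            parts[name] = value
    return parts


# Hypothetical object keys, with and without partition segments.
print(hive_partitions("sales/year=2018/month=10/part-0000.parquet"))
print(hive_partitions("sales/part-0000.parquet"))
```

A layout like the first key lets query engines skip whole prefixes when a query filters on `year` or `month`.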
  • 26. AWS Glue Crawler: What can crawlers discover? Databases and Amazon Redshift (JDBC connection): MySQL, MariaDB, PostgreSQL, Amazon Redshift, Aurora, Oracle. Amazon S3 (object connection). Amazon DynamoDB (NoSQL connection). Built-in classifiers: Avro; Parquet; ORC; XML; JSON & JSONPaths; AWS CloudTrail; BSON; Logs (Apache (Grok), Linux (Grok), MS (Grok), Ruby, Redis, and many others); Delimited (comma, pipe, tab, semicolon); <ALWAYS GROWING...>. Create additional custom classifiers.
  • 27. Other ways of populating the catalog: call the AWS Glue CreateTable API; create table manually; DDL statement (in Amazon Athena or Amazon EMR); import from Apache Hive Metastore. Diagram labels: Apache Hive Metastore; AWS Glue ETL; AWS Glue Data Catalog.
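For the "call the AWS Glue CreateTable API" option, the request centres on a `TableInput` structure. The sketch below only builds that structure for a Parquet table on S3, so it runs without AWS credentials. The helper name, database name, table name, columns, and bucket are all hypothetical; the commented-out call at the end assumes the boto3 Glue client.

```python
def parquet_table_input(name, location, columns, partition_keys=()):
    """Build a Glue TableInput dict for an external Parquet table."""
    def to_cols(pairs):
        return [{"Name": n, "Type": t} for n, t in pairs]

    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": to_cols(partition_keys),
        "StorageDescriptor": {
            "Columns": to_cols(columns),
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


# Hypothetical table: names, columns, and bucket are placeholders.
table_input = parquet_table_input(
    name="sales",
    location="s3://example-bucket/sales/",
    columns=[("order_id", "string"), ("amount", "double")],
    partition_keys=[("year", "string"), ("month", "string")],
)

# With credentials configured, the dict would be sent roughly like this:
# import boto3
# boto3.client("glue").create_table(DatabaseName="datalake", TableInput=table_input)
```

Once a table is registered this way, the other engines that read the Data Catalog can see the same definition.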
  • 28. Write once, catalog once, read multiple, ETL anywhere. Data sources: Transactions, Web logs / cookies, ERP, Connected devices. Ingest: Amazon S3 API, Amazon Kinesis Firehose, Direct Connect, Snowball, Database Migration Service. In stream process: Spark Streaming & Flink, Amazon Kinesis Analytics. Store: Amazon S3. Data Catalog. Process & Analyze: Amazon Athena, Amazon EMR, Amazon Redshift & Spectrum, Amazon Elasticsearch, Amazon AI/ML/DL Services. Consume: Amazon QuickSight, Jupyter, Zeppelin, HUE, Amazon API Gateway, BI Tools.
  • 29. Core Tenets: • Data lakes and data warehouses complement each other • Loose coupling, but highly performant • Storage, analytics, metadata management, etc. • Choosing the best tool for the job • Future-proof your analytics • Elasticity and multiple clusters for dedicated purposes • Replace capacity planning with a consumption model • Don’t forget metadata management
  • 30. Thank You! Adir Sharabi
  • 31. GAME DAY: PUT YOUR SKILLS TO THE TEST. OCT 24. Register now: bit.ly/Floor28GameDay. SSID: Guest, Password: Cube@11999