Cost Optimization for Hadoop/Spark
Workloads with Amazon EMR
Presented by:
June 2, 2020
Pritpal Sahota
Technical Account Manager
Provectus
Stepan Pushkarev
Chief Technology Officer
Provectus
Nirav Shah
Senior Solutions Architect
Amazon Web Services
Perry Peterson
Business Development Manager
Amazon Web Services
1. Show how to optimize costs by migrating to Amazon EMR
2. Cover risk mitigation and best practices for migrating Hadoop/Spark
workloads to Amazon EMR
Webinar Objectives
• Introduction
• Hadoop market and Cost optimizations using Amazon EMR
• Cost related and other challenges of on-prem Hadoop clusters
• Cost optimizations by using Amazon EMR and migration best
practices
• Amazon EMR migration acceleration workshop overview
Agenda
Stepan Pushkarev
Chief Technology
Officer
Provectus
Pritpal Sahota
Technical Account
Executive
Provectus
Presenters
Nirav Shah
Senior Solutions
Architect
Amazon Web Services
Perry Peterson
Business Development
Manager – Analytics
Amazon Web Services
AWS Partner Network (APN) Premier Consulting Partner
AI-first Consultancy & Solutions Provider
Clients ranging from
fast-growing startups
through large
enterprises
450 employees and
growing
Established in 2010
HQ in Palo Alto
Offices across the US,
Canada, and Europe
Machine Learning
Employ analytical algorithms
to unveil hidden value from
raw data that helps solve
business challenges
DevOps/DevSecOps
Improve development and
delivery pipelines to bring
your product to the market
faster and more resiliently
Next Gen Cloud
Modernize your application
and data landscape to allow
for more agility and better
service to your customers
Big Data
Gain data-driven insights
through the holistic data
analysis made available with
a big data platform
AWS Competencies in Machine Learning, Data & Analytics, and DevOps
Core Competencies
Innovative Tech Vendors
Seeking niche expertise to
differentiate and win in the market
Enterprises
Seeking to accelerate innovation and
achieve operational excellence
Clientele
Hadoop Market and Cost Optimization
using Amazon EMR
Rapid growth of cloud adoption in big data space
7.5x faster than on-prem installs as per Forrester Research
Uncertainty with leading Hadoop commercial vendors
Leading commercial Hadoop vendors face uncertainty & headwinds. Customers are
exploring cloud to leverage cost benefits, flexibility, scalability, & price-performance
Large & growing Hadoop market
According to a market study report, over the next five years the Hadoop market
will register a 33% annual revenue growth with market size reaching $9.4B by 2024
Availability of Resources
Big data engineers prefer to work on cloud-based big data solutions
Hadoop market
Amazon EMR is an enterprise-grade Spark/Hadoop managed service helping businesses, researchers, data analysts, and developers to process and
analyze vast amounts of data. EMR solves complex technical/business challenges: clickstream and log analysis along with real-time and predictive
analytics. In comparison to on-premises deployments, IDC confirms Amazon EMR provides year 1 savings of 57% and 342% ROI over 5 years.
What is EMR & where is it in the Analytics stack?
EMR powers most cloud Hadoop/Spark projects
processes 135B events/day and has cost savings of 60% (~$20M)
decreased costs by $600k in less than 5 months
saves 75% and is 60% more efficient
achieves cost savings of 55% when compared to on-demand
pricing and 40% savings when compared to Reserved Instances
High-impact results with Amazon EMR
near real-time analytics for 140M players
scales 3,000 transient clusters on a daily basis
powers the Predix solution processing 1M data executions/day
computes Zestimates on 100M+ homes in hours instead of 1 day
reduced cost of operation and improved Spark performance 3x
High-impact results with Amazon EMR
NinthDecimal is an omnichannel marketing platform
helping Fortune 500 brands identify new prospects and
customers, drive store visits, and increase sales using
AI- and data-driven consumer intelligence.
NinthDecimal is seeing a 3x speedup for Spark workloads
on Amazon EMR and a 3-5x cost reduction. This means
better SLAs for delivering insights to clients and an
improved bottom line for the business.
IMVU is the world’s largest avatar-based social network,
serving 6M+ players and offering 40M+ virtual goods.
IMVU has migrated 450+ Spark & Hive jobs and re-architected
its monolithic Hadoop environment into transient Amazon EMR
clusters orchestrated with Airflow pipelines.
By moving to AWS and Amazon EMR, IMVU saved 30% on costs and
became 80% more efficient in data engineering and analytics.
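The transient-cluster pattern can be sketched with Airflow's AWS provider. The following is a minimal, hypothetical example (cluster sizing, job names, and S3 paths are assumptions, not IMVU's actual pipeline): create an EMR cluster with a Spark step, let it run, and have it terminate itself when the step finishes.

```python
# Minimal Airflow DAG sketch for a transient EMR cluster.
# Cluster config, job names, and S3 paths are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrCreateJobFlowOperator
from airflow.providers.amazon.aws.sensors.emr import EmrJobFlowSensor

JOB_FLOW = {
    "Name": "transient-spark-etl",
    "ReleaseLabel": "emr-5.30.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate once all steps finish
    },
    "Steps": [{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],
        },
    }],
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

with DAG("transient_emr_etl", start_date=datetime(2020, 6, 1),
         schedule_interval="@daily", catchup=False) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_cluster", job_flow_overrides=JOB_FLOW)
    # Wait until the job flow (and therefore the cluster) terminates
    wait_for_done = EmrJobFlowSensor(
        task_id="wait_for_done",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster') }}")
    create_cluster >> wait_for_done
```

Because the cluster exists only for the life of the DAG run and all data stays in S3, there is nothing to pay for between runs.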
57%
reduction in cost of ownership
342%
five-year ROI
8 months
to breakeven
99%
reduction in unplanned downtime
33%
more efficient Big Data teams
46%
more efficient Big Data/Hadoop management staff
Referenced IDC White Paper: "The Economic Benefits of Migrating Apache Spark and Hadoop to Amazon EMR"
IDC study: Hadoop to Amazon EMR migration
Amazon EMR Migration Patterns
and Best Practices Overview
Amazon EMR Migration Patterns
On-Premises → Lift & Shift → Instance Right-Sizing → S3 vs. HDFS → Transient clusters
● Lift & Shift
a. Low risk & lowest migration cost
b. Very high ongoing cost
c. Low business value addition
d. Quickest time to market
● Re-Architect - Migrate to Amazon EMR with a new architecture and
complementary services to optimize cost and provide additional
functionality, scalability, flexibility, etc.
a. Medium risk, medium migration cost
b. Medium ongoing cost
c. High business value addition
d. Medium time to market
● Next Gen Architecture - Migrate to Amazon EMR with a completely new
architecture, which may include streaming and containers, with added
functionality, scalability, flexibility, etc.
a. High risk, highest migration cost
b. Lowest ongoing cost
c. Highest business value addition
d. Longest time to market
An approach to best practice deployment
Go beyond a lift & shift to optimize for scale and cost.
On-Premises → Lift & Shift → Instance Right-Sizing → Amazon S3 vs. HDFS → Transient clusters → Auto-scaling → Spot Pricing → Automated Orchestration → Amazon EMR Optimized → True TCO comparison
Business factors:
• Capex → Opex
• On-prem license fees
• Maintenance overhead
• Uncertainty in Hadoop vendors
• Lowest pricing compared to other Hadoop/Spark premium vendors
Amazon EMR value add:
• Decoupled storage & compute
• Transient clusters
• Spot pricing
• Autoscaling
• Optimized hardware
• Amazon S3 lifecycle
• Proprietary Spark Amazon EMR engine
Next Gen Architecture value add:
• Data pipeline optimization
• Stream processing
• Serverless ETL
• Serverless ad-hoc queries
• Serverless Data Catalog
• Workload decomposition (Amazon EMR, Amazon Redshift, Athena, SageMaker)
10-20% cost reduction + 10-40% reduction + 20-90% reduction
Overview of Cost Optimization Factors
Migration Risk Mitigation Strategies
On-Premises → Lift & Shift → Instance Right-Sizing → S3 vs. HDFS → Transient clusters → Auto-scaling → Spot Pricing → Automated Orchestration → EMR Optimized → True TCO comparison
● Analyze all applications and workloads to ascertain
compute, memory, storage, runtimes by day/week/month,
and any other infrastructure needs
● Develop a business value and implementation complexity
model for all applications and workloads; plot them on a
business value vs. complexity prioritization matrix
● Mirror data loads onto the Amazon EMR cluster in an
organized way, in parallel with the on-prem Hadoop cluster
● Start moving workloads onto Amazon EMR in an orderly
fashion
● Identify enthusiastic innovators within each business unit
to promote the on-prem to Amazon EMR migration
● Work with experts like Provectus to lead this effort
[Chart: prioritization matrix plotting workloads A-G by business value vs.
complexity, used to select the initial workloads to migrate]
1. Build a business case for Amazon EMR migration, including a comparative
cost analysis
2. Develop a risk mitigation plan
3. Design a Next-Gen Data Platform and its adoption roadmap
4. Execute the migration and re-architecture hands-on
How Provectus can help
Cost and other challenges of On-Prem
Hadoop/Spark Environments
Compute and storage growth
Tightly coupled
● Storage grows along with
compute
● Compute requirements vary
3x
● Data is replicated several times
● Typically only in one data center
Underutilized or scarce resources
[Chart: cluster utilization over time, showing re-processing spikes, weekly
peaks, and a steady-state baseline]
Contention for the same resources
Compute
bound
Memory
bound
With a monolithic cluster, downstream application dependencies may prevent you
from upgrading versions. By not upgrading, organizations could be limiting innovation.
● Large Scale Transformation: Map/Reduce, Hive, Pig, Spark
● Interactive Queries: Impala, Spark SQL, Presto
● Machine Learning: Spark ML, MXNet, TensorFlow
● Interactive Notebooks: Jupyter, Zeppelin
● NoSQL: HBase
Limited ability to fast-follow app versions
Cost Optimization using Amazon EMR
Amazon EMR Benefits
Amazon S3 is your persistent storage: 99.999999999% (11 nines) durability, low cost with
multiple storage classes, lifecycle policies, versioning, distributed by default, and EMRFS
Decouple storage and compute
Turn off the cluster
Auto-scaling | Persistent & transient clusters
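As a sketch of what decoupling looks like in practice, here is a hypothetical PySpark job for EMR that reads from and writes to Amazon S3 via EMRFS, so the data outlives any individual cluster (bucket names and schema are assumptions):

```python
# Sketch: with EMRFS, Spark jobs on EMR read/write Amazon S3 directly,
# so the cluster can be turned off without losing data. Paths hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-decoupled-etl").getOrCreate()

# Read persistent input from S3 rather than cluster-local HDFS
events = spark.read.parquet("s3://my-data-lake/raw/events/")

daily = events.groupBy("event_date").count()

# Write results back to S3; the cluster can now be terminated safely
daily.write.mode("overwrite").parquet("s3://my-data-lake/curated/daily_counts/")
```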
Logical separation of jobs/applications
Re-architect Monolithic to Purpose-built
clusters by:
• Creating Transient and/or Persistent clusters
• Separating clusters by Application
• Separating clusters by Application Version
• Isolating Department specific clusters
Design considerations are given to:
• How you submit jobs or build pipelines
• Persisting your data in Amazon S3
• Storing metadata off the cluster
• How long the job runs
• What applications are needed
Purpose-built Clusters
Traditional Monolithic Cluster
Built-in disaster recovery
[Diagram: Cluster 1, Cluster 2, Cluster 3, and Cluster 4 within an Availability Zone]
Parallelization on Spot can drastically reduce time-to-insight and cost.
Example 1: Baseline using Reserved Instances (RI)
10 node cluster running for 14 hours
Cost = $1.0 * 10 nodes * 14 hours = $140
Example 2: Scale more nodes with Spot
Add 10 more nodes of Spot at 50% discount
20 node cluster running for 7 hours
Cost = $1.0 * 10 nodes * 7 hours = $70
= $0.5 * 10 nodes * 7 hours = $35
Total $105
Auto-scale nodes with Spot instances
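A few lines of Python reproduce the slide's arithmetic (the rates are the slide's illustrative figures, not real prices):

```python
# Reproducing the slide's Spot-vs-baseline arithmetic. Rates are illustrative.
ON_DEMAND_RATE = 1.00  # $/node-hour (baseline RI/on-demand rate from the slide)
SPOT_RATE = 0.50       # $/node-hour at the slide's 50% Spot discount

def cluster_cost(od_nodes: int, spot_nodes: int, hours: float) -> float:
    """Total cost of a run with a mix of on-demand/RI and Spot nodes."""
    return od_nodes * ON_DEMAND_RATE * hours + spot_nodes * SPOT_RATE * hours

baseline = cluster_cost(od_nodes=10, spot_nodes=0, hours=14)    # $140
with_spot = cluster_cost(od_nodes=10, spot_nodes=10, hours=7)   # $70 + $35 = $105

print(f"Baseline: ${baseline:.0f}, doubled with Spot: ${with_spot:.0f}")
```

Doubling the cluster halves the runtime, so the Spot half of the fleet buys a faster result and a lower total bill at the same time.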
● The EMR Runtime for Apache Spark available in Amazon EMR
v5.28 realized Spark improvements of up to 32x on the TPC-DS
3 TB dataset in comparison to Amazon EMR v5.16
(reference)
● The Amazon EMR Runtime for Apache Spark maintains API
compatibility with OSS Spark
● More coming every release
Spark performance improvements
Analysts confirm lowest TCO
Feb. 2019, Forrester recognizes
Amazon EMR as the Cloud
Hadoop/Spark (HARK) Leader.
Nov. 2018, IDC report confirms:
“EMR provides 57% reduced costs
vs. on premise resulting in 342%
ROI over 5 years.”
Dec. 2018, Gartner suggests:
“AWS remains the largest
Hadoop provider in terms of
both revenue and user base.”
The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and
Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester
Wave™ is a graphical representation of Forrester's call on a market and is
plotted using a detailed spreadsheet with exposed scores, weightings, and
comments. Forrester does not endorse any vendor, product, or service
depicted in the Forrester Wave™. Information is based on best available
resources. Opinions reflect judgment at the time and are subject to change.
Benefits Summary
1. Decoupled compute & storage
2. Built-in disaster recovery
3. Turn off your clusters after use
4. Agility of auto-scaling of the clusters
5. Leverage Spot pricing for unused Amazon EC2 capacity
6. Self-service with AWS Service Catalog
7. Spark performance improvements
8. Fully managed Amazon EMR Notebooks
9. Centralized assets and data pipeline orchestration
10. Lowest TCO in the Industry, analysts confirm
11. Amazon EMR is surrounded by the industry’s broadest
analytics ecosystem
The Next-Gen Ecosystem
that Supports You
Serverless analytics
• Amazon S3 (data lake)
• AWS Glue (ETL & Data Catalog)
• Amazon Athena
• Amazon QuickSight
Serverless: zero infrastructure, zero administration
• Never pay for idle resources
• Availability and fault tolerance built in
• Automatically scales resources with usage
Data sources: AWS IoT, AI/ML, devices, web, sensors, social
AWS Glue
• Data Catalog: discover data and extract schema
• ETL job authoring: auto-generates customizable ETL code in Python, Scala,
and Spark
Data Catalog
• Glue crawlers automatically discover data and store schema
• Catalog makes data searchable and available for ETL and queries
• Computes statistics to make queries efficient
Serverless ETL & Data Catalog
ETL
• Generates customizable code for common file
type conversion and partitioning
• Schedules and runs your ETL jobs
• Serverless, flexible, and built on open standards
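To make the two halves concrete, here is a hedged boto3 sketch of registering S3 data with a crawler, followed by a minimal Glue ETL script of the kind Glue auto-generates (the role, database, table, schedule, and paths are hypothetical):

```python
# Sketch: register S3 data in the Glue Data Catalog with a crawler.
# Role name, database, and S3 path are hypothetical.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="events-crawler",
    Role="AWSGlueServiceRole-demo",          # IAM role with S3 read access
    DatabaseName="analytics",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/events/"}]},
    Schedule="cron(0 2 * * ? *)",            # pick up new partitions nightly
)
glue.start_crawler(Name="events-crawler")
```

And a minimal Glue job script that reads the crawled table, converts it to Parquet, and partitions the output:

```python
# Sketch of a Glue ETL job script, similar in shape to what Glue generates.
# Database, table, and output path are hypothetical.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table registered by the crawler
events = glue_context.create_dynamic_frame.from_catalog(
    database="analytics", table_name="events")

# Convert to Parquet, partitioned by date, back into the data lake
glue_context.write_dynamic_frame.from_options(
    frame=events,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/curated/events/",
                        "partitionKeys": ["event_date"]},
    format="parquet",
)
job.commit()
```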
Amazon Athena
Zero setup cost; just point to
Amazon S3 and start querying
ANSI SQL interface,
JDBC/ODBC drivers,
multiple formats,
compression types, and
complex joins and data
types
Serverless: zero
infrastructure, zero
administration
Integrated with QuickSight
Pay only for queries run;
save 30-90% on per-query
costs through compression
Query instantly | Open | Easy | Pay per query
Serverless Interactive Query engine
• Interactive query service to analyze data in Amazon S3 using standard SQL
• No infrastructure to set up or manage and no data to load
• Ability to run SQL queries on data archived in Amazon S3 Glacier
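A minimal boto3 sketch of the query flow (the database, table, and output location are hypothetical):

```python
# Sketch: run a SQL query against S3 data with Athena and print the results.
# Database, table, and output location are hypothetical.
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

qid = athena.start_query_execution(
    QueryString="SELECT event_date, count(*) FROM events GROUP BY event_date",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query leaves the QUEUED/RUNNING states
while True:
    state = athena.get_query_execution(
        QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state not in ("QUEUED", "RUNNING"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows[1:]:  # first row is the column header
        print([col.get("VarCharValue") for col in row["Data"]])
```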
90% of your
Hadoop Costs
Hadoop Common Pipeline Pattern 1
90% of your
Hadoop Costs
Hadoop Common Pipeline Pattern 2
2-3x cost reduction
From Big Data to Fast Data
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMR
125 University Avenue
Suite 290, Palo Alto
California, 94301
provectus.com
Questions, details?
We would be happy to answer!