Cost Optimization for Hadoop/Spark
Workloads with Amazon EMR
Presented by:
June 2, 2020
Pritpal Sahota
Technical Account Manager
Provectus
Stepan Pushkarev
Chief Technology Officer
Provectus
Nirav Shah
Senior Solutions Architect
Amazon Web Services
Perry Peterson
Business Development Manager
Amazon Web Services
1. Show how to optimize costs by migrating to Amazon EMR
2. Cover risk mitigation and best practices for migrating Hadoop/Spark
workloads to Amazon EMR
Webinar Objectives
• Introduction
• Hadoop market and Cost optimizations using Amazon EMR
• Cost related and other challenges of on-prem Hadoop clusters
• Cost optimizations by using Amazon EMR and migration best
practices
• Amazon EMR migration acceleration workshop overview
Agenda
Stepan Pushkarev
Chief Technology
Officer
Provectus
Pritpal Sahota
Technical Account
Executive
Provectus
Presenters
Nirav Shah
Senior Solutions
Architect
Amazon Web Services
Perry Peterson
Business Development
Manager – Analytics
Amazon Web Services
AWS Partner Network (APN) Premier Consulting Partner
AI-first Consultancy & Solutions Provider
Clients ranging from
fast-growing startups
through large
enterprises
450 employees and
growing
Established in 2010
HQ in Palo Alto
Offices across the US,
Canada, and Europe
Machine Learning
Employ analytical algorithms
to unveil hidden value from
raw data that helps solve
business challenges
DevOps/DevSecOps
Improve development and
delivery pipelines to bring
your product to the market
faster and more resiliently
Next Gen Cloud
Modernize your application
and data landscape to allow
for more agility and better
service to your customers
Big Data
Gain data-driven insights
through the holistic data
analysis made available with
a big data platform
AWS Competencies in Machine Learning, Data & Analytics, and DevOps
Core Competencies
Innovative Tech Vendors
Seeking niche expertise to
differentiate and win in the market
Enterprises
Seeking to accelerate innovation and
achieve operational excellence
Clientele
Hadoop Market and Cost Optimization
using Amazon EMR
Rapid growth of cloud adoption in big data space
7.5x faster than on-prem installs as per Forrester Research
Uncertainty with leading Hadoop commercial vendors
Leading commercial Hadoop vendors face uncertainty & headwinds. Customers are
exploring cloud to leverage cost benefits, flexibility, scalability, & price-performance
Large & growing Hadoop market
According to a market study report, over the next five years the Hadoop market
will register a 33% annual revenue growth with market size reaching $9.4B by 2024
Availability of Resources
Big data engineers prefer to work on cloud-based big data solutions
Hadoop market
Amazon EMR is an enterprise-grade Spark/Hadoop managed service helping businesses, researchers, data analysts, and developers to process and
analyze vast amounts of data. EMR solves complex technical/business challenges: clickstream and log analysis along with real-time and predictive
analytics. In comparison to on-premises deployments, IDC confirms Amazon EMR provides year 1 savings of 57% and 342% ROI over 5 years.
What is EMR & where is it in the Analytics stack?
EMR powers most cloud Hadoop/Spark projects
processes 135B events/day and has cost savings of 60% (~$20M)
decreased costs by $600k in less than 5 months
saves 75% and is 60% more efficient
achieves cost savings of 55% when compared to on-demand
pricing and 40% savings when compared to Reserved Instances
High-impact results with Amazon EMR
near real-time analytics for 140M players
scales 3,000 transient clusters on a daily basis
powers the Predix solution processing 1M data executions/day
computes Zestimates on 100M+ homes in hours instead of 1 day
reduced cost of operation and improved Spark performance 3x
High-impact results with Amazon EMR
NinthDecimal is an omnichannel marketing platform
helping Fortune 500 brands identify new prospects and
customers, drive store visits, and increase sales using
AI- and data-driven consumer intelligence.
NinthDecimal is seeing a 3x speedup for Spark workloads
on Amazon EMR and a 3-5x cost reduction. This means
better SLAs for delivering insights to clients and an
improved bottom line for the business.
IMVU is the world’s largest avatar-based social network,
serving 6M+ players and offering 40M+ virtual goods.
IMVU has migrated 450+ Spark & Hive jobs and re-architected
its monolithic Hadoop environment into transient Amazon EMR
clusters orchestrated with Airflow pipelines.
By moving to AWS and Amazon EMR, IMVU saved 30% on costs and
became 80% more efficient in data engineering and analytics.
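The transient-cluster pattern can be sketched with Airflow's AWS provider. The following is a minimal, hypothetical example (cluster sizing, job names, and S3 paths are assumptions, not IMVU's actual pipeline): create an EMR cluster with a Spark step, let it run, and have it terminate itself when the step finishes.

```python
# Minimal Airflow DAG sketch for a transient EMR cluster.
# Cluster config, job names, and S3 paths are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrCreateJobFlowOperator
from airflow.providers.amazon.aws.sensors.emr import EmrJobFlowSensor

JOB_FLOW = {
    "Name": "transient-spark-etl",
    "ReleaseLabel": "emr-5.30.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate once all steps finish
    },
    "Steps": [{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],
        },
    }],
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

with DAG("transient_emr_etl", start_date=datetime(2020, 6, 1),
         schedule_interval="@daily", catchup=False) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_cluster", job_flow_overrides=JOB_FLOW)
    # Wait until the job flow (and therefore the cluster) terminates
    wait_for_done = EmrJobFlowSensor(
        task_id="wait_for_done",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster') }}")
    create_cluster >> wait_for_done
```

Because the cluster exists only for the life of the DAG run and all data stays in S3, there is nothing to pay for between runs.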
57%
reduction in cost of ownership
342%
five-year ROI
8 months
to breakeven
99%
reduction in unplanned downtime
33%
more efficient Big Data teams
46%
more efficient Big Data/Hadoop management staff
Referenced IDC White Paper: "The Economic Benefits of Migrating Apache Spark and Hadoop to Amazon EMR"
IDC study: Hadoop to Amazon EMR migration
Amazon EMR Migration Patterns
and Best Practices Overview
Amazon EMR Migration Patterns
On-Premises → Lift & Shift → Instance Right-Sizing → S3 vs. HDFS → Transient clusters
● Lift & Shift
a. Low risk & lowest migration cost
b. Very high ongoing cost
c. Low business value addition
d. Quickest time to market
● Re-Architect - Migrate to Amazon EMR with a new architecture and
complementary services to optimize cost and provide additional
functionality, scalability, flexibility, etc.
a. Medium risk, medium migration cost
b. Medium ongoing cost
c. High business value addition
d. Medium time to market
● Next Gen Architecture - Migrate to Amazon EMR with a completely new
architecture, which may include streaming and containers, with added
functionality, scalability, flexibility, etc.
a. High risk, highest migration cost
b. Lowest ongoing cost
c. Highest business value addition
d. Longest time to market
An approach to best practice deployment
Go beyond a lift & shift to optimize for scale and cost.
On-Premises → Lift & Shift → Instance Right-Sizing → Amazon S3 vs. HDFS → Transient clusters → Auto-scaling → Spot Pricing → Automated Orchestration → Amazon EMR Optimized → True TCO comparison
Business factors:
• Capex → Opex
• On-prem license fees
• Maintenance overhead
• Uncertainty in Hadoop vendors
• Lowest pricing compared to other Hadoop/Spark premium vendors
Amazon EMR value add:
• Decoupled storage & compute
• Transient clusters
• Spot pricing
• Autoscaling
• Optimized hardware
• Amazon S3 lifecycle
• Proprietary Spark Amazon EMR engine
Next Gen Architecture value add:
• Data pipeline optimization
• Stream processing
• Serverless ETL
• Serverless ad-hoc queries
• Serverless Data Catalog
• Workload decomposition (Amazon EMR, Amazon Redshift, Athena, SageMaker)
10-20% cost reduction + 10-40% reduction + 20-90% reduction
Overview of Cost Optimization Factors
Migration Risk Mitigation Strategies
On-Premises → Lift & Shift → Instance Right-Sizing → S3 vs. HDFS → Transient clusters → Auto-scaling → Spot Pricing → Automated Orchestration → EMR Optimized → True TCO comparison
● Analyze all applications and workloads to ascertain
compute, memory, storage, runtimes by day/week/month,
and any other infrastructure needs
● Develop a business value and implementation complexity
model for all applications and workloads; plot them on a
business value vs. complexity prioritization matrix
● Mirror data loads onto the Amazon EMR cluster in an
organized way, in parallel with the on-prem Hadoop cluster
● Start moving workloads onto Amazon EMR in an orderly
fashion
● Identify enthusiastic innovators within each business unit
to promote the on-prem to Amazon EMR migration
● Work with experts like Provectus to lead this effort
[Chart: prioritization matrix plotting workloads A-G by business value vs.
complexity, used to select the initial workloads to migrate]
1. Build a business case for Amazon EMR migration, including a comparative
cost analysis
2. Develop a risk mitigation plan
3. Design a Next-Gen Data Platform and its adoption roadmap
4. Execute the migration and re-architecture hands-on
How Provectus can help
Cost and other challenges of On-Prem
Hadoop/Spark Environments
Compute and storage growth
Tightly coupled
● Storage grows along with
compute
● Compute requirements vary
3x
● Data is replicated several times
● Typically only in one data center
Underutilized or scarce resources
[Chart: cluster utilization over time, showing re-processing spikes, weekly
peaks, and a steady-state baseline]
Contention for the same resources
Compute
bound
Memory
bound
With a monolithic cluster, downstream application dependencies may prevent you
from upgrading versions. By not upgrading, organizations could be limiting innovation.
● Large Scale Transformation: Map/Reduce, Hive, Pig, Spark
● Interactive Queries: Impala, Spark SQL, Presto
● Machine Learning: Spark ML, MXNet, TensorFlow
● Interactive Notebooks: Jupyter, Zeppelin
● NoSQL: HBase
Limited ability to fast-follow app versions
Cost Optimization using Amazon EMR
Amazon EMR Benefits
Amazon S3 is your persistent storage: 99.999999999% (11 nines) durability, low cost with
multiple storage classes, lifecycle policies, versioning, distributed by default, and EMRFS
Decouple storage and compute
Turn off the cluster
Auto-scaling | Persistent & transient clusters
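As a sketch of what decoupling looks like in practice, here is a hypothetical PySpark job for EMR that reads from and writes to Amazon S3 via EMRFS, so the data outlives any individual cluster (bucket names and schema are assumptions):

```python
# Sketch: with EMRFS, Spark jobs on EMR read/write Amazon S3 directly,
# so the cluster can be turned off without losing data. Paths hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-decoupled-etl").getOrCreate()

# Read persistent input from S3 rather than cluster-local HDFS
events = spark.read.parquet("s3://my-data-lake/raw/events/")

daily = events.groupBy("event_date").count()

# Write results back to S3; the cluster can now be terminated safely
daily.write.mode("overwrite").parquet("s3://my-data-lake/curated/daily_counts/")
```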
Logical separation of jobs/applications
Re-architect Monolithic to Purpose-built
clusters by:
• Creating Transient and/or Persistent clusters
• Separating clusters by Application
• Separating clusters by Application Version
• Isolating Department specific clusters
Design considerations are given to:
• How you submit jobs or build pipelines
• Persisting your data in Amazon S3
• Storing metadata off the cluster
• How long the job runs
• What applications are needed
Purpose-built Clusters
Traditional Monolithic Cluster
Built-in disaster recovery
[Diagram: Cluster 1, Cluster 2, Cluster 3, and Cluster 4 within an Availability Zone]
Parallelization on Spot can drastically reduce time-to-insight and cost.
Example 1: Baseline using Reserved Instances (RI)
10 node cluster running for 14 hours
Cost = $1.0 * 10 nodes * 14 hours = $140
Example 2: Scale more nodes with Spot
Add 10 more nodes of Spot at 50% discount
20 node cluster running for 7 hours
Cost = $1.0 * 10 nodes * 7 hours = $70
= $0.5 * 10 nodes * 7 hours = $35
Total $105
Auto-scale nodes with Spot instances
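A few lines of Python reproduce the slide's arithmetic (the rates are the slide's illustrative figures, not real prices):

```python
# Reproducing the slide's Spot-vs-baseline arithmetic. Rates are illustrative.
ON_DEMAND_RATE = 1.00  # $/node-hour (baseline RI/on-demand rate from the slide)
SPOT_RATE = 0.50       # $/node-hour at the slide's 50% Spot discount

def cluster_cost(od_nodes: int, spot_nodes: int, hours: float) -> float:
    """Total cost of a run with a mix of on-demand/RI and Spot nodes."""
    return od_nodes * ON_DEMAND_RATE * hours + spot_nodes * SPOT_RATE * hours

baseline = cluster_cost(od_nodes=10, spot_nodes=0, hours=14)    # $140
with_spot = cluster_cost(od_nodes=10, spot_nodes=10, hours=7)   # $70 + $35 = $105

print(f"Baseline: ${baseline:.0f}, doubled with Spot: ${with_spot:.0f}")
```

Doubling the cluster halves the runtime, so the Spot half of the fleet buys a faster result and a lower total bill at the same time.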
● The EMR Runtime for Apache Spark available in Amazon EMR
v5.28 realized Spark improvements of up to 32x on the TPC-DS
3 TB dataset in comparison to Amazon EMR v5.16
(reference)
● The Amazon EMR Runtime for Apache Spark maintains API
compatibility with OSS Spark
● More coming every release
Spark performance improvements
Analysts confirm lowest TCO
Feb. 2019, Forrester recognizes
Amazon EMR as the Cloud
Hadoop/Spark (HARK) Leader.
Nov. 2018, IDC report confirms:
“EMR provides 57% reduced costs
vs. on premise resulting in 342%
ROI over 5 years.”
Dec. 2018, Gartner suggests:
“AWS remains the largest
Hadoop provider in terms of
both revenue and user base.”
The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and
Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester
Wave™ is a graphical representation of Forrester's call on a market and is
plotted using a detailed spreadsheet with exposed scores, weightings, and
comments. Forrester does not endorse any vendor, product, or service
depicted in the Forrester Wave™. Information is based on best available
resources. Opinions reflect judgment at the time and are subject to change.
Benefits Summary
1. Decoupled compute & storage
2. Built-in disaster recovery
3. Turn off your clusters after use
4. Agility of auto-scaling of the clusters
5. Leverage Spot pricing for unused Amazon EC2 capacity
6. Self-service with AWS Service Catalog
7. Spark performance improvements
8. Fully managed Amazon EMR Notebooks
9. Centralized assets and data pipeline orchestration
10. Lowest TCO in the Industry, analysts confirm
11. Amazon EMR is surrounded by the industry’s broadest
analytics ecosystem
The Next-Gen Ecosystem
that Supports You
Serverless analytics
• Amazon S3 (data lake)
• AWS Glue (ETL & Data Catalog)
• Amazon Athena
• Amazon QuickSight
Serverless: zero infrastructure, zero administration
• Never pay for idle resources
• Availability and fault tolerance built in
• Automatically scales resources with usage
Data sources: AWS IoT, AI/ML, devices, web, sensors, social
AWS Glue
• Data Catalog: discover data and extract schema
• ETL job authoring: auto-generates customizable ETL code in Python, Scala,
and Spark
Data Catalog
• Glue crawlers automatically discover data and store schema
• Catalog makes data searchable and available for ETL and queries
• Computes statistics to make queries efficient
Serverless ETL & Data Catalog
ETL
• Generates customizable code for common file
type conversion and partitioning
• Schedules and runs your ETL jobs
• Serverless, flexible, and built on open standards
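To make the two halves concrete, here is a hedged boto3 sketch of registering S3 data with a crawler, followed by a minimal Glue ETL script of the kind Glue auto-generates (the role, database, table, schedule, and paths are hypothetical):

```python
# Sketch: register S3 data in the Glue Data Catalog with a crawler.
# Role name, database, and S3 path are hypothetical.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="events-crawler",
    Role="AWSGlueServiceRole-demo",          # IAM role with S3 read access
    DatabaseName="analytics",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/events/"}]},
    Schedule="cron(0 2 * * ? *)",            # pick up new partitions nightly
)
glue.start_crawler(Name="events-crawler")
```

And a minimal Glue job script that reads the crawled table, converts it to Parquet, and partitions the output:

```python
# Sketch of a Glue ETL job script, similar in shape to what Glue generates.
# Database, table, and output path are hypothetical.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table registered by the crawler
events = glue_context.create_dynamic_frame.from_catalog(
    database="analytics", table_name="events")

# Convert to Parquet, partitioned by date, back into the data lake
glue_context.write_dynamic_frame.from_options(
    frame=events,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/curated/events/",
                        "partitionKeys": ["event_date"]},
    format="parquet",
)
job.commit()
```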
Amazon Athena
Zero setup cost; just point to
Amazon S3 and start querying
ANSI SQL interface,
JDBC/ODBC drivers,
multiple formats,
compression types, and
complex joins and data
types
Serverless: zero
infrastructure, zero
administration
Integrated with QuickSight
Pay only for queries run;
save 30-90% on per-query
costs through compression
Query instantly | Open | Easy | Pay per query
Serverless Interactive Query engine
• Interactive query service to analyze data in Amazon S3 using standard SQL
• No infrastructure to set up or manage and no data to load
• Ability to run SQL queries on data archived in Amazon S3 Glacier
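A minimal boto3 sketch of the query flow (the database, table, and output location are hypothetical):

```python
# Sketch: run a SQL query against S3 data with Athena and print the results.
# Database, table, and output location are hypothetical.
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

qid = athena.start_query_execution(
    QueryString="SELECT event_date, count(*) FROM events GROUP BY event_date",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query leaves the QUEUED/RUNNING states
while True:
    state = athena.get_query_execution(
        QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state not in ("QUEUED", "RUNNING"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows[1:]:  # first row is the column header
        print([col.get("VarCharValue") for col in row["Data"]])
```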
90% of your
Hadoop Costs
Hadoop Common Pipeline Pattern 1
90% of your
Hadoop Costs
Hadoop Common Pipeline Pattern 2
2-3x cost reduction
From Big Data to Fast Data
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMR
125 University Avenue
Suite 290, Palo Alto
California, 94301
provectus.com
Questions, details?
We would be happy to answer!