1
How Data Volume Affects Spark Based Data Analytics on a
Scale-up Server
Ahsan Javed Awan
EMJD-DC (KTH-UPC)
(https://www.kth.se/profile/ajawan/)
Mats Brorsson (KTH), Vladimir Vlassov (KTH) and Eduard
Ayguade (UPC and BSC)
2
Motivation
Why should we care about architecture support?
*Source: SGI
Data Growing Faster Than Technology
4
Motivation
Cont...
Our Focus
Improve the node level performance
through architecture support
*Source: http://navcode.info/2012/12/24/cloud-scaling-schemes/
Phoenix ++,
Metis, Ostrich,
etc..
Hadoop, Spark,
Flink, etc..
5
Motivation
Cont...
● A mismatch between the characteristics of emerging workloads and the underlying
hardware.
– M. Ferdman et al., “Clearing the clouds: A study of emerging scale-out workloads
on modern hardware,” in ASPLOS 2012.
– Z. Jia et al., “Characterizing data analysis workloads in data centers,” in IISWC
2013.
– Z. Jia et al., “Characterizing and subsetting big data workloads,” in IISWC 2014.
– A. Yasin et al., “Deep-dive analysis of the data analytics workload in CloudSuite,” in
IISWC 2014.
– T. Jiang et al., “Understanding the behavior of in-memory computing workloads,” in
IISWC 2014.
Existing studies lack a quantitative analysis of the bottlenecks of
scale-out frameworks on a single node
6
Which Scale-out Framework ?
[Picture Courtesy: Amir H. Payberah]
7
Our Approach
● “Performance characterization of in-memory data analytics on a
modern cloud server,” in 5th IEEE International Conference on
Big Data and Cloud Computing, 2015 (Best Paper Award).
● How Data Volume Affects Spark Based Data Analytics on a
Scale-up Server
What are the major bottlenecks??
Focus of this talk
8
Our Approach
● Do Spark based data analytics benefit from using scale-up
servers?
● How severe is the impact of garbage collection on performance
of Spark based data analytics?
● Is file I/O detrimental to Spark based data analytics
performance?
● How does data size affect the micro-architecture performance
of Spark based data analytics?
What are the remaining questions??
9
Our Approach
● We evaluate the impact of data volume on the performance of
Spark based data analytics running on a scale-up server.
● We quantify the limitations of using Spark on a scale-up server
with large volumes of data.
● We quantify the variations in micro-architectural performance of
applications across different data volumes.
What are the contributions??
10
Our Approach
● Use a subset of benchmarks from BigDataBench
● Use the Big Data Generator Suite (BDGS) to generate synthetic
datasets of 6 GB, 12 GB and 24 GB.
● Configure Spark in local mode and tune its internal parameters.
● Rely on GC logs to collect garbage collection times.
● Use Spark logs to gather the execution time of benchmarks.
● Use Concurrency Analysis in Intel VTune to collect wait time and CPU
time of executor pool threads.
● Use General Micro-architectural Exploration in Intel VTune to analyze the
impact of data volume on micro-architecture characteristics.
Methodology
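The GC-log step of the methodology can be illustrated with a minimal sketch. It assumes JDK 7/8 style logs produced with `-XX:+PrintGCDetails`, where each GC event is a single line ending in `, <seconds> secs]`; the slides do not state the JDK version or log format used, so the pattern and the sample lines below are assumptions, not the authors' tooling.

```python
import re

# Assumed JDK 7/8 -XX:+PrintGCDetails format: every GC event reports its
# pause as ", <seconds> secs]". The trailing "[Times: ... real=0.01 secs]"
# group is not matched because its number follows "real=", not ", ".
PAUSE_RE = re.compile(r", (\d+\.\d+) secs\]")


def total_gc_seconds(log_lines):
    """Sum GC pause times, treating each log line as one GC event."""
    total = 0.0
    for line in log_lines:
        matches = PAUSE_RE.findall(line)
        if matches:
            # Some collectors print a per-generation pause first;
            # the last match on the line is the pause of the whole event.
            total += float(matches[-1])
    return total


# Hypothetical sample lines, written by hand for illustration only.
sample_log = [
    "[GC (Allocation Failure) [PSYoungGen: 65536K->10720K(76288K)] "
    "65536K->20000K(251392K), 0.0123456 secs] "
    "[Times: user=0.05 sys=0.01, real=0.01 secs]",
    "[Full GC (Ergonomics) [PSYoungGen: 10720K->0K(76288K)] "
    "[ParOldGen: 100000K->50000K(175104K)] 110720K->50000K(251392K), "
    "[Metaspace: 3000K->3000K(1056768K)], 0.2500000 secs] "
    "[Times: user=0.90 sys=0.02, real=0.25 secs]",
    "Heap after GC invocations=2 (full 1):",
]

if __name__ == "__main__":
    print(f"total GC time: {total_gc_seconds(sample_log):.4f} s")
```

Dividing this total by the execution time taken from the Spark logs gives the proportion of time spent in GC for a run.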
11
Our Approach
What are the characteristics of benchmarks?
12
Our Hardware Configuration
System Details
13
Our Hardware Configuration
Machine Details
Hyper Threading and Turbo-boost are disabled
14
Our Approach
Software Parameters
15
Motivation
Do Spark based data analytics benefit from using larger
scale-up servers?
Spark applications do not benefit significantly from executors with more than 12 cores
16
Motivation
Is GC detrimental to scalability of Spark applications?
The proportion of GC time increases with the number of cores
17
Motivation
Does performance remain consistent as we enlarge the data
size?
The decrease in data processed per second (DPS) ranges from 11% to 93% (Parallel Scavenge)
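As a worked example of the metric, the sketch below assumes DPS is the input data size divided by the benchmark execution time, which is the natural reading of "data processed per second" but is not spelled out on the slide. The timings are hypothetical and are not results from the talk.

```python
def data_processed_per_second(input_gb, exec_seconds):
    """DPS in GB/s, assuming DPS = input size / execution time."""
    return input_gb / exec_seconds


def percent_decrease(baseline, value):
    """Relative drop of `value` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


# Hypothetical timings: a 4x larger input that takes 8x longer to process.
dps_small = data_processed_per_second(6.0, 60.0)
dps_large = data_processed_per_second(24.0, 480.0)

if __name__ == "__main__":
    print(f"DPS at 6 GB:  {dps_small:.3f} GB/s")
    print(f"DPS at 24 GB: {dps_large:.3f} GB/s")
    print(f"decrease:     {percent_decrease(dps_small, dps_large):.1f}%")
```

If execution time grew in proportion to input size, DPS would stay constant and the decrease would be 0%, so any non-zero decrease indicates that the benchmark slows down more than linearly.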
18
Motivation
Does the choice of garbage collector impact the data
processing capability of the system?
Improvement in DPS ranges from 1.4x to 3.7x on average
with Parallel Scavenge as compared to G1
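A comparison like this needs one run per collector with GC logging enabled. The sketch below builds a `spark-submit` command line for Spark in local mode; because executor threads share the driver JVM in local mode, the JVM flags are passed through `--driver-java-options`. The collector and logging flags are the HotSpot options as they exist on JDK 7/8 (the CMS and `PrintGC*` flags were deprecated or removed in later JDKs). The jar name, class name and log path are hypothetical, and the slides do not show the exact command the authors used.

```python
# HotSpot flags selecting each collector (valid on JDK 7/8).
GC_FLAGS = {
    "parallel_scavenge": "-XX:+UseParallelGC",
    "cms": "-XX:+UseConcMarkSweepGC",
    "g1": "-XX:+UseG1GC",
}


def spark_submit_command(collector, cores, app_jar, main_class, gc_log_path):
    """Return an argv list that runs a Spark app in local mode with GC logging."""
    java_opts = " ".join([
        GC_FLAGS[collector],
        "-verbose:gc",
        "-XX:+PrintGCDetails",
        "-XX:+PrintGCTimeStamps",
        f"-Xloggc:{gc_log_path}",
    ])
    return [
        "spark-submit",
        "--master", f"local[{cores}]",         # local mode with `cores` threads
        "--driver-java-options", java_opts,    # local mode: one shared JVM
        "--class", main_class,
        app_jar,
    ]


if __name__ == "__main__":
    # Hypothetical benchmark jar and class names, for illustration only.
    for name in GC_FLAGS:
        cmd = spark_submit_command(name, 12, "benchmarks.jar",
                                   "bench.WordCount", f"/tmp/gc-{name}.log")
        print(" ".join(cmd))
```

Returning an argv list rather than a single string keeps the space-separated JVM options together as one argument when the command is handed to `subprocess.run`.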
19
Motivation
How does GC affect the data processing capability of
the system?
GC time does not scale linearly with data size.
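One way to check this kind of claim is to normalise the growth in GC time by the growth in data size. In the sketch below a factor of 1.0 means GC time grew exactly in proportion to the input, and larger values mean it grew faster. The GC times are hypothetical placeholders, not measurements from the talk; only the three dataset sizes come from the methodology.

```python
def scaling_factors(sizes_gb, gc_seconds):
    """GC-time growth divided by data-size growth, relative to the first run."""
    base_size, base_gc = sizes_gb[0], gc_seconds[0]
    return [
        (gc / base_gc) / (size / base_size)
        for size, gc in zip(sizes_gb, gc_seconds)
    ]


# Dataset sizes from the methodology; GC times are made up for illustration.
sizes = [6.0, 12.0, 24.0]
hypothetical_gc = [10.0, 30.0, 120.0]

if __name__ == "__main__":
    for size, factor in zip(sizes, scaling_factors(sizes, hypothetical_gc)):
        print(f"{size:>5.0f} GB: GC time grows {factor:.2f}x faster than the data")
```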
20
Motivation
How does CPU utilization scale with data volume?
CPU utilization decreases as the input data size increases
21
Motivation
Is file I/O detrimental to performance?
The fraction of file I/O increases by 6x, 18x and 25x for Word Count,
Naive Bayes and Sort respectively when the input data is increased by 4x
22
Motivation
How does data size affect micro-architectural
performance?
5 to 10% better instruction retirement as we enlarge the data size
23
Motivation
Cont..
Execution units inside the core exhibit improved utilization at larger data sets
24
Motivation
Cont..
Increase in L1 Bound Stalls implies better utilization of L1 Caches
25
Motivation
Cont..
Spark benchmarks exhibit reduced memory bandwidth utilization
26
Key Findings
● Spark workloads do not benefit significantly from executors with
more than 12 cores.
● The performance of Spark workloads degrades with large volumes
of data due to substantial increase in garbage collection and file
I/O time.
● Without any tuning, the Parallel Scavenge garbage collection scheme
outperforms the Concurrent Mark Sweep and G1 garbage collectors
for Spark workloads.
● Spark workloads exhibit improved instruction retirement due to
lower L1 cache misses and better utilization of functional units
inside cores at large volumes of data.
● Memory bandwidth utilization of Spark benchmarks decreases
with large volumes of data and is 3x lower than the available
off-chip bandwidth on our test machine.
27
Motivation
Future Directions
NUMA Aware Task Scheduling
Cache Aware Transformations
Exploiting Processing In Memory Architectures
HW/SW Data Prefetching
Rethinking Memory Architectures

