Brought to you by
Get Lower Latency and Higher Throughput for Java Applications
Simon Ritter
Deputy CTO at
Simon Ritter
Deputy CTO
■ Java Champion and two-time JavaOne Rockstar
■ 99th percentile is the hard part of performance
■ Away from work, my son and I are restoring a Classic Mini
JVM Performance Challenges
■ Latency
● Biggest issue is Garbage Collection
● Stop-the-world pauses for almost all collectors
● Pauses are typically proportional to heap size, not live data
■ Throughput
● Adaptive JIT compilation: Interpreted, C1 compiled, C2 compiled
● Deoptimisations
● Level of optimisation is key
■ Warmup
● Time taken to get to fully optimised code for all hot methods
● Restart of an application requires the same warmup work to be carried out
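One way to observe the warmup effect described above is to time identical batches of work in a fresh JVM and watch the per-batch time change as the JIT tiers kick in. The following is a minimal sketch, not a rigorous benchmark; the class `WarmupProbe`, the method `work` and the batch sizes are all hypothetical names and values chosen for illustration.

```java
import java.util.Arrays;

final class WarmupProbe {
    // Hypothetical hot method: sums squares so the JIT has a loop to optimise.
    static long work(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += (long) i * i;
        }
        return sum;
    }

    // Times `rounds` consecutive batches of calls to work() and
    // returns the elapsed nanoseconds for each batch.
    static long[] timeBatches(int rounds, int callsPerBatch, int n) {
        long[] nanos = new long[rounds];
        long sink = 0; // accumulated so the calls cannot be removed as dead code
        for (int r = 0; r < rounds; r++) {
            long start = System.nanoTime();
            for (int c = 0; c < callsPerBatch; c++) {
                sink += work(n);
            }
            nanos[r] = System.nanoTime() - start;
        }
        if (sink == Long.MIN_VALUE) {
            System.out.println(sink); // practically never taken; keeps sink live
        }
        return nanos;
    }

    public static void main(String[] args) {
        // Early batches typically run slower than later ones on an adaptive JIT.
        System.out.println(Arrays.toString(timeBatches(10, 2_000, 1_000)));
    }
}
```

The absolute numbers depend on the JVM, flags and hardware, so no expected output is shown. Restarting the JVM and running it again repeats the same slow early batches, which is the restart cost the last bullet refers to.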
Azul Platform Prime: An Alternative JVM
■ Based on OpenJDK source code
■ Passes all Java SE TCK/JCK tests
● Drop-in replacement for other JVMs
● No application code changes, no recompilation
■ HotSpot collectors replaced with C4
■ C2 JIT compiler replaced with Falcon
■ ReadyNow! warmup elimination technology
Azul Continuously Concurrent Compacting Collector (C4)
C4 Basics
■ Generational (young and old)
● Uses the same collector for both
● For efficiency rather than pause containment
■ All phases are parallel
■ No STW compacting fallback
● Heap scales from 512 MB to 12 TB (with no change to GC latency)
■ Algorithm is mark, relocate, remap
■ Only supported on Linux
● Sophisticated OS memory management interaction
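The mark, relocate, remap sequence can be pictured with a toy heap. The sketch below is a sequential, stop-the-world model written only to show what each phase does to the data; it is not how C4 is implemented, since C4 runs these phases concurrently with the application. The class `ToyHeap` and its array-of-slots layout are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

final class ToyHeap {
    int[][] refs;        // refs[i] = slot numbers referenced by the object in slot i
    final int[] roots;   // slot numbers held by the root set

    ToyHeap(int[][] refs, int[] roots) {
        this.refs = refs;
        this.roots = roots;
    }

    // Mark: find every slot reachable from the roots.
    boolean[] mark() {
        boolean[] live = new boolean[refs.length];
        Deque<Integer> pending = new ArrayDeque<>();
        for (int r : roots) pending.push(r);
        while (!pending.isEmpty()) {
            int slot = pending.pop();
            if (live[slot]) continue;
            live[slot] = true;
            for (int target : refs[slot]) pending.push(target);
        }
        return live;
    }

    // Relocate: copy live objects into a compact heap and
    // return the forwarding table (old slot -> new slot, -1 if dead).
    int[] relocate(boolean[] live) {
        int[] forward = new int[refs.length];
        Arrays.fill(forward, -1);
        int next = 0;
        for (int slot = 0; slot < refs.length; slot++) {
            if (live[slot]) forward[slot] = next++;
        }
        int[][] compacted = new int[next][];
        for (int slot = 0; slot < refs.length; slot++) {
            if (live[slot]) compacted[forward[slot]] = refs[slot]; // still old slot numbers
        }
        refs = compacted;
        return forward;
    }

    // Remap: rewrite every stored reference and root to the new slot numbers.
    void remap(int[] forward) {
        for (int[] objRefs : refs) {
            for (int i = 0; i < objRefs.length; i++) objRefs[i] = forward[objRefs[i]];
        }
        for (int i = 0; i < roots.length; i++) roots[i] = forward[roots[i]];
    }

    public static void main(String[] args) {
        // Slots 1 and 3 are unreachable from root slot 0.
        ToyHeap heap = new ToyHeap(new int[][] {{2}, {0}, {4}, {}, {0}}, new int[] {0});
        int[] forward = heap.relocate(heap.mark());
        heap.remap(forward);
        System.out.println(Arrays.deepToString(heap.refs)); // prints [[1], [2], [0]]
    }
}
```

Between relocate and remap the compacted objects still hold old slot numbers, which is the window in which a concurrent collector needs a barrier to keep application threads from seeing stale references.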
Loaded Value Barrier
■ Read barrier
● Tests all object references as they are loaded
■ Enforces two invariants
● Reference is marked through
● Reference points to correct object position
■ Minimal performance overhead
● Test and jump (2 instructions)
● x86 architecture reduces this to one micro-op
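The two invariants above can be modelled in a few lines. This is a conceptual sketch only: the real barrier is a test on the loaded reference emitted by the JIT, whereas this model uses ordinary Java objects and dereferences the target to detect relocation. The names `LvbSketch`, `Ref`, `Obj` and `forwardedTo`, and the step that writes the corrected reference back into the field, are assumptions of this sketch.

```java
final class LvbSketch {
    static final class Obj {
        final String name;
        Obj forwardedTo; // non-null once the collector has relocated this object
        Obj(String name) { this.name = name; }
    }

    // A reference as stored in a heap field: the target plus a "marked through" flag.
    static final class Ref {
        Obj target;
        boolean markedThrough;
        Ref(Obj target) { this.target = target; }
    }

    static int slowPathHits = 0;

    // Barrier applied every time a reference is loaded from a field.
    static Obj load(Ref field) {
        if (field.markedThrough && field.target.forwardedTo == null) {
            return field.target; // fast path: both invariants already hold
        }
        slowPathHits++;
        // Invariant 2: the reference must point to the object's current position.
        while (field.target.forwardedTo != null) {
            field.target = field.target.forwardedTo;
        }
        // Invariant 1: the reference must be marked through.
        field.markedThrough = true;
        return field.target; // the field itself now holds the corrected reference
    }

    public static void main(String[] args) {
        Obj a = new Obj("A");
        Obj aPrime = new Obj("A'");
        Ref field = new Ref(a);
        a.forwardedTo = aPrime; // simulate relocation of A to A'

        System.out.println(load(field).name + " slow=" + slowPathHits); // prints A' slow=1
        System.out.println(load(field).name + " slow=" + slowPathHits); // prints A' slow=1
    }
}
```

Because the corrected reference is stored back into the field, the second load of the same field takes the fast path, which is why the slow-path counter stays at 1.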
Concurrent Mark Phase
[Diagram: root set, GC threads and app threads]
Relocation Phase
[Diagram: compaction moves objects A, B, C, D, E to A’, B’, C’, D’, E’, recording A -> A’, B -> B’, C -> C’, D -> D’, E -> E’]
Remapping Phase
[Diagram: app threads and GC threads, with the mappings A -> A’, B -> B’, C -> C’, D -> D’, E -> E’]
Measuring Platform Performance
■ jHiccup
■ Spends most of its time asleep
● Minimal effect on performance
● Wakes every 1 ms
● Records the difference between when it expected to wake and when it actually woke
● Measured effect is what would be experienced by your application
■ Generates histogram log files
● These can be graphed for easy evaluation
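The measurement idea above fits in a short loop: sleep for a fixed interval, then record how much later than expected the thread actually woke. The sketch below is not jHiccup itself; it stores the delays in a plain array rather than producing histogram log files, and the names `HiccupMeter`, `sample` and `max` are hypothetical.

```java
final class HiccupMeter {
    // Sleeps `samples` times for `intervalMillis` and records, in microseconds,
    // how much longer than requested each sleep took.
    static long[] sample(int samples, long intervalMillis) {
        long[] lateMicros = new long[samples];
        long intervalNanos = intervalMillis * 1_000_000L;
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop sampling
                break;
            }
            long elapsed = System.nanoTime() - start;
            lateMicros[i] = Math.max(0L, elapsed - intervalNanos) / 1_000L;
        }
        return lateMicros;
    }

    // Largest recorded delay, or 0 for an empty array.
    static long max(long[] values) {
        long worst = 0;
        for (long v : values) worst = Math.max(worst, v);
        return worst;
    }

    public static void main(String[] args) {
        long[] late = sample(1_000, 1); // wake every 1 ms, as on the slide
        System.out.println("worst hiccup (us): " + max(late));
    }
}
```

Because the thread does nothing except sleep, any delay it records comes from the platform (JVM pauses, scheduler, hardware) and not from application work, so the same delay would be felt by any application thread running at that moment.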
Eliminating ElasticSearch Latency
[Charts on two slides: HotSpot compared with Azul Prime, 128 GB heap]
Azul Falcon JIT Compiler
Advancing Adaptive Compilation
■ Replacement for C2 JIT compiler
■ Azul Falcon compiler
● Based on latest compiler research
● LLVM project
■ Better performance
● Better intrinsics
● More inlining
● Fewer compiler excludes
Vector Code Example
■ Conditional array cell addition loop
● Hard for a compiler to recognise as a candidate for vector instructions
// Adds b[i] to a[i] only where b[i] is even.
private void addArraysIfEven(int[] a, int[] b) {
    if (a.length != b.length) {
        throw new RuntimeException("length mismatch");
    }
    for (int i = 0; i < a.length; i++) {
        if ((b[i] & 0x1) == 0) {  // low bit clear means b[i] is even
            a[i] += b[i];
        }
    }
}
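A runnable version of the loop, with a small input, shows what it computes. The second method is a branch-free rewrite that replaces the per-element jump with a mask, which is one way to picture how a conditional loop can be turned into straight-line vector work. It is an illustration only and makes no claim about the code Falcon actually generates; the class name `VectorExample` and the method `addArraysIfEvenMasked` are hypothetical.

```java
import java.util.Arrays;

final class VectorExample {
    // The loop from the slide, made static so it can be called from main.
    static void addArraysIfEven(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new RuntimeException("length mismatch");
        }
        for (int i = 0; i < a.length; i++) {
            if ((b[i] & 0x1) == 0) {
                a[i] += b[i];
            }
        }
    }

    // Same result without a branch in the loop body:
    // mask is all ones when b[i] is even and zero when it is odd.
    static void addArraysIfEvenMasked(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new RuntimeException("length mismatch");
        }
        for (int i = 0; i < a.length; i++) {
            int mask = -((~b[i]) & 0x1);
            a[i] += b[i] & mask;
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {10, 11, 12, 13};
        addArraysIfEven(a, b);
        System.out.println(Arrays.toString(a)); // prints [11, 2, 15, 4]

        int[] c = {1, 2, 3, 4};
        addArraysIfEvenMasked(c, b);
        System.out.println(Arrays.toString(c)); // prints [11, 2, 15, 4]
    }
}
```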
Traditional JVM JIT
Per element jumps
2 elements per iteration
Falcon JIT
Using AVX2 vector instructions
32 elements per iteration
Broadwell E5-2690-v4
Recent Customer Success Story
■ Leading cloud-based IT security company
● Cloud security, compliance and other services
■ Big Kafka user
● 2.5 billion messages across Kafka clusters daily
● Initially approached us about their Cassandra clusters and eliminating latency
■ Kafka improvements
● 20% performance gain, out-of-the-box, with no tuning
● Falcon improved code generation
● Resulted in a 15% saving in cloud hardware costs
● Platform Prime was effectively cheaper than free!
ReadyNow! Warmup Elimination Technology
■ Save JVM JIT profiling information
● Classes loaded
● Classes initialised
● Instruction profiling data
● Speculative optimisation failure data
■ Data can be gathered over much longer period
● JVM/JIT profiles quickly
● Significant reduction in deoptimisations
■ Able to load, initialise and compile most code before main()
Impact on Latency
[Charts: latency before and after]
Compile Stashing Effect
[Charts: performance over time, without and with compile stashing]
Up to 80% reduction in compile time and 60% reduction in CPU load
Summary
Improving Java Performance
■ Collect and re-use profiles to reduce warm-up time
■ Use alternative JIT compilation strategies
■ Eliminate GC STW pauses through use of a read barrier
■ Azul is working to deliver better Java performance
Simon Ritter
sritter@azul.com
@speakjava
