PERFORMANCE AND
PREDICTABILITY
Richard Warburton
@richardwarburto
insightfullogic.com
Why care about low level rubbish?
Branch Prediction
Memory Access
Storage
Conclusions
Performance Discussion
Product Solutions
“Just use our library/tool/framework, and everything is web-scale!”
Architecture Advocacy
“Always design your software like this.”
Methodology & Fundamentals
“Here are some principles and knowledge, use your brain”
Do you care?
DON'T look at access patterns first
Many problems are not access-pattern related
Networking
Database or External Service
Minimising I/O
Garbage Collection
Insufficient Parallelism
Harvest the low hanging fruit first
Low Hanging is a cost-benefit analysis
So when does it matter?
Informed Design & Architecture *
* this is not a call for premature optimisation
That 10% that actually matters
Performance of Predictable Accesses
Why care about low level rubbish?
Branch Prediction
Memory Access
Storage
Conclusions
What 4 things do CPUs actually do?
Fetch, Decode, Execute, Writeback
Pipelining speeds this up
Pipelined
What about branches?
public static int simple(int x, int y, int z) {
    int ret;
    if (x > 5) {
        ret = y + z;
    } else {
        ret = y;
    }
    return ret;
}
Branches cause stalls, stalls kill performance
Super-pipelined & Superscalar
Can we eliminate branches?
Naive Compilation
if (x > 5) {
    ret = z;
} else {
    ret = y;
}

    cmp x, 5
    jle L1          ; x <= 5: jump to the else arm
    mov z, ret
    jmp L2          ; skip the else arm
L1: mov y, ret
L2: ...
Conditional Branches
if (x > 5) {
    ret = z;
} else {
    ret = y;
}

    cmp x, 5
    mov z, ret
    cmovle y, ret   ; overwrite with y only when x <= 5, no jump needed
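The same idea can be sketched at the Java level. This is only an illustration under assumptions: the class `BranchlessSelect` and its method names are made up for this example, and in practice the JIT compiler decides whether a plain `if` or ternary becomes a jump or a conditional move, so measure before hand-writing code like this.

```java
// Hypothetical illustration; the class and method names are not from the talk.
final class BranchlessSelect {

    // Straightforward version: the JIT may compile this to a branch or to a cmov.
    static int selectBranchy(int x, int y, int z) {
        return (x > 5) ? z : y;
    }

    // Arithmetic version with no conditional jump in the source.
    static int selectBranchless(int x, int y, int z) {
        // (5 - x) is negative exactly when x > 5. Computing it as a long
        // avoids int overflow; the arithmetic shift then yields -1 or 0.
        int mask = (int) ((5L - x) >> 63);
        return (z & mask) | (y & ~mask);
    }

    public static void main(String[] args) {
        System.out.println(selectBranchless(6, 10, 20)); // prints 20
        System.out.println(selectBranchless(5, 10, 20)); // prints 10
    }
}
```

The mask trick trades a possible misprediction for a few extra arithmetic instructions, which is usually only worthwhile when the condition is hard to predict.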
Strategy: predict branches and speculatively execute
Static Prediction
A forward branch defaults to not taken
A backward branch defaults to taken
Static Hints (Pentium 4 or later)
__emit 0x3E defaults to taken
__emit 0x2E defaults to not taken
don’t use them, flip the branch
Dynamic prediction: record history and predict future
Branch Target Buffer (BTB)
a log of the history of each branch
also stores the address
it's finite!
Local
record per conditional branch histories
Global
record shared history of conditional jumps
Loop
specialised predictor when there’s a loop (jumping in a cycle n times)
Function
specialised buffer for predicted nearby function returns
Two level Adaptive Predictor
accounts for patterns of up to 3 if statements
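To make "record history and predict future" concrete, here is a small simulation of a textbook 2-bit saturating counter. This is an assumption for illustration: it is the classic classroom scheme, not the predictor of any specific CPU, and the class name `TwoBitPredictor` is invented here.

```java
import java.util.Arrays;

// Textbook 2-bit saturating counter; not the predictor of any specific CPU.
final class TwoBitPredictor {
    // States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    private int state = 1;

    boolean predict() {
        return state >= 2;
    }

    // Move one step towards the observed outcome, saturating at 0 and 3.
    void update(boolean taken) {
        if (taken) {
            if (state < 3) state++;
        } else {
            if (state > 0) state--;
        }
    }

    // Counts how many outcomes in the sequence are predicted wrongly.
    static int mispredictions(boolean[] outcomes) {
        TwoBitPredictor predictor = new TwoBitPredictor();
        int misses = 0;
        for (boolean taken : outcomes) {
            if (predictor.predict() != taken) misses++;
            predictor.update(taken);
        }
        return misses;
    }

    public static void main(String[] args) {
        boolean[] loopLike = new boolean[100];
        Arrays.fill(loopLike, true);
        loopLike[99] = false; // the final iteration exits the loop

        boolean[] alternating = new boolean[100];
        for (int i = 0; i < alternating.length; i += 2) alternating[i] = true;

        System.out.println("loop-like:   " + mispredictions(loopLike));    // 2
        System.out.println("alternating: " + mispredictions(alternating)); // 100
    }
}
```

The loop-like branch is mispredicted only on its first and last iteration, while this single counter gets the strictly alternating branch wrong every time. Predictors that record a pattern of recent outcomes, such as the two-level adaptive scheme, exist to handle that second case.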
Measurement and Monitoring
Use Performance Event Counters (Model Specific Registers)
Can be configured to store branch prediction information
Profilers & Tooling: perf (Linux), VTune, AMD CodeAnalyst, Visual Studio, Oracle Performance Studio
Approach: Loop Unrolling
for (int x = 0; x < 100; x += 5)
{
    foo(x);
    foo(x+1);
    foo(x+2);
    foo(x+3);
    foo(x+4);
}
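The loop above works because 100 is a multiple of 5. A sketch of the general case, assuming an arbitrary array length, needs a remainder loop. `UnrolledSum` and its methods are illustrative names, and the JIT may already unroll simple loops like this on its own.

```java
// Illustrative sketch of manual unrolling by 4 with a remainder loop.
final class UnrolledSum {

    // Reference version: one loop-condition check per element.
    static long sumSimple(int[] a) {
        long total = 0;
        for (int i = 0; i < a.length; i++) {
            total += a[i];
        }
        return total;
    }

    // Unrolled version: one loop-condition check per four elements.
    static long sumUnrolled(int[] a) {
        long total = 0;
        int i = 0;
        int limit = a.length - (a.length % 4); // largest multiple of 4 <= length
        for (; i < limit; i += 4) {
            total += a[i];
            total += a[i + 1];
            total += a[i + 2];
            total += a[i + 3];
        }
        // Handle the 0 to 3 elements left over.
        for (; i < a.length; i++) {
            total += a[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(sumUnrolled(data)); // prints 55
    }
}
```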
Approach: minimise branches in loops
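One way to minimise branches in a loop is to hoist a decision that does not change between iterations. The sketch below is an assumed example rather than code from the talk: `HoistedBranch`, `negate` and both method names are hypothetical. An optimising compiler may already perform this transformation.

```java
// Hypothetical example: moving a loop-invariant branch out of the loop body.
final class HoistedBranch {

    // The condition is re-evaluated on every iteration.
    static void applyInLoop(int[] data, boolean negate) {
        for (int i = 0; i < data.length; i++) {
            if (negate) {
                data[i] = -data[i];
            } else {
                data[i] = data[i] * 2;
            }
        }
    }

    // Same behaviour: the decision is made once, each loop body is branch-free.
    static void applyHoisted(int[] data, boolean negate) {
        if (negate) {
            for (int i = 0; i < data.length; i++) {
                data[i] = -data[i];
            }
        } else {
            for (int i = 0; i < data.length; i++) {
                data[i] = data[i] * 2;
            }
        }
    }
}
```

The cost is duplicated loop code, so this trades instruction-cache footprint for fewer branches.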
Approach: Separation of Concerns
Little branching on a latency critical path or thread
Heavily branchy administrative code on a different path
Helps compensate for in-flight branch prediction not tracking many targets
Only applicable with low context switching rates
Also easier to understand the code
One more thing ...
Division/Modulus
index = (index + 1) % SIZE;
vs
index = index + 1;
if (index == SIZE) {
    index = 0;
}
/ or % can cost up to 92 cycles on Core i*
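A third option, not on the slide, avoids both the division and the branch when SIZE is a power of two: mask the incremented index. The sketch below compares all three. `RingIndex` and the value of `SIZE` are assumptions chosen for the example.

```java
// Three ways to advance a ring-buffer index; SIZE is assumed to be a power of two.
final class RingIndex {
    static final int SIZE = 8;

    // Uses the modulus operator.
    static int nextModulo(int index) {
        return (index + 1) % SIZE;
    }

    // Replaces the modulus with a compare and reset, as on the slide.
    static int nextCompare(int index) {
        index = index + 1;
        if (index == SIZE) {
            index = 0;
        }
        return index;
    }

    // Alternative: bit mask, valid only when SIZE is a power of two.
    static int nextMask(int index) {
        return (index + 1) & (SIZE - 1);
    }

    public static void main(String[] args) {
        int index = 0;
        for (int step = 0; step < 10; step++) {
            index = nextMask(index);
            System.out.print(index + " "); // prints 1 2 3 4 5 6 7 0 1 2
        }
        System.out.println();
    }
}
```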
Summary
CPUs are Super-pipelined and Superscalar
Branches cause stalls
Branches can be removed, minimised or simplified
Frequently get easier to read code as well
Why care about low level rubbish?
Branch Prediction
Memory Access
Storage
Conclusions
The Problem: the CPU is very fast, main memory is relatively slow
The Solution: CPU Cache
Core Demands Data, looks at its cache
If present (a "hit") then data returned to register
If absent (a "miss") then data looked up from memory and stored in the cache
Fast memory is expensive, a small amount is affordable
Multilevel Cache: Intel Sandybridge
[Diagram: each physical core (0 to N, 2 logical cores each via HT) has its own Level 1 data cache, Level 1 instruction cache and Level 2 cache; all cores share the Level 3 cache]
How bad is a miss?
Location       Latency in clock cycles
Register       0
L1 Cache       3
L2 Cache       9
L3 Cache       21
Main Memory    150-400
Cache Lines
Data transferred in cache lines
Fixed size block of memory
Usually 64 bytes in current x86 CPUs
Between 32 and 256 bytes
Purely hardware consideration
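The effect of line-sized transfers can be sketched with a toy direct-mapped cache model. Everything here is a simplifying assumption: the class `CacheLineSim`, the 512-line cache size, an array that starts on a line boundary, and the absence of prefetching. Only the 64-byte line size comes from the slide.

```java
import java.util.Arrays;

// Toy direct-mapped cache model; sizes other than the 64-byte line are hypothetical.
final class CacheLineSim {
    static final int LINE_BYTES = 64;  // usual x86 line size
    static final int NUM_LINES = 512;  // hypothetical 32 KB cache

    // Simulates reading int elements 0, stride, 2*stride, ... below length
    // and returns how many of those reads miss the cache.
    static int countMisses(int length, int stride) {
        long[] cachedLine = new long[NUM_LINES];
        Arrays.fill(cachedLine, -1L); // -1 marks an empty slot
        int misses = 0;
        for (int i = 0; i < length; i += stride) {
            long address = (long) i * Integer.BYTES;
            long lineNumber = address / LINE_BYTES;
            int slot = (int) (lineNumber % NUM_LINES);
            if (cachedLine[slot] != lineNumber) {
                misses++;
                cachedLine[slot] = lineNumber; // load the whole line
            }
        }
        return misses;
    }

    public static void main(String[] args) {
        // 1024 sequential reads touch 64 lines: 64 misses, 960 hits.
        System.out.println("stride 1:  " + countMisses(1024, 1));
        // 64 reads, 16 ints (64 bytes) apart: every read is a miss.
        System.out.println("stride 16: " + countMisses(1024, 16));
    }
}
```

Both runs miss 64 times, but the sequential run gets 1024 useful reads out of those misses while the strided run gets only 64.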
Prefetch = Eagerly load data
Adjacent Cache Line Prefetch
Data Cache Unit (Streaming) Prefetch
Problem: CPU Prediction isn't perfect
Solution: Arrange Data so accesses are predictable
Prefetching
Sequential Locality
Referring to data that is arranged linearly in memory
Spatial Locality
Referring to data that is close together in memory
Temporal Locality
Repeatedly referring to same data in a short time span
General Principles
Use smaller data types (-XX:+UseCompressedOops)
Avoid 'big holes' in your data
Make accesses as linear as possible
Primitive Arrays
// Sequential Access = Predictable
for (int i = 0; i < someArray.length; i++)
    someArray[i]++;
Primitive Arrays - Skipping Elements
// Holes Hurt
for (int i = 0; i < someArray.length; i += SKIP)
    someArray[i]++;
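A rough way to observe the cost of holes is to time the loop for several values of SKIP and divide by the number of elements touched. The harness below is a minimal sketch with assumed names (`StrideTiming`, `touch`) and an assumed 16 MB array. Timings from `System.nanoTime()` around a single call are noisy, so treat the numbers as indicative only and use a benchmark harness such as JMH for real measurements.

```java
// Rough timing sketch; results vary by machine and JVM, so no output is shown.
final class StrideTiming {

    // Increments every skip-th element and returns how many were touched.
    static int touch(int[] array, int skip) {
        int touched = 0;
        for (int i = 0; i < array.length; i += skip) {
            array[i]++;
            touched++;
        }
        return touched;
    }

    public static void main(String[] args) {
        int[] data = new int[4 * 1024 * 1024]; // 16 MB, larger than typical caches
        for (int skip = 1; skip <= 64; skip *= 2) {
            touch(data, skip); // warm-up pass so the JIT has compiled the loop
            long start = System.nanoTime();
            int touched = touch(data, skip);
            long elapsed = System.nanoTime() - start;
            System.out.printf("skip=%d: %.2f ns per element touched%n",
                    skip, (double) elapsed / touched);
        }
    }
}
```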
Multidimensional Arrays
Multidimensional Arrays are really Arrays of Arrays in Java. (Unlike C)
Some people realign their accesses:
for (int col = 0; col < COLS; col++) {
    for (int row = 0; row < ROWS; row++) {
        array[ROWS * col + row]++;
    }
}
Bad Access Alignment
Strides the wrong way, bad locality:
    array[COLS * row + col]++;
Strides the right way, good locality:
    array[ROWS * col + row]++;
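The difference is easiest to see by listing the indices each loop order visits. In this sketch the layout is fixed to `ROWS * col + row` and only the loop nesting changes, which is an equivalent way of showing the same stride problem. `FlatMatrix`, `visitOrder` and the 4 x 3 size are assumptions for the example.

```java
import java.util.Arrays;

// Shows the index sequence produced by the two loop orders over one flat array.
final class FlatMatrix {
    static final int ROWS = 4; // hypothetical sizes, kept small so the output is readable
    static final int COLS = 3;

    // Returns the flat indices in the order the nested loops visit them.
    static int[] visitOrder(boolean rowInner) {
        int[] order = new int[ROWS * COLS];
        int n = 0;
        if (rowInner) {
            // Inner loop walks rows: consecutive indices, stride 1.
            for (int col = 0; col < COLS; col++) {
                for (int row = 0; row < ROWS; row++) {
                    order[n++] = ROWS * col + row;
                }
            }
        } else {
            // Inner loop walks columns: jumps of ROWS elements each step.
            for (int row = 0; row < ROWS; row++) {
                for (int col = 0; col < COLS; col++) {
                    order[n++] = ROWS * col + row;
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
        System.out.println(Arrays.toString(visitOrder(true)));
        // [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
        System.out.println(Arrays.toString(visitOrder(false)));
    }
}
```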
Full Random Access
L1D - 5 clocks
L2 - 37 clocks
Memory - 280 clocks
Sequential Access
L1D - 5 clocks
L2 - 14 clocks
Memory - 28 clocks
Primitive Collections (GNU Trove, GS-Coll)
Arrays > Linked Lists
Hashtable > Search Tree
Avoid Code bloating (Loop Unrolling)
Custom Data Structures
Judy Arrays
an associative array/map
kD-Trees
generalised Binary Space Partitioning
Z-Order Curve
multidimensional data in one dimension
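As a sketch of the Z-order idea, the code below interleaves the bits of two 16-bit coordinates into one Morton code, so points that are close in 2D tend to land close together in the one-dimensional ordering. `ZOrder`, `spread` and `encode` are names chosen for this example.

```java
// Z-order (Morton) encoding of two 16-bit coordinates into one int.
final class ZOrder {

    // Spreads the low 16 bits of v so that bit i moves to bit 2*i.
    static int spread(int v) {
        v &= 0xFFFF;
        v = (v | (v << 8)) & 0x00FF00FF;
        v = (v | (v << 4)) & 0x0F0F0F0F;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    }

    // x occupies the even bits, y the odd bits. The result can be negative
    // when y uses its top bit, so compare codes as unsigned values.
    static int encode(int x, int y) {
        return spread(x) | (spread(y) << 1);
    }

    public static void main(String[] args) {
        // Walks a 2x2 block: prints 0 1 2 3
        for (int y = 0; y < 2; y++) {
            for (int x = 0; x < 2; x++) {
                System.out.print(encode(x, y) + " ");
            }
        }
        System.out.println();
    }
}
```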
Data Layout Principles
Data Locality vs Java Heap Layout
class Foo {
    Integer count;
    Bar bar;
    Baz baz;
}
// No alignment guarantees
for (Foo foo : foos) {
    foo.count = 5;
    foo.bar.visit();
}
[Diagram: heap slots 0, 1, 2, 3, ... with Foo, bar, baz and count stored as separate objects]
Data Locality vs Java Heap Layout
Serious Java Weakness
Location of objects in memory hard to guarantee.
GC also interferes
Copying
Compaction
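One common workaround, offered here as a sketch rather than something the slides prescribe, is to keep hot fields in primitive arrays so the data is contiguous regardless of where individual objects end up. `CountColumns`, `FooCounts` and their methods are hypothetical names; the `Foo` class is cut down to the one field that matters.

```java
// Hypothetical column-style layout: one primitive array instead of one object per record.
final class CountColumns {

    // Object-per-record layout: every count is a separate boxed Integer on the heap.
    static final class Foo {
        Integer count;
    }

    // Column layout: all counts sit next to each other in a single int[].
    static final class FooCounts {
        final int[] counts;

        FooCounts(int size) {
            counts = new int[size];
        }

        // Sequential writes over contiguous memory.
        void setAll(int value) {
            for (int i = 0; i < counts.length; i++) {
                counts[i] = value;
            }
        }

        long sum() {
            long total = 0;
            for (int count : counts) {
                total += count;
            }
            return total;
        }
    }

    public static void main(String[] args) {
        FooCounts foos = new FooCounts(4);
        foos.setAll(5);
        System.out.println(foos.sum()); // prints 20
    }
}
```

The trade-off is less object-oriented code: a record's identity becomes an array index rather than a reference.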
Summary
Cache misses cause stalls, which kill performance
Measurable via Performance Event Counters
Common Techniques for optimizing code
Why care about low level rubbish?
Branch Prediction
Memory Access
Storage
Conclusions
Hard Disks
Commonly used persistent storage
Spinning Rust, with a head to read/write
Constant Angular Velocity - rotations per minute stays constant
A simple model
Zone Constant Angular Velocity (ZCAV) /
Zoned Bit Recording (ZBR)
Operation Time =
Time to process the command
+ Time to seek
+ Rotational latency
+ Sequential transfer time
ZBR implies faster transfer at the outer edge than near the centre
Seeking vs Sequential reads
Seek and rotation times dominate for small amounts of data
Random writes of 4 KB can be 300 times slower than theoretical max data transfer
Consider the impact of context switching between applications or threads
Fragmentation causes unnecessary seeks
Sector alignment is irrelevant at the disk level
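The simple operation-time model can be turned into a small worked calculation. All drive parameters below are hypothetical round numbers picked for illustration (8 ms seek, 7200 rpm, 100 MB/s sequential transfer, negligible command time), and `DiskModel` is an invented name. With these inputs a random 4 KB operation comes out roughly 300 times slower than the pure transfer time.

```java
// Worked example of: operation time = command + seek + rotational latency + transfer.
final class DiskModel {

    // Time to move the given number of bytes at the sequential rate, in milliseconds.
    static double transferMillis(long bytes, double megabytesPerSecond) {
        return bytes / (megabytesPerSecond * 1_000_000.0) * 1_000.0;
    }

    // Average rotational latency is half a revolution.
    static double rotationalMillis(int rpm) {
        return 0.5 * 60_000.0 / rpm;
    }

    static double operationMillis(double commandMs, double seekMs, int rpm,
                                  long bytes, double megabytesPerSecond) {
        return commandMs + seekMs + rotationalMillis(rpm)
                + transferMillis(bytes, megabytesPerSecond);
    }

    public static void main(String[] args) {
        // Hypothetical drive: 8 ms seek, 7200 rpm, 100 MB/s, command time ignored.
        long bytes = 4096;
        double random = operationMillis(0.0, 8.0, 7200, bytes, 100.0);
        double sequential = transferMillis(bytes, 100.0);
        System.out.printf("random 4 KB op: %.2f ms%n", random);
        System.out.printf("transfer only:  %.4f ms%n", sequential);
        System.out.printf("slowdown:       %.0fx%n", random / sequential);
    }
}
```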
Summary
Simple, sequential access patterns win
Fragmentation is your enemy
Alignment can be relevant in RAID scenarios
SSDs are a different game
Why care about low level rubbish?
Branch Prediction
Memory Access
Storage
Conclusions
Software runs on Hardware, play nice
Predictability is a running theme
Huge theoretical speedups
YMMV: measured incremental improvements >>> going nuts
More information
Articles
http://www.akkadia.org/drepper/cpumemory.pdf
https://gmplib.org/~tege/x86-timing.pdf
http://psy-lob-saw.blogspot.co.uk/
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
http://mechanical-sympathy.blogspot.co.uk
http://www.agner.org/optimize/microarchitecture.pdf
Mailing Lists:
https://groups.google.com/forum/#!forum/mechanical-sympathy
https://groups.google.com/a/jclarity.com/forum/#!forum/friends
http://gee.cs.oswego.edu/dl/concurrency-interest/
Q & A
@richardwarburto
insightfullogic.com
tinyurl.com/java8lambdas