Millions quotes per second in pure java

Millions Quotes Per Second.
A story of pure Java
market data vendor

© 2013, Roman Elizarov, Devexperts

Market Data Rates
[Chart: market data rates in messages per second, on a scale from 0 to 10,000,000, for two series: "US Equities, Indexes and Futures" and "OPRA"]

Market Data Vendor

• Process data coming from exchange data feeds
- Parse
- Normalize
• Distribute data to customers
- Gather into a single feed
- Store and retrieve (for onDemand historical requests)
- Serialize and transfer
- Scatter to multiple consumers based on actual subscription

dxFeed High Level Picture

• Chicago ticker plant
- CME, CBOT, NYMEX, COMEX, ICE Futures U.S., CBOE, TSX, TSXV, MX
• New York ticker plant
- NYSE, AMEX, NASDAQ, ISE, OPRA, FINRA, PinkSheets
• 10Gbit resilient redundant connectivity infrastructure
• Customer connection point
- Direct cross-connect
- SFTI
- TNS
- SAVVIS
- BT Radianz
- Internet

A Bit of History

• Devexperts was founded in 2002
- as an Upscale Financial IT company
• QDS project was born in 2003
- to address market data distribution problem
- in a high performance-way (initial design goal was 1M mps)
• dxFeed service was launched in 2008
- to provide our customers with live market data directly from
exchanges, using QDS for distribution
• dxFeed API was created on top of QDS in 2009
- to provide an easier customer-facing API and enable 3rd party
developers to integrate their code with dxFeed

[Word cloud: threads, portability, community, developers, garbage collection, libraries and frameworks, backwards-compatibility, refactoring, type safety, open source, memory model, reflection, productivity, tools, readability, HotSpot JIT, byte-code manipulation, simplicity, the most popular language]

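Java object layout

• String[] that is filled with some strings in Java
- The String[] object holds a header, a size, and the references [0], [1], [2], [3], ...
- Each reference points to a separate String object: header, value, hash, ...
- Each String's value points to a separate char[] object: header, size, then the characters 'T', 'E', 'S', 'T', ...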


Memory layout solution

• Prefer array-based data-structures to linked ones
- Most Java programs get an immediate performance boost by replacing all
mentions of LinkedList with ArrayList
• Use Java arrays or ByteBuffer classes where it matters
- They are guaranteed to be contiguous in memory
- Lay out your data in arrays manually
• That's how QDS core is designed
- All its critical data structures are laid out in int[] and Object[]
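
A minimal sketch of laying records out in a single int[] by hand. The class QuoteTable and its fields are hypothetical illustrations of the idea, not QDS code.

```java
import java.util.Arrays;

/** Hypothetical example: fixed-size quote records packed into one int[]. */
final class QuoteTable {
    // Each record occupies STRIDE consecutive ints: bid, ask, size.
    private static final int BID = 0, ASK = 1, SIZE = 2, STRIDE = 3;

    private int[] data = new int[16 * STRIDE];
    private int count;

    /** Appends a record and returns its index. */
    int add(int bid, int ask, int size) {
        if ((count + 1) * STRIDE > data.length)
            data = Arrays.copyOf(data, data.length * 2); // grow in one contiguous block
        int base = count * STRIDE;
        data[base + BID] = bid;
        data[base + ASK] = ask;
        data[base + SIZE] = size;
        return count++;
    }

    int bid(int index)  { return data[index * STRIDE + BID]; }
    int ask(int index)  { return data[index * STRIDE + ASK]; }
    int size(int index) { return data[index * STRIDE + SIZE]; }
    int count()         { return count; }

    /** Sequential scan over contiguous memory, with no pointer chasing. */
    long totalSize() {
        long sum = 0;
        for (int i = 0; i < count; i++)
            sum += data[i * STRIDE + SIZE];
        return sum;
    }
}
```

Compared with a list of small record objects, this keeps one object header for the whole table and no per-record references for GC to trace.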

byte[] vs ByteBuffer

• byte[] is always heap-based
- Faster for byte-oriented access
• ByteBuffer can be both “heap” and “direct”
- Be especially careful with direct ByteBuffers
- If you don't pool them, you may run out of native memory before Java
GC has a chance to run
- Can be faster for short-, int- or long- oriented access via get/putXXX
methods
• But make sure you use native byte order (BIG_ENDIAN is default)
- Direct ByteBuffers don't need an extra buffer copy when doing
input/output with NIO
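
A short sketch of int-oriented access through a direct ByteBuffer in native byte order. The class and method names are hypothetical; only the java.nio calls are standard.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper showing direct buffers with native byte order. */
final class NativeOrderBuffer {
    /** Allocates one direct buffer; callers are expected to reuse (pool) it. */
    static ByteBuffer allocate(int capacityBytes) {
        // A new buffer is always BIG_ENDIAN until order(...) is called.
        return ByteBuffer.allocateDirect(capacityBytes).order(ByteOrder.nativeOrder());
    }

    /** Writes the values as ints at absolute offsets, then sums them back. */
    static long writeAndSum(ByteBuffer buf, int[] values) {
        for (int i = 0; i < values.length; i++)
            buf.putInt(i * Integer.BYTES, values[i]);
        long sum = 0;
        for (int i = 0; i < values.length; i++)
            sum += buf.getInt(i * Integer.BYTES);
        return sum;
    }
}
```

Native order only makes sense for data that stays on the same machine; a wire format usually fixes its own byte order.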

The cost of later change is too high

Garbage collection

• Makes your code much easier
- to design
- to debug
- to maintain
• GC performs really well when
- Objects are very short-lived
• They are not promoted to old gen
• They are reclaimed by high-throughput scavenge GC
- Objects are very long-lived and are not modified or contain primitives
• Scavenge GC does not waste time scanning them

Object allocation

• Allocation of small objects is fast
- new String() is ~20 bytes on 64bit VM with compressed oops
• not counting char[] object inside of it
- ~4.5ns per allocation (on 2.6GHz i5)
• But becomes slower when you include amortized GC cost
• And can become much slower if you
- have big static memory footprint
- have “medium-lived” objects
- have lots of threads (and thus a lot of GC roots and coordination)
- use references (java.lang.ref) a lot
- mutate your memory a lot, especially references (GC card marking)

Manual memory management

• When you would consider manual memory management in native
code (custom object pools), consider doing the same in Java
• General advice
- Pool large objects
• They are expensive to be allocated and to be collected by GC
- Avoid small objects
• Especially “medium-lived” ones
• Lay them out in arrays if you need to store them

Object allocation action plan (1)

• Watch the percentage of time your system spends doing GC
- -verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
- “jconsole” and “jvisualvm” tools show this information
- It is available programmatically via GarbageCollectorMXBean
• At Devexperts we collect it and report (push) in real-time via
MARS (Monitoring and Reporting System) using a dedicated
JVMSelfMonitoring plugin
• Our support team have alerts configured on high GC % in our
systems
• Act when it becomes too big

Object allocation action plan (2)

• Tune GC to reduce overhead without code changes
• Identify the places where most allocations take place and optimize
them
- Use off-the-shelf Java profilers
- Use Devexperts aprof for a full allocation picture at production speed
http://code.devexperts.com/display/AProf/

Object reuse and sharing

• Pooling small objects is often a bad idea
- Unless you are trying to quickly speed up code that heavily relies on
lots of small objects
- It's better to get rid of small objects altogether
• See boxing in performance-critical code → get rid of it
• But reusing / sharing small objects is great
- Strings are typical candidate for data-processing code
• Common pitfalls (don't do it, unless you fully understand it)
- String.intern
- WeakReference

Actually, by their char arrays

String I/O

• Strings are often duplicated in memory
• Reading any string-denoted data from database, from file, from
network – all produces new strings
• Where performance matters, reuse strings
- For example see StringCache class from
http://docs.dxfeed.com/dxlib/api/com/devexperts/util/StringCache.html
- The key method is get(char[])
• You can reuse char[] where data is read
• And get an instance of String from cache if it is there

Radical object / reference elimination

• Unroll complex objects into arrays
- For example, a collection of strings can be represented in a single
byte[]
• Renumber shared object instances
- Represent string reference as int
- That's what QDS core does for efficient String manipulation
• Faster to compare
• Faster to hash
• Avoids slower “modify reference” operations (marks GC cards)
- But requires hand-crafted memory management
• QDS does reference counting, but custom GC is also feasible

Hardcore optimization

• Use sun.misc.Unsafe when everything else fails
- It gives you full native speed
- But no range checks nor type-safety
• You are on your own!
- Good fit for integration with native data structures when needed
• QDS core uses it in a few places
- Mainly to provide wait-free execution guarantees with an appropriate
synchronization for array-based data structures
- But there is a fallback code for cases when sun.misc.Unsafe is not
available

Even more hardcore – hand-written SMT

• If you have to use linked data structures
- Consider traversing multiple linked lists simultaneously in the same
thread
- Akin to hardware SMT, but in software
- The code becomes much more complicated
- But the performance can considerably increase

* Not a Java-specific optimization, but fun to mention here
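
A minimal sketch of the interleaved traversal idea, assuming a simple hypothetical Node type. Whether it actually helps depends on the hardware, the JIT, and how the lists sit in memory, so treat it as something to measure rather than a guaranteed win.

```java
/** Hypothetical example: walk several linked lists in lockstep in one thread. */
final class InterleavedTraversal {
    static final class Node {
        final int value;
        final Node next;
        Node(int value, Node next) { this.value = value; this.next = next; }
    }

    /** Builds a linked list from the given values; returns null for no values. */
    static Node listOf(int... values) {
        Node head = null;
        for (int i = values.length - 1; i >= 0; i--)
            head = new Node(values[i], head);
        return head;
    }

    /** Sums all lists, advancing each list by one node per outer iteration. */
    static long sumInterleaved(Node[] heads) {
        Node[] cur = heads.clone();
        long sum = 0;
        int active = cur.length;
        while (active > 0) {
            active = 0;
            for (int i = 0; i < cur.length; i++) {
                Node n = cur[i];
                if (n != null) {
                    sum += n.value;
                    cur[i] = n.next; // loads for different lists are independent of each other
                    active++;
                }
            }
        }
        return sum;
    }
}
```

The point is that a cache miss on one list does not stall work on the others, because the next-pointer loads of different lists do not depend on each other.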

Threads and scalability

• Share data across the threads to further reduce memory footprint
- But carefully design and implement this sharing
• Learn and love Java Memory Model
- It makes your correctly-synchronized multi-threaded code fully
portable across CPU architectures
• QDS core is a thread-safe data structure with a mix of lock-
free, fine-grained and coarse-grained locking approaches which
makes it vertically scalable

Be careful with threads and locks

• Thread switches introduce a considerable latency (~20us)
• Lock contention forces even more thread switches
• It is not a Java-specific concern, but a common Java-specific problem, since Java makes threads easier for programmers to use (and many do use them)

Contended lock sequence:
1. Enter lock
2. Context switch
3. Try to lock
4. Context switch
5. Exit lock
6. Context switch and enter lock

Data flow for horizontal scalability

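[Diagram: subscription and tick flow through a multiplexor]
• Multiplexor (QDTicker) subscribes upstream to IBM, GE, QQQQ, MSFT, INTC, SPX
• Incoming IBM, GE ticks are scattered by subscription
- One downstream QDTicker subscribes to IBM, GE, QQQQ, MSFT and receives IBM, GE ticks
- The other downstream QDTicker subscribes to GE, INTC, SPX and receives GE ticks
• Symbols shown under the downstream QDTickers: IBM, GE, MSFT, QQQQ for the first; GE, INTC, SPX for the second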

HotSpot Server VM

• Run “java -server” (it is the default on server-class machines)
• Does
- Very deep code inlining
- Loop unrolling
- Optimize virtual and interface calls based on collected profile
- Escape analysis for synchronization and allocation elimination
• Embrace it!
- Don't fear writing your code in a nice object-oriented way
• In most cases, that is
• Do still avoid too much “object orientation” in the most
performance-sensitive places

HotSpot challenges

• It is harder to profile, stress-test, and tune code
- You need to “warm up” the code to get meaningful results
- Small changes in code can lead to big differences that are hard to
explain
- Compilation of less busy code can trigger at any time and cause
unexpected latency spikes
• Don't do micro-tests
- Test the whole system together instead
• Do micro-tests
- To learn which code patterns are better across the board
- Small savings add up

Looking at generated assembly code

• -XX:+UnlockDiagnosticVMOptions
-XX:CompileCommand=print,*<class-name>.<method-name>
-XX:PrintAssemblyOptions=intel
• You will need “hsdis” library added to your JRE/JDK with the actual
disassembler code
- But you have to build it yourself:
http://hg.openjdk.java.net/jdk7/hotspot/hotspot/file/tip/src/share/tools/hsdis/README

Use native profilers

• Java profilers are great tools, but they don't use processor
performance counters and lack the ability to recognize
problems such as memory pressure
- And they don't always produce a clear picture
- All “cpu time” is reported at the nearest “safe point”, not at the actual
code line that consumed CPU
• Use native profilers to figure it out
- Sun Studio Performance Analyzer
- Intel VTune Amplifier
- AMD CodeAnalyst

General (1)

• Classic data structures and algorithms
- Use CPU and memory efficient data structures and algorithms
- Know and love hash tables
• They are the most useful data structure in a typical business
application
• Lock-free data structures will help you to scale vertically
• Every byte counts. Remember about bytes.
- QDS core compactly represents data as 4-byte integers while working
with them in memory
- QDS uses compact byte-level compression on the wire
- Even more compact bit-level compression is used in long-term store

General (2)

• Burst handling
- Process data in batches to amortize batch overhead across messages
- QDS increases batch size under load to decrease overhead
• Architecture
- Use layers
- Lower layers of the architecture should generally be used in more places
and be more optimized
- The outer layer, dxFeed API, is the easiest one to use and understand
and most object-oriented, but less optimized
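
The burst-handling point above can be illustrated with a consumer that drains whatever has accumulated in a queue in one call. BatchingConsumer is a hypothetical sketch built on java.util.concurrent, not the QDS mechanism.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;

/** Hypothetical helper: pull queued messages in batches instead of one by one. */
final class BatchingConsumer {
    /**
     * Clears the reusable batch list and moves up to maxBatch queued messages
     * into it with a single call. Returns the number of messages taken.
     */
    static <T> int drainBatch(BlockingQueue<T> queue, List<T> batch, int maxBatch) {
        batch.clear();
        return queue.drainTo(batch, maxBatch);
    }
}
```

When the producer outpaces the consumer, more messages are waiting at each call, so batches grow under load and the per-batch overhead (locking, wake-ups, flushes) is spread over more messages.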

Architecture layers

JS API

dxFeed API Tools Gateways

QDS Core

Transport Protocol

ZLIB SSL

Sockets NIO Files, etc

QDS API (1)
print quote bid/ask on the screen

QDS API Summary

• Pros
- High-performance design
- Flexible (can be used in various ways)
• QDS Multiplexor is an application on top of QDS API
• As well as all other command-line QDS tools
- Extensible with clear separation of interfaces and implementation
• Cons
- Verbose, lots of code to do simple things
- Error-prone (easy to get wrong and to introduce subtle bugs)
• Everybody needs Quote, Trade, etc with easy-to-use API
- Hence, dxFeed API was born

dxFeed API
print quote bid/ask on the screen

Contact me by email: elizarov at devexperts.com
