Showing posts with label java flight recorder. Show all posts
Showing posts with label java flight recorder. Show all posts

Monday, 18 May 2015

Revisiting the First Rule of Performance Optimisation: Effects of Escape Analysis

In an earlier post I wrote about how to start optimising for performance. If you haven't read the post I would advise you do have a look but the long and short of it is that it's pretty much always worth reducing allocations as a first step to improving performance. But understanding exactly where your program actually allocates is not as easy as you might think...

Whilst reviewing an article on a similar topic it occurred to me that it would be interesting to try and measure exactly how much longer it takes to allocate as opposed to reuse an existing object. We know that allocation is really fast in Java (ignore the side effects of object allocation for now) so let's see how allocation compares to object re-use.

This is the first attempt at the code I wrote to try and measure that change:


The output form the program was as below:

No allocation took 15 with result 100000000
Allocations took 11 with result 100000000
No allocation took 12 with result 100000000
Allocations took 13 with result 100000000
No allocation took 0 with result 100000000
Allocations took 1 with result 100000000

Hardly anything in it - but then I got suspicious....

Was the testAllocations method actually allocating or was escape analysis being applied? Escape analysis allows the JIT to bypass the object allocation and use the stack instead of the heap. See here for more details.

Well one way to find out is to run the program with the command line setting -verbosegc and see if the gc was being invoked. As I suspected there was no gc, which, given the fact I had not set the -Xms and -Xmx flags should not have been possible.

So I wrote another little program forcing the testAllocate method to 'escape' and thereby allocate:


The output for this was very different, see below:

I've run with -verbose:gc  - I've picked out the System.out in red so that the times can be easily identified.

*** No allocation took 43 with result 100000000
[GC (Allocation Failure)  111405K->108693K(142336K), 0.1243020 secs]
[Full GC (Ergonomics)  108693K->69499K(190976K), 0.3276730 secs]
[GC (Allocation Failure)  102779K->102923K(204288K), 0.1275237 secs]
[GC (Allocation Failure)  149515K->149731K(224256K), 0.2172119 secs]
[Full GC (Ergonomics)  149731K->149326K(359936K), 1.1698787 secs]
[GC (Allocation Failure)  215886K->215502K(365056K), 0.2680804 secs]
 **** Allocations took 2321 with result 100000000
*** No allocation took 37 with result 100000000
[Full GC (Ergonomics)  365307K->52521K(314368K), 0.1279458 secs]
[GC (Allocation Failure)  124201K->124497K(314368K), 0.2493061 secs]
[GC (Allocation Failure)  196177K->196401K(419840K), 0.3004662 secs]
 **** Allocations took 733 with result 100000000
[GC (Allocation Failure)  276094K->274825K(429056K), 0.5655052 secs]
[Full GC (Ergonomics)  274825K->1024K(427520K), 0.0042224 secs]
*** No allocation took 583 with result 100000000
[GC (Allocation Failure)  98304K->59382K(527872K), 0.0727687 secs]
[GC (Allocation Failure)  224246K->224582K(534016K), 0.1465164 secs]
 **** Allocations took 263 with result 100000000
*** No allocation took 15 with result 100000000
[GC (Allocation Failure)  389446K->167046K(643584K), 0.0831585 secs]
 **** Allocations took 119 with result 100000000
[GC (Allocation Failure)  366101K->91504K(653312K), 0.0006495 secs]
*** No allocation took 15 with result 100000000
[GC (Allocation Failure)  329072K->290206K(681472K), 0.1650293 secs]
 **** Allocations took 202 with result 100000000

As you can see this time the program is definitely allocating and the time for the allocated test is considerably slower.

Let's see if we can eliminate the need for GC pauses by setting a high -Xms value.

This time the GC is almost eliminated and we get these results:

*** No allocation took 48 with result 100000000
 **** Allocations took 162 with result 100000000
*** No allocation took 44 with result 100000000
 **** Allocations took 172 with result 100000000
*** No allocation took 28 with result 100000000
 **** Allocations took 179 with result 100000000
*** No allocation took 40 with result 100000000
 **** Allocations took 365 with result 100000000
*** No allocation took 73 with result 100000000
[GC (Allocation Failure)  1572864K->188862K(6029312K), 0.2602120 secs]
 **** Allocations took 499 with result 100000000
*** No allocation took 43 with result 100000000
 **** Allocations took 62 with result 100000000
*** No allocation took 17 with result 100000000
 **** Allocations took 228 with result 100000000
*** No allocation took 95 with result 100000000
 **** Allocations took 364 with result 100000000
*** No allocation took 46 with result 100000000
 **** Allocations took 194 with result 100000000
*** No allocation took 14 with result 100000000
[GC (Allocation Failure)  1761726K->199238K(6029312K), 0.2232670 secs]

 **** Allocations took 282 with result 100000000

So even without GC, allocating is considerably slower.  

The next step was to examine what was happening by looking at my programs with a profiler.

I ran the first program, Allocation1, with my favourite profiler, YourKit and started seeing very different behaviour.  The program which we had previously shown not be allocating, was now allocating (I could see this my running with verbose GC below).

This is the output:

[YourKit Java Profiler 2014 build 14114] Log file: /Users/daniel/.yjp/log/Allocation1-11980.log
No allocation took 20 with result 100000000
[GC (Allocation Failure)  33280K->1320K(125952K), 0.0032859 secs]
[GC (Allocation Failure)  34600K->1232K(125952K), 0.0028476 secs]
[GC (Allocation Failure)  34512K->1136K(125952K), 0.0020955 secs]
[GC (Allocation Failure)  34416K->1168K(159232K), 0.0024346 secs]
[GC (Allocation Failure)  67728K->1240K(159232K), 0.0021334 secs]
Allocations took 109 with result 100000000
No allocation took 26 with result 100000000
[GC (Allocation Failure)  67800K->1376K(222208K), 0.0036706 secs]
[GC (Allocation Failure)  134496K->1508K(222208K), 0.0024567 secs]
Allocations took 98 with result 100000000
No allocation took 1 with result 100000000
[GC (Allocation Failure)  134628K->1516K(355840K), 0.0008581 secs]

Allocations took 56 with result 100000000

And in YourKit we see the creation of those objects:


Running with YourKit had interfered with the escape analysis and caused the objects to be allocated!!

The advice I gave in my previous post 'First Rule of Performance Optimisation' was to look to eliminate allocation.  This program (Allocation1) looks (from YourKit's view) like it is allocating and a developer, following my advice, might spend considerable effort reducing that allocation.  The developer would be surprised and disappointed when they found that on rerunning the program (without YourKit) that nothing at all had been achieved.  Of course now we know why that is, allocation were only a side-effect of running the profiler! Without the profiler the allocations were being 'escaped-anaysised' away.

So how are we to know whether allocations we see in the profiler are real or just caused by the profiler interfering with escape analysis.

The answer is to use Flight Recorder. See my previous post on the differences between Flight Recorder and other profilers.

This is the Flight Recording when we run Allocation1.



As you can see no allocations have been recorded.  Contrast this with the Flight Recording we get from running Allocation2.



Conclusions:

  • It is always better not to allocate objects
  • But escape analysis means that calling new does not always lead to an allocation! 
  • In general, profilers interfere with escape analysis so be careful of fixing an allocation that, without the profiler might be escaped away.
  • Using Flight Recorder should not interfere with escape analysis and give you a more accurate reflection as to your real allocation.


Further Reading on the subject: 

I subsequently came across this excellent (as usual) post from Nitsan Wakart
http://psy-lob-saw.blogspot.co.uk/2014/12/the-escape-of-arraylistiterator.html

Sunday, 10 May 2015

More Strange Behaviour in Tight Loops

In a previous article on the subject I pointed out the dangers of having code in tight loops especially their unintended affects on other threads running in the same JVM.

In this post I'll just point out a few interesting things I came across whilst testing the code in a tight loop.

Most Java developers, at some time, will have used the command line tools jps and jstack. jps is a command that let's you easily identify the Java processes running on your machine and their pids. When you have the pid of a Java process you can then use jstack to get a stack trace of that process. 

When I tried running running jstack on a process in a tight loop (see code here) I got this response:

Unable to open socket file: target process not responding or HotSpot VM not loaded The -F option can be used when the target process is not responding

I guess jstack needs the process to reach safepoint before being able to generate the stack trace.

Even more interestingly, when I tried opening the process in Flight Recorder (this is a really good profiler see here for a full write up) you get this dialog box:


Somewhat misleading - I have to say I spent quite a while puzzling over this before I realised that it was caused by trying to profile code in tight loop, and that all I had to do was add breathing space (see previous article) for Flight Recorder to start behaving again.

Summary 

I wonder whether having code in long running tight loops is actually a realistic scenario in which case my findings and issues will just be academic.  Or perhaps there might be scenarios, especially in failure cases, where this might actually be the case.  Certainly the behaviour of code in tight loops does shed some light on the way safepoints, GC, and monitoring tools interact with each other.


Wednesday, 11 March 2015

Java Flight Recorder since jdk 1.8.0_40 / Further comments on safe points

I refer you my first blog post on this subject where I went through the fundamentals of Java Flight Recorder.

The really annoying thing about the tool is that in order to be able to run JFR against your program, the program had to be started with the system properties:

-XX:+UnlockCommercialFeatures -XX:+FlightRecorder. 
This is not always convenient or even possible in certain runtime environments or where the process has been running a long time and is in a particular state that you would like to profile rather than restart.

Enter the latest jdk 1.8.0_40. With this build of the jdk you no longer need to have those options enabled when you started the java program. Just launch jmc and select your process as per normal and you will be presented with this dialog box.


You will be able to enable those command line options dynamically.  An excellent improvement to this excellent product!

Whilst on the topic of command line options for JFR I've started using this system property when running my programs. -XX:+DebugNonSafepoints 

According to the documentation:
One nice property of the JFR method profiler is that it does not require threads to be at safe points in order for stacks to be sampled. However, since the common case is that stacks will only be walked at safe points, HotSpot normally does not provide metadata for non-safe point parts of the code, which means that such samples will not be properly resolved to the correct line number and BCI. That is, unless you specify:
-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints 
With DebugNonSafepoints, the compiler will generate the necessary metadata for the parts of the code not at safe points as well. 

To my mind the whole point of using JFR as opposed to YourKit etc is that it doesn't have to respect safe points and you can get more accurate stats about where time is actually being spent. I have seen evidence of this even without using this system property. But with this option enabled I expect the debug stats are even more accurate. I had a situation where I was looking for hotspot and JFR pointed me to a high level method which didn't help me at all. I then enabled the DebugNonSafepoints JFR pointed me to the exact point in the code where hotspot actually lay - fantastic!

Friday, 16 January 2015

Java Flight Recorder (JFR)

JFR is a Java profiler which will allow you to investigate the runtime characteristics of your code. Typically you will use a profiler to determine which parts of your code are causing  large amounts of memory allocation or causing excess CPU to be consumed.

There are plenty of products out there.  In the past I've used YourKit, OptimizeIt, JProfiler, NetBeans and others. Each has its benefits and it is largely a matter of personal preference as to which you choose. My current personal favourite is YourKit. It integrates nicely into IntelliJ has a relatively low overhead and presents its reports well.

The truth is that profiling is a very inexact science and it is often worth looking at more than one profiler to build up a clearer picture of what exactly is going on in your program. To my knowledge most of the profilers rely on the JVMP/JVMTI agents to probe the Java program. A major problem with this is safe points. This means your Java program can only be probed when it is at a safe point. This means that you will get a false picture of what is really going on in your program especially if much of the activity is between safe points. Also all profilers, to a varying degree add overhead.  Profiler overhead will change the characteristics of your program and may cause misleading results from your analysis. Much more information here.

Enter JFR.  JRF has been bundled with the JDK since release 7u40. JFR is built with direct access to the JVM. This not only means that there is a very low overhead (claimed to be less than 1% in nearly all cases) but also does not rely on safe points.  Have a look here at an example of how radically different an analysis from YourKit and JFR can look.

To run JFR you need to add these switches to your Java command line:

-XX:+UnlockCommercialFeatures -XX:+FlightRecorder

JFR is located in Java Mission Control (JMC).  To launch JMC just type jmc in your command line and if you have the JDK in your path the JMC console will launch.  You should see your Java program in the left hand pane.  Right click on your program and then start flight recording.


You will be presented with a dialog box where you can just accept the defaults (sample for a minute) and then your results will be displayed.  It's worth paying around with the options to find how this will work best for you.  As with all good products this GUI is fairly intuitive.

As you can tell from the command line switches it is commercial feature.  I'm not exactly sure what that means but you can read more about that in the documentation here. Also you can run this from the command line, it's all in the documentation.

One problem I did find was when I downloaded the latest Java8 snapshot (at this time 1.8.0_40-ea) I was unable to launch my program and got the following message:

/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/
Error: Trying to use 'UnlockCommercialFeatures', but commercial features are not available in this VM.
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

In summary, JFR is a great addition to any developers toolkit and as long as you are using JDK release 7u40 or above it's certainly worth trying it out on your code.

(I encourage you to have a look at a previous post First rule of performance optimisation in conjunction with JFR)