Skip to content

[WIP] Add JDK Flight Recorder support for Java native images #3070

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from

Conversation

jiekang
Copy link
Collaborator

@jiekang jiekang commented Dec 15, 2020

This PR adds infrastructure to SubstrateVM supporting JDK Flight Recorder for Java native images (#3018). This is a work in progress and the intention is to welcome feedback from Oracle’s GraalVM team and the GraalVM community during the development process leading up to initial integration.

Completed Features

  • Provide compile-time configuration to include or exclude the JFR infrastructure (-H:+FlightRecorder)
  • Support existing OpenJDK API to start, stop and dump JFR v2.1 compliant recording files (.jfr) during native image run (jdk.jfr.Recording and related classes)
  • Support build-time inclusion of application-defined events derived from jdk.jfr.Event and related classes

In Progress Features

  • Emit OpenJDK Java specific events (Socket/File read/write, ExceptionThrown, etc.)
  • Allow command-line options to configure recordings on start-up, similar to the options in OpenJDK

Other Issues in progress

Along with the major features above, many minor issues that will be addressed are tracked here: https://github.com/rh-jmc-team/graal/issues Issues labelled “jfr-initial-release” are targeted for eventual completion in this PR. This issue tracker is temporary and used to cleanly separate work items for this integration in a publicly accessible location. Note, as this is WIP, new issues will surely appear and be tracked there as well. Once this PR is accepted, we will follow upstream practices for all future work. As well, more long-term additions are described in the original issue: #3018

Testing

This implementation has been tested against native images produced from sources in this repository: https://github.com/rh-jmc-team/jfr-tests At this time, tests are manually run. We intend to add automated test infrastructure to compile test sources into native images, run them, and verify the resulting JFR file. Tests against other applications have been and will continue to be run as well.

As an example from the jfr-tests repository, TestConcurrentEvents starts a recording to disk, creates 8 threads that emit 1024*1024 events each, stops the recording and prints the location of the produced .jfr file. See substratevm/JFR.md in this branch for instructions on building with FlightRecorder enabled.

Contributors:

Andrew Dinn [email protected]
Jie Kang [email protected]
Roman Kennke [email protected]
Josh Matsuoka [email protected]
Tako Schotanus [email protected]
Simon Tooke [email protected]

jiekang and others added 2 commits December 15, 2020 13:36
Co-authored-by: Andrew Dinn <[email protected]>
Co-authored-by: Roman Kennke <[email protected]>
Co-authored-by: Josh Matsuoka <[email protected]>
Co-authored-by: Tako Schotanus <[email protected]>
Co-authored-by: Simon Tooke <[email protected]>
@graalvmbot
Copy link
Collaborator

Hello jkang, thanks for contributing a PR to our project!

We use the Oracle Contributor Agreement to make the copyright of contributions clear. We don't have a record of you having signed this yet, based on your email address jkang -(at)- redhat -(dot)- com. You can sign it at that link.

If you think you've already signed it, please comment below and we'll check.

@jerboaa
Copy link
Collaborator

jerboaa commented Dec 16, 2020

Note that @jiekang is covered by the Red Hat OCA. He works for Red Hat (as do other contributors).

@thegreystone
Copy link

Nice one guys! :)

@christianwimmer
Copy link

Thanks for the PR, we will have a look at it. Due to the imminent winter break, holidays, vacation, ... a realistic ETA for the first round of comments on our side is middle of January.

I'm also glad that you explicitly mark the PR as "WIP". Still, it is a lot of work that you already did on it (about 9000 lines of code added). I do not think it is a sustainable approach to make such large contributions without discussing the overall architecture beforehand. Ideally, the architecture discussions should start before writing the first line of code. Slack and Zoom discussions upfront are far more time effective than after-the-fact reviews.

@adinn
Copy link
Collaborator

adinn commented Dec 17, 2020

Hi Christian,

Yes, this is a work in progress -- although I do feel we have made a lot of progress to arrive where we are now. I understand why you might have preferred us to discuss this PR in advance. However, we had two problems doing that before now. First, we felt we had to have something tangible to discuss. Second, we wanted to be able to enter any discussion from a position where we understood enough of the complexities of how to implement a solution to make that discussion realistic and not just end up offloading onto you work that we had chosen to undertake.

We have released this PR with a basic set of useful, fully working features, extra features being left for further work. So, this really constitutes a 'pilot' implementation i.e. one that tests the feasibility of a partial solution right through to completion and at the same time tries to establish an acceptable path forward to a full solution. We have no belief that this implementation is the only way or the right way to do this and certainly have no expectation that you will simply rubber-stamp our work.

That explains why we wanted to be sure our approach really did work for those basic cases before asking you to look at it and help us determine if there is a better approach. In getting to this point we have uncovered and resolved a lot of problems shoe-horning JFR into Graal native, many to do with the change from dynamic to static (build/runtime) operation, just as many to do with the need to uncouple the close linkage between JFR and Hotspot and identify/recreate an equivalent in GraalVM. We have also understood a lot about the problems we have not yet addressed. I really don't believe we could have had a useful discussion with you without working this solution through as far as we have.

So, please don't feel we are trying to push this through a hasty review with no desire to listen to your opinion or follow your advice. Of course, we want to get a working version released as soon as possible in order for our users to be able to benefit from having a monitoring solution. However, we also want to get it right, so that we can maintain the functionality we have implemented so far and add the missing capabilities that we also need. We very much welcome your feedback and hope you can help us steer this towards a better design and better implementation.

@christianhaeubl
Copy link
Member

Thanks for the PR. So far, I only had a very brief look at the code but here a couple high-level comments that will probably have a large impact on your work in progress:

  • The current implementation assumes that it will only be used in Java code that can allocate Java objects. However, JFR events (e.g., allocation, safepoint, and GC related events) may be triggered in places where Native Image must not allocate Java objects and can only call uninterruptible code. So, a lot of the core JFR infrastructure will have to be uninterruptible.
  • Similar to HotSpot, every thread will need a Java and a native buffer. Otherwise, events such as safepoints or GCs would destroy the Java-level buffer.
  • Please note that VM operations on Native Image can allocate and cause a GC. This is a major difference to HotSpot and may cause issues as a GC may trigger JFR events. Similar to that, I saw that a method that allocates a Java object but that has the following comment: "Do not safepoint here".
  • A few things are too close to the HotSpot sources where this was ported from, e.g., most parts related to class (un)loading won't be needed for Native Image.

As @christianwimmer already mentioned, I will have a closer look at the PR in January.

@christianwimmer
Copy link

Hi all, now that all the holidays are over we have some discussions and decisions to make.

We have also worked on some initial JFR support in the last months, and we have the approvals to make it public. @christianhaeubl is working on that, and will have that ready soon. Our work is also heavy work-in-progress. Some parts are more complete than this PR (like avoiding problems with GC and safepoints), some parts are less complete.

Our hope would be to combine both implementations and work together on a combined implementation. As as starting point, I suggest that we schedule a Zoom with all contributors. To find a time for that, and also to have more interactive control, we created a dedicated Slack channel in the GraalVM slack workspace: #native-image-jfr. This link should get you directly to it: https://graalvm.slack.com/archives/C01K81PSVLM

@adinn
Copy link
Collaborator

adinn commented Jan 21, 2021

@christianwimmer Hi Christian. Thanks for the update and enormous thanks to the Oracle team for deciding to open up your work and combine forces with the rest of the community to ensure JFR works in CE releases. Red Hat very much appreciates your offer of help and welcomes the benefit of your expertise in areas like GC and safepoints where our experience is still weak. Looking forward to the Zoom call and following up on this effort on Slack.

Copy link
Member

@christianhaeubl christianhaeubl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only a partial and not very detailed review. I skipped many classes where I think that large changes will be necessary anyways due to the differences between the HotSpot and the Native Image safepoint and VM thread handling.

Major issues:

  • With the current implementation, it won't be possible to support many of the native events (i.e., the ones that are defined in the metadata.xml file).
  • In Native Image, every thread can trigger JFR events. So, even if a thread is at a safepoint and executes a VM operation, it can still trigger JFR events (e.g., because it allocates a Java object). This is a major difference to HotSpot and breaks some of the assumptions that the HotSpot code has.

package com.oracle.svm.core.jdk.jfr;

public class JavaJfrUtils {
public static boolean isJFRClass(Class<?> clazz) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please clearly separate between code that can be executed at runtime and code that is only executed at image build time, see @Platforms(Platform.HOSTED_ONLY.class) (affects the whole PR).

import org.graalvm.compiler.serviceprovider.JavaVersionUtil;

public class JfrAvailability {
public static boolean withJfr = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use an existing (AllowVMInspection) or a new HostedOption instead.

import jdk.internal.event.Event;

@AutomaticFeature
public class JfrFeature implements Feature {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, JfrRuntimeAccessImpl is always getting registered. Is there a reason for that? I think this whole feature should only be enabled conditionally:

    @Override
    public boolean isInConfiguration(IsInConfigurationAccess access) {
        return JfrAvailability.withJfr;
    }

If the whole feature is enabled conditionally, the invidual checks for JfrAvailability.withJfr (at the begin of each method) can be removed.

Set<Module> modules = new HashSet<>();
for (Class<?> clazz : reachableClasses) {
if (JfrTraceId.getTraceId(clazz) == -1) {
JfrTraceId.assign(clazz);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary or can DynamicHub.getTypeID() be used as the trace id?

@@ -581,6 +581,24 @@
"spotbugs": "false",
},

"com.oracle.svm.core.jdk.jfr.test": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JFR implementation should be in a separate project as well. Then it is easier to exclude the files from compilation when a JDK < 11 is used.

public final int bit;

JfrMsg(int msg) {
this.msg = msg;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

msg doesn't seem to be used. I think it would be more readable if the shifted bit was passed into the constructor, either in hex or binary representation (e.g., 0x80 or 0b10000000) as this is a well-known pattern for bit-based enums.

// is complete as it should shutdown cleanly as well
t.setDaemon(true);
t.start();
Jfr.excludeThread(t);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excluding a thread is something that will cause issues on Native Image (i.e., JFR events will be missing). In Native Image, any thread can trigger JFR events. This even includes the VM thread when it is at a safepoint. This is a key difference to HotSpot and therefore breaks a lot of the assumptions that are in the HotSpot code.

A few examples that can happen when the JfrRecorderThread is excluded:

  • The JfrRecorderThread may allocate new Java objects. The JFR events for those object allocations would be suppressed.
  • The JfrRecorderThread allocates a new Java object. This triggers a GC, so the JfrRecorderThread is temporarily promoted to the VM thread. So, all GC-related events would be suppressed.

@@ -59,6 +60,9 @@
@Inject @RecomputeFieldValue(kind = RecomputeFieldValue.Kind.Reset)//
boolean wasStartedByCurrentIsolate;

@Inject @RecomputeFieldValue(kind = RecomputeFieldValue.Kind.NewInstance, declClass = JfrThreadLocal.class)//
JfrThreadLocal jfrThreadLocal;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to store thread-local values directly on the Java Thread object. We use a different concept for thread-locals, see FastThreadLocal.

*
*/

public class JfrBuffer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As already mentioned, everything related to JFR storage infrastructure must be uninterruptible. So, no Java memory may be allocated and if Java objects are used, they must live in the image heap. Otherwise, it will be impossible to support many of the native events, e.g., allocation and GC-related events.

import com.oracle.svm.core.jdk.jfr.recorder.checkpoint.types.traceid.JfrTraceIdEpoch;
import com.oracle.svm.core.jdk.jfr.recorder.storage.operations.JfrBufferOperations.BufferOperation;

public class JfrMemorySpace<U extends JfrMemorySpaceRetrieval, V extends JfrMemorySpace.Client> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if it makes sense to port the HotSpot C++ templates to Java generics. In HotSpot, the templates kill readability but at least provide good performance. In Native Image, the Java generics may reduce both the readability and the performance as they introduce virtual calls.

@graalvmbot
Copy link
Collaborator

jkang has signed the Oracle Contributor Agreement (based on email address jkang -(at)- redhat -(dot)- com) so can contribute to this repository.

@jiekang
Copy link
Collaborator Author

jiekang commented Jun 28, 2021

Closing as the work has been done in #3444

@jiekang jiekang closed this Jun 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants