Summary
We are proposing improvements to the LLVM IR representation and optimization of the AArch64 FPMR state register used by FP8 intrinsics. The current approach relies on IntrInaccessibleMemOnly to preserve the ordering between FPMR setup and FP8 instructions, but this inhibits many standard compiler optimizations. This RFC outlines several options that enable these optimizations while preserving correctness. We have sketched some solutions below and would welcome feedback on which, if any, is preferred (or should be avoided).
Background and Motivation
Arm’s FP8 intrinsics require setting the FPMR register, which encodes the format, scale, and overflow mode for 8-bit floating-point operations. Since C lacks a first-class notion of FP8 types with varying representations, each FP8 intrinsic in the Arm C Language Extensions (ACLE) maps to two LLVM IR intrinsics:
- A call to set FPMR (e.g., llvm.aarch64.set.fpmr)
- The actual FP8 intrinsic (e.g., llvm.aarch64.neon.fp8.cvt)
To ensure the FPMR setting is preserved correctly, we currently mark the FP8 intrinsics as IntrInaccessibleMemOnly. This approach guarantees execution order but blocks common transformations such as redundant FPMR write elimination and code motion.
For example, given the C code:
float16x8_t v1 = vcvt2_f16_mf8_fpm(op1, fpm);
float16x8_t v2 = vcvt2_f16_mf8_fpm(op2, fpm);
The current LLVM IR emitted is:
call void @llvm.aarch64.set.fpmr(i64 %fpm)
%v1 = call <8 x half> @llvm.aarch64.neon.fp8.cvtl2.v8f16.v8i8(<8 x i8> %op1)
call void @llvm.aarch64.set.fpmr(i64 %fpm)
%v2 = call <8 x half> @llvm.aarch64.neon.fp8.cvtl2.v8f16.v8i8(<8 x i8> %op2)
Despite the FPMR values being identical, the compiler cannot remove the second set.fpmr call or hoist either of them, due to the strict memory side effects.
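For reference, this is the IR we would like the optimizer to be able to produce for this example (a sketch, assuming the compiler can prove the second identical FPMR write redundant):

```llvm
; Desired IR after redundant FPMR-write elimination: the second
; set.fpmr call writes the same %fpm value, so it can be removed
; and both conversions execute under a single FPMR setup.
call void @llvm.aarch64.set.fpmr(i64 %fpm)
%v1 = call <8 x half> @llvm.aarch64.neon.fp8.cvtl2.v8f16.v8i8(<8 x i8> %op1)
%v2 = call <8 x half> @llvm.aarch64.neon.fp8.cvtl2.v8f16.v8i8(<8 x i8> %op2)
```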
Goals
• Ideally an IR based representation rather than pushing everything to the code generator.
• Enable standard optimizations (e.g., hoisting, CSE) for set.fpmr when safe.
• Maintain correct ordering and dependency between FPMR writes and FP8 intrinsics.
• Avoid over-restricting transformations with IntrInaccessibleMemOnly.
• Provide a scalable and reusable solution for future instruction encodings with similar properties (e.g., SME ZA and ZT0 registers). We have a similar investigation ongoing to track the liveness of ZA state at the IR level where we’re hoping to build on top of the FPMR design.
Proposed Approaches
Option 1: Model FPMR write via Address Space and Memory Access
1.1. Use Store to a Global in a custom Address Space and create new metadata to represent dependencies between the store and the fp8 intrinsic
In this option we replace IntrInaccessibleMemOnly with IntrReadMem on the FP8 intrinsics. We cannot use IntrNoMem because it would prevent any further alias analysis.
Replace @llvm.aarch64.set.fpmr with a store to a global variable in a non-default address space (e.g., addrspace(1)):
store i8 fpm_value, ptr addrspace(1) @globalFPMRAddr
FP8 intrinsics are then modeled as reads from this address with the use of a new metadata “onlyreads.addrspace”:
%v = call <4 x half> @llvm.aarch64.neon.fp8.fdot2(...), !onlyreads.addrspace !0
!0 = !{!"globalFPMRAddr", i32 1}
This allows alias analysis to determine that repeated stores with the same value can be optimized.
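Applied to the motivating example, the Option 1.1 representation would look roughly like this (a sketch: @globalFPMRAddr and !onlyreads.addrspace are the constructs proposed here, not existing LLVM features):

```llvm
; FPMR writes become ordinary stores into a dedicated address space.
store i64 %fpm, ptr addrspace(1) @globalFPMRAddr
%v1 = call <8 x half> @llvm.aarch64.neon.fp8.cvtl2.v8f16.v8i8(<8 x i8> %op1), !onlyreads.addrspace !0
; Same value stored to the same location: standard dead-store /
; store-forwarding reasoning can now delete this second store.
store i64 %fpm, ptr addrspace(1) @globalFPMRAddr
%v2 = call <8 x half> @llvm.aarch64.neon.fp8.cvtl2.v8f16.v8i8(<8 x i8> %op2), !onlyreads.addrspace !0

!0 = !{!"globalFPMRAddr", i32 1}
```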
Pros:
• Leverages existing memory modeling and analysis infrastructure.
• Makes dependencies explicit via memory SSA and aliasing.
• No changes required to existing FP8 intrinsic function definitions.
Cons:
• Requires representing a system register as a memory object.
• Need to add a backend hook to assume that different address spaces cannot alias.
Open Questions:
Is it sensible to use an address space to represent a single register? Or are we changing the purpose of address space?
Possible Extension:
• !onlywrites.addrspace
• !readwrite.addrspace
This approach preserves optimization opportunities when aliasing is known to be disjoint.
1.2. Use Store to a Global in a custom Address Space and encode relation via pointers to global address space in the fp8 intrinsic
Replace @llvm.aarch64.set.fpmr with a store to a global variable in a non-default address space (e.g., addrspace(1)) and change the fp8 intrinsics to have an extra parameter to point to the FPMR global pointer.
store i8 23, ptr addrspace(1) @globalFPMRAddr
FP8 intrinsics are then modeled as reads from this address with the use of an extra function parameter and IntrArgMemOnly:
%v = call <4 x half> @llvm.aarch64.neon.fp8.fdot2(..., ptr addrspace(1) @globalFPMRAddr)
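For the motivating example, the Option 1.2 form would look roughly like the following (a sketch; the trailing pointer operand is the proposed signature change and does not exist today):

```llvm
; FPMR write as a store; each FP8 intrinsic names the FPMR location
; it reads via an explicit pointer argument, so IntrArgMemOnly plus
; ordinary aliasing makes the dependency visible to the optimizer.
store i64 %fpm, ptr addrspace(1) @globalFPMRAddr
%v1 = call <8 x half> @llvm.aarch64.neon.fp8.cvtl2.v8f16.v8i8(<8 x i8> %op1, ptr addrspace(1) @globalFPMRAddr)
; Redundant store of the same value: removable once aliasing is known.
store i64 %fpm, ptr addrspace(1) @globalFPMRAddr
%v2 = call <8 x half> @llvm.aarch64.neon.fp8.cvtl2.v8f16.v8i8(<8 x i8> %op2, ptr addrspace(1) @globalFPMRAddr)
```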
Pros:
• Leverages existing memory modeling and analysis infrastructure.
• Makes dependencies explicit via memory SSA and aliasing.
Cons:
• All intrinsics affected by FPMR need to be changed to add the extra parameter, with further codegen changes to ignore it.
• It might not scale nicely if we want to apply the same solution to the SME ZA and ZT0 registers.
Option 2: Extend IntrInaccessibleMemOnly into an Enum
Redefine the IntrInaccessibleMemOnly bit as an enumeration to distinguish among different kinds of inaccessible memory effects, such as:
• InaccessibleMemory (legacy behavior)
• FPMRRegister
• OtherRegisterX
This would allow passes to recognize cases where transformations are still valid (e.g., when the same register value is written repeatedly).
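In terms of today's memory(...) attribute syntax, which IntrInaccessibleMemOnly lowers to, the idea could be sketched as follows (hypothetical: the "fpmr" location kind does not exist in LLVM today; the current behavior is memory(inaccessiblemem: readwrite)):

```llvm
; Hypothetical attribute encodings splitting the opaque inaccessible
; location into named kinds, so passes can tell that two calls only
; touch FPMR rather than arbitrary inaccessible memory.
declare void @llvm.aarch64.set.fpmr(i64) #0
declare <8 x half> @llvm.aarch64.neon.fp8.cvtl2.v8f16.v8i8(<8 x i8>) #1

attributes #0 = { memory(fpmr: write) }  ; writes only the FPMR "location"
attributes #1 = { memory(fpmr: read) }   ; reads only the FPMR "location"
```

With this encoding, two adjacent writes of the same value to the fpmr location could be merged without any address-space machinery.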
Pros:
• Avoids using address-spaces and memory modeling.
• No changes are needed in the implementation of FP8.
Cons:
• Requires updating all passes that interpret IntrInaccessibleMemOnly to ensure they distinguish between the different enum values.
• May be less general than Option 1.
Option 3: Backend-Only Handling
Defer all FPMR modeling to the backend and don’t represent the dependency in IR.
%v = call <4 x half> @llvm.aarch64.neon.fp8.fdot2( ..., fpm)
Pros:
• Simplifies IR semantics.
Cons:
• Prevents mid-end optimizations.
• Makes IR opaque to analysis.
• We were hoping to avoid this path.
• Requires changing all affected intrinsics to add an extra parameter, as in Option 1.2, albeit this time passing the FPMR value itself rather than a pointer to it.
Recommendation
We believe Option 1, especially 1.1 (address space plus metadata), provides the best balance of performance, correctness, and extensibility.
We are looking for feedback on:
• The suitability of address space and metadata modeling.
• Feasibility of extending IntrInaccessibleMemOnly semantics.
• Any alternative idioms used for similar problems (e.g., SME ZA and ZT0).