Summary
We are proposing improvements to the LLVM IR representation and optimization of the AArch64 FPMR state register used by FP8 intrinsics. The current approach relies on IntrInaccessibleMemOnly to preserve the ordering between FPMR setup and FP8 instructions, but this inhibits many standard compiler optimizations. This RFC outlines several options that enable these optimizations while preserving correctness. We have sketched some solutions below and would welcome feedback on which, if any, is preferred (or should be avoided).
Background and Motivation
Arm’s FP8 intrinsics require setting the FPMR register, which encodes the format, scale, and overflow mode for 8-bit floating-point operations. Since C lacks a first-class notion of FP8 types with varying representations, each FP8 intrinsic in the Arm C Language Extensions (ACLE) maps to two LLVM IR intrinsics:
- A call to set FPMR (e.g., llvm.aarch64.set.fpmr)
- The actual FP8 intrinsic (e.g., llvm.aarch64.neon.fp8.cvt)
To ensure the FPMR setting is preserved correctly, we currently mark the FP8 intrinsics as IntrInaccessibleMemOnly. This approach guarantees execution order but blocks common transformations such as redundant FPMR write elimination and code motion.
For example, given the C code:
float16x8_t v1 = vcvt2_f16_mf8_fpm(op1, fpm);
float16x8_t v2 = vcvt2_f16_mf8_fpm(op2, fpm);
The current LLVM IR emitted is:
call void @llvm.aarch64.set.fpmr(i64 %fpm)
%v1 = call <8 x half> @llvm.aarch64.neon.fp8.cvtl2.v8f16.v8i8(<8 x i8> %op1)
call void @llvm.aarch64.set.fpmr(i64 %fpm)
%v2 = call <8 x half> @llvm.aarch64.neon.fp8.cvtl2.v8f16.v8i8(<8 x i8> %op2)
Despite the FPMR values being identical, the compiler cannot remove the second set.fpmr call or hoist either of them, due to the strict memory side effects.
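For reference, this is the IR we would like the optimizer to be able to produce for this example (a sketch, assuming the compiler can prove the second identical FPMR write redundant):

```llvm
; Desired IR after redundant FPMR-write elimination: the second
; set.fpmr call writes the same %fpm value, so it can be removed
; and both conversions execute under a single FPMR setup.
call void @llvm.aarch64.set.fpmr(i64 %fpm)
%v1 = call <8 x half> @llvm.aarch64.neon.fp8.cvtl2.v8f16.v8i8(<8 x i8> %op1)
%v2 = call <8 x half> @llvm.aarch64.neon.fp8.cvtl2.v8f16.v8i8(<8 x i8> %op2)
```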
Goals
• Ideally an IR based representation rather than pushing everything to the code generator.
• Enable standard optimizations (e.g., hoisting, CSE) for set.fpmr when safe.
• Maintain correct ordering and dependency between FPMR writes and FP8 intrinsics.
• Avoid over-restricting transformations with IntrInaccessibleMemOnly.
• Provide a scalable and reusable solution for future instruction encodings with similar properties (e.g., SME ZA and ZT0 registers). We have a similar investigation ongoing to track the liveness of ZA state at the IR level where we’re hoping to build on top of the FPMR design.
Proposed Approaches
Option 1: Model FPMR write via Address Space and Memory Access
1.1. Use Store to a Global in a custom Address Space and create new metadata to represent dependencies between the store and the fp8 intrinsic
In this option we replace IntrInaccessibleMemOnly with IntrReadMem on the FP8 intrinsics. We cannot use IntrNoMem because it would prevent any further alias analysis.
Replace @llvm.aarch64.set.fpmr with a store to a global variable in a non-default address space (e.g., addrspace(1)):
store i8 fpm_value, ptr addrspace(1) @globalFPMRAddr
FP8 intrinsics are then modeled as reads from this address with the use of a new metadata “onlyreads.addrspace”:
%v = call <4 x half> @llvm.aarch64.neon.fp8.fdot2(...), !onlyreads.addrspace !0
!0 = !{!"globalFPMRAddr", i32 1}
This allows alias analysis to determine that repeated stores with the same value can be optimized.
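Applied to the motivating example, the Option 1.1 representation would look roughly like this (a sketch: @globalFPMRAddr and !onlyreads.addrspace are the constructs proposed here, not existing LLVM features):

```llvm
; FPMR writes become ordinary stores into a dedicated address space.
store i64 %fpm, ptr addrspace(1) @globalFPMRAddr
%v1 = call <8 x half> @llvm.aarch64.neon.fp8.cvtl2.v8f16.v8i8(<8 x i8> %op1), !onlyreads.addrspace !0
; Same value stored to the same location: standard dead-store /
; store-forwarding reasoning can now delete this second store.
store i64 %fpm, ptr addrspace(1) @globalFPMRAddr
%v2 = call <8 x half> @llvm.aarch64.neon.fp8.cvtl2.v8f16.v8i8(<8 x i8> %op2), !onlyreads.addrspace !0

!0 = !{!"globalFPMRAddr", i32 1}
```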
Pros:
• Leverages existing memory modeling and analysis infrastructure.
• Makes dependencies explicit via memory SSA and aliasing.
• No changes required to existing FP8 intrinsic function definitions.
Cons:
• Requires representing a system register as a memory object.
• Need to add a backend hook to assume that different address spaces cannot alias.
Open Questions:
Is it sensible to use an address space to represent a single register? Or are we changing the purpose of address space?
Possible Extension:
• !onlywrites.addrspace
• !readwrite.addrspace
This approach preserves optimization opportunities when aliasing is known to be disjoint.
1.2. Use Store to a Global in a custom Address Space and encode relation via pointers to global address space in the fp8 intrinsic
Replace @llvm.aarch64.set.fpmr with a store to a global variable in a non-default address space (e.g., addrspace(1)) and change the fp8 intrinsics to have an extra parameter to point to the FPMR global pointer.
store i8 23, ptr addrspace(1) @globalFPMRAddr
FP8 intrinsics are then modeled as reads from this address with the use of an extra function parameter and IntrArgMemOnly:
%v = call <4 x half> @llvm.aarch64.neon.fp8.fdot2(..., ptr addrspace(1) @globalFPMRAddr)
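For the motivating example, the Option 1.2 form would look roughly like the following (a sketch; the trailing pointer operand is the proposed signature change and does not exist today):

```llvm
; FPMR write as a store; each FP8 intrinsic names the FPMR location
; it reads via an explicit pointer argument, so IntrArgMemOnly plus
; ordinary aliasing makes the dependency visible to the optimizer.
store i64 %fpm, ptr addrspace(1) @globalFPMRAddr
%v1 = call <8 x half> @llvm.aarch64.neon.fp8.cvtl2.v8f16.v8i8(<8 x i8> %op1, ptr addrspace(1) @globalFPMRAddr)
; Redundant store of the same value: removable once aliasing is known.
store i64 %fpm, ptr addrspace(1) @globalFPMRAddr
%v2 = call <8 x half> @llvm.aarch64.neon.fp8.cvtl2.v8f16.v8i8(<8 x i8> %op2, ptr addrspace(1) @globalFPMRAddr)
```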
Pros:
• Leverages existing memory modeling and analysis infrastructure.
• Makes dependencies explicit via memory SSA and aliasing.
Cons:
• All intrinsics affected by FPMR need to be changed to add the extra parameter, with further codegen changes to ignore it.
• It might not scale nicely if we want to apply the same solution to the SME ZA and ZT0 registers.
Option 2: Extend IntrInaccessibleMemOnly into an Enum
Redefine the IntrInaccessibleMemOnly bit as an enumeration to distinguish among different kinds of inaccessible memory effects, such as:
• InaccessibleMemory (legacy behavior)
• FPMRRegister
• OtherRegisterX
This would allow passes to recognize cases where transformations are still valid (e.g., when the same register value is written repeatedly).
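In terms of today's memory(...) attribute syntax, which IntrInaccessibleMemOnly lowers to, the idea could be sketched as follows (hypothetical: the "fpmr" location kind does not exist in LLVM today; the current behavior is memory(inaccessiblemem: readwrite)):

```llvm
; Hypothetical attribute encodings splitting the opaque inaccessible
; location into named kinds, so passes can tell that two calls only
; touch FPMR rather than arbitrary inaccessible memory.
declare void @llvm.aarch64.set.fpmr(i64) #0
declare <8 x half> @llvm.aarch64.neon.fp8.cvtl2.v8f16.v8i8(<8 x i8>) #1

attributes #0 = { memory(fpmr: write) }  ; writes only the FPMR "location"
attributes #1 = { memory(fpmr: read) }   ; reads only the FPMR "location"
```

With this encoding, two adjacent writes of the same value to the fpmr location could be merged without any address-space machinery.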
Pros:
• Avoids using address-spaces and memory modeling.
• No changes are needed in the implementation of FP8.
Cons:
• Requires updating all passes that interpret IntrInaccessibleMemOnly to ensure they distinguish between the different enum values.
• May be less general than Option 1.
Option 3: Backend-Only Handling
Defer all FPMR modeling to the backend and don’t represent the dependency in IR.
%v = call <4 x half> @llvm.aarch64.neon.fp8.fdot2( ..., fpm)
Pros:
• Simplifies IR semantics.
Cons:
• Prevents mid-end optimizations.
• Makes IR opaque to analysis.
• We were hoping to avoid this path.
• Requires changing all affected intrinsics to add an extra parameter, as in Option 1.2, albeit this time passing the FPMR value itself rather than a pointer to it.
Recommendation
We believe Option 1, especially 1.1 (address space plus metadata), provides the best balance of performance, correctness, and extensibility.
We are looking for feedback on:
• The suitability of address space and metadata modeling.
• Feasibility of extending IntrInaccessibleMemOnly semantics.
• Any alternative idioms used for similar problems (e.g., SME ZA and ZT0).