Stable hashing of MLIR types

andrey-golubev · May 26, 2025, 10:44am

Our compiler needs to compute a hash that is “stable”: primarily, it must not change between two compilations (i.e. between two independent processes). Unfortunately, the default hashing (which hashes a pointer) therefore doesn’t work for us.

So far, the best way to achieve this - without rolling out custom dispatch - seems to be “print to string → hash the string” but this is rather slow in general and is extremely slow in edge cases (e.g. when dealing with quantized types that store arrays of scales).
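For illustration, here is a minimal sketch of the “print to string → hash the string” approach. It assumes the textual form is already available as a string (in MLIR it could be produced by printing the type into an `llvm::raw_string_ostream`), and it uses 64-bit FNV-1a purely as an example of a hash that depends only on the input bytes; the function name is hypothetical and the choice of FNV-1a is an assumption, not what our compiler necessarily uses.

```cpp
#include <cstdint>
#include <string_view>

// 64-bit FNV-1a. The result depends only on the input bytes, so it is the
// same in every process and every build.
constexpr std::uint64_t fnv1a64(std::string_view bytes) {
  std::uint64_t hash = 0xcbf29ce484222325ULL; // FNV offset basis
  for (unsigned char c : bytes) {
    hash ^= c;
    hash *= 0x100000001b3ULL; // FNV prime
  }
  return hash;
}

// Usage: `printed` would hold the textual form of a type, e.g. "i32".
//   std::uint64_t h = fnv1a64(printed);
```

The cost is proportional to the length of the printed form, which is why types carrying large arrays (such as per-axis scales) are the slow case.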

Interestingly, MLIR itself solves this problem, as types are uniqued. Yet this is hidden behind the templated type uniquer / type storage facility. From a brief look, getting the type storage hash requires static dispatch [1], which means a pretty much hand-rolled TypeSwitch or something similar.

Is there any good facility that allows accessing the “real hash” of a type (*TypeStorage::hashKey()) via some type-erasure / dynamic dispatch? (Perhaps I am missing something.)
Alternatively, does it make sense to expose the existing infrastructure in some way to allow easy access to a “stable” hash? As for the API, I envision something conceptually similar to:

mlir::Type someType = getSomeTypeFromSomewhere(...);
someType.getStableHash();

[1]: To add here, the problem doesn’t really exist when constructing a type, because at that point the real (“most-derived” if I may) C++ type is used.
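To make the “type-erasure / dynamic dispatch” idea concrete, below is a rough, self-contained sketch. None of these classes are MLIR’s real ones: `StorageBase`, `IntegerStorage`, `VectorStorage`, the hook table, and `getStableHash` are all hypothetical stand-ins. The assumption is that each concrete storage class registers a function pointer that knows how to hash it, so that the caller only needs a base pointer.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Bijective 64-bit mixer (splitmix64 finalizer), used as a toy combiner.
inline std::uint64_t mix64(std::uint64_t x) {
  x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
  x ^= x >> 27; x *= 0x94d049bb133111ebULL;
  x ^= x >> 31;
  return x;
}
inline std::uint64_t combine(std::uint64_t h, std::uint64_t v) {
  return mix64(h ^ v);
}

constexpr int IntegerKind = 1;
constexpr int VectorKind = 2;

// Hypothetical type-erased base: only records which concrete class it is.
struct StorageBase { int kind; };

struct IntegerStorage : StorageBase {
  unsigned width;
  explicit IntegerStorage(unsigned w) : StorageBase{IntegerKind}, width(w) {}
  static std::uint64_t hashKey(const IntegerStorage &s) {
    return combine(combine(0, IntegerKind), s.width);
  }
};

struct VectorStorage : StorageBase {
  unsigned numElements, elementWidth;
  VectorStorage(unsigned n, unsigned w)
      : StorageBase{VectorKind}, numElements(n), elementWidth(w) {}
  static std::uint64_t hashKey(const VectorStorage &s) {
    return combine(combine(combine(0, VectorKind), s.numElements),
                   s.elementWidth);
  }
};

// The hook recovers the concrete type, so callers never have to.
using HashHook = std::uint64_t (*)(const StorageBase *);
template <typename ConcreteT>
std::uint64_t hashHook(const StorageBase *base) {
  return ConcreteT::hashKey(*static_cast<const ConcreteT *>(base));
}

inline const std::unordered_map<int, HashHook> &hookTable() {
  static const std::unordered_map<int, HashHook> table = {
      {IntegerKind, &hashHook<IntegerStorage>},
      {VectorKind, &hashHook<VectorStorage>},
  };
  return table;
}

// Dynamic dispatch on the stored kind; no TypeSwitch at the call site.
inline std::uint64_t getStableHash(const StorageBase *storage) {
  auto it = hookTable().find(storage->kind);
  assert(it != hookTable().end() && "storage kind has no registered hook");
  return it->second(storage);
}
```

The point of the sketch is only the shape of the dispatch: the hash depends on the stored fields rather than on any address, and the static knowledge of the most-derived type is captured once, at registration time.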

mehdi_amini · May 26, 2025, 11:05am

Being able to produce a stable hash for the in-memory IR without printing it would be a useful feature. It’s been requested a few times, but we always fall back to printing instead (bytecode is more comprehensive than the text, by the way).

Now the internal hash is not deterministic across process execution though, so I don’t think exposing it through an API would help.

See this snippet:

/// In LLVM_ENABLE_ABI_BREAKING_CHECKS builds, the seed is non-deterministic
/// per process (address of a function in LLVMSupport) to prevent having users
/// depend on the particular hash values. On platforms without ASLR, this is
/// still likely non-deterministic per build.
inline uint64_t get_execution_seed() {
#if LLVM_ENABLE_ABI_BREAKING_CHECKS
  return static_cast<uint64_t>(
      reinterpret_cast<uintptr_t>(&install_fatal_error_handler));
#else
  return 0xff51afd7ed558ccdULL;
#endif
}
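To illustrate why the seed matters, here is a toy sketch (not LLVM’s actual hashing code): the seed is folded into every hash, so a seed that differs per process gives different values for the same input, while a fixed seed gives reproducible ones. The names `mix64` and `seededHash` are hypothetical; only the constant is taken from the snippet above.

```cpp
#include <cstdint>

// splitmix64 finalizer: a bijective 64-bit mixing function.
constexpr std::uint64_t mix64(std::uint64_t x) {
  x ^= x >> 30;
  x *= 0xbf58476d1ce4e5b9ULL;
  x ^= x >> 27;
  x *= 0x94d049bb133111ebULL;
  x ^= x >> 31;
  return x;
}

// Toy seeded hash of a single integer value.
constexpr std::uint64_t seededHash(std::uint64_t seed, std::uint64_t value) {
  return mix64(seed ^ value);
}

// The fixed seed returned when LLVM_ENABLE_ABI_BREAKING_CHECKS is off.
constexpr std::uint64_t kFixedSeed = 0xff51afd7ed558ccdULL;

// With the fixed seed, seededHash(kFixedSeed, 42) is the same in every run.
// With an address-derived seed, the same call may differ from run to run.
```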

andrey-golubev · May 26, 2025, 11:24am

I haven’t thought of the byte-code, thanks! It might be worth checking how it performs in comparison.

Any specific reason that this is not implemented yet (apart from the lack of engineering time to spend on this perhaps) i.e. blockers / concerns / etc.? I guess the general rationale applies also to attributes (they may benefit from this also).

Thanks for the heads-up. We disable this via a CMake option (fingers crossed it’s not removed in the future).

mehdi_amini · May 26, 2025, 11:29am

Just lack of engineering time: especially when there is a working solution (hashing bytecode/text).

That’ll work for you, but unfortunately that makes it a deal-breaker for building core APIs on top of it.

andrey-golubev · May 26, 2025, 11:33am

I see. OK, I think we’ll discuss this inside our project; if there’s enough interest, we might come up with an RFC and patches.

Could you clarify? (Not sure what “start building core APIs” means). We have a luxury (not really) of maintaining our own fork with custom patches. It’s curious to see what problems we may get if we just blatantly disable all ABI breaking checks.

mehdi_amini · May 26, 2025, 11:37am

I meant we can’t offer a llvm::hash_value mlir::stableHash(mlir::Operation *op); API implemented on something that only works with a very specific CMake option.

andrey-golubev · May 26, 2025, 11:53am

Ah, right, somehow I missed this. I guess, in order to have the core API, there’s a larger open question of defining the stable hashing procedure given a non-stable hash seed, which sounds like a huge pain on its own… One pretty much needs to maintain a clean separation between “hashing with a random seed” and “stable hashing” somehow. Or have an MLIR-specific setup within LLVM to disable the non-deterministic seed?

From what I can see, the patch ([Hashing] Use a non-deterministic seed if LLVM_ENABLE_ABI_BREAKING_CH… · llvm/llvm-project@ce80c80 · GitHub) is a debug-aiding thing?:

LLVM_ENABLE_ABI_BREAKING_CHECKS defaults to WITH_ASSERTS and is
enabled in an assertion build.

In a non-assertion build, get_execution_seed returns the fixed value
regardless of NDEBUG. Removing a variable load yields noticeable
size/performance improvement.

So perhaps there’s a way to be “special” in MLIR but this is anyhow a wider discussion within LLVM already (so out of scope for this one). Anyhow, thanks for the prompt feedback!

andrey-golubev · May 27, 2025, 8:27am

An afterthought: in theory, if there is an API that just allows accessing the “internal hashing” procedure (the {attribute,type}Storage based hashing used during uniqued creation / lookup), we may avoid changing LLVM’s current policy. While still useless without the build option trickery, at least it is less intrusive so might be a way out in case changing the seed at LLVM level would be challenging (e.g. no consensus reached).

mehdi_amini · May 27, 2025, 6:45pm

“While still useless without the build option trickery, at least it is less intrusive so might be a way out”

I don’t quite follow this sentence? However I read it, it seems to point to “useless”, so how is it “a way out”?

andrey-golubev · May 28, 2025, 2:44pm

I’m thinking of something like the following:

1. Expose “*TypeStorage::getHash” for types and “*AttributeStorage::getHash” for attributes.
   - On their own, they just return the same hash that’s used inside MLIR to guarantee uniqueness of types / attributes; it is not stable (across re-compilations) due to the underlying LLVM infrastructure.
2. Downstream projects (in a broad sense) could switch off the randomized seed in LLVM.
3. The two pieces of the puzzle come together: 1) an API that provides non-pointer-based hashing (see bullet 1); 2) a guaranteed LLVM seed (and thus deterministic hashing) → now one can get a stable (across re-compilations) hash for types / attributes.

P.S.: again, this is mostly a “bad case” scenario if LLVM infrastructure couldn’t be changed for some reason.
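A rough sketch of what the combination could look like, under stated assumptions: `QuantizedKey` is a hypothetical stand-in for the uniquing key of a per-axis quantized type (not MLIR’s real storage class), `kFixedSeed` plays the role of the deterministic LLVM seed, and the mixing function is just a toy. The hash is computed from the stored fields directly, with no pointers and no printing.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Stands in for a seed that is guaranteed to be the same in every process.
constexpr std::uint64_t kFixedSeed = 0xff51afd7ed558ccdULL;

// Bijective 64-bit mixer (splitmix64 finalizer).
inline std::uint64_t mix64(std::uint64_t x) {
  x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
  x ^= x >> 27; x *= 0x94d049bb133111ebULL;
  x ^= x >> 31;
  return x;
}
inline std::uint64_t combine(std::uint64_t h, std::uint64_t v) {
  return mix64(h ^ v);
}

// Hypothetical uniquing key: one scale and one zero point per axis element.
struct QuantizedKey {
  unsigned storageBits;
  std::vector<double> scales;
  std::vector<std::int64_t> zeroPoints;
};

// Hash the bit pattern of a double (note: 0.0 and -0.0 hash differently).
inline std::uint64_t doubleBits(double d) {
  static_assert(sizeof(std::uint64_t) == sizeof(double), "expects 64-bit double");
  std::uint64_t bits;
  std::memcpy(&bits, &d, sizeof(bits));
  return bits;
}

// Structural hash: depends only on the field values and the fixed seed.
inline std::uint64_t stableHash(const QuantizedKey &key) {
  assert(key.scales.size() == key.zeroPoints.size() &&
         "expected one zero point per scale");
  std::uint64_t h = combine(kFixedSeed, key.storageBits);
  h = combine(h, key.scales.size());
  for (double s : key.scales)
    h = combine(h, doubleBits(s));
  for (std::int64_t z : key.zeroPoints)
    h = combine(h, static_cast<std::uint64_t>(z));
  return h;
}
```

Two keys with equal fields hash equally even when they live at different addresses, which is the property that pointer-based hashing lacks.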
