Stable hashing of MLIR types

Our compiler needs a hash that is “stable”: primarily, it must not change between two compilations (i.e. between two independent processes). Unfortunately, the default hashing (which hashes a pointer) therefore doesn’t work for us.

So far, the best way to achieve this - without rolling out custom dispatch - seems to be “print to string → hash the string” but this is rather slow in general and is extremely slow in edge cases (e.g. when dealing with quantized types that store arrays of scales).
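For reference, the “print to string → hash the string” approach can be sketched roughly as below. This is a minimal standalone illustration, not MLIR code: `printTypeForHashing` is a hypothetical stand-in for printing an mlir::Type into a string, and FNV-1a is just one choice of a hash with fixed constants. The cost is linear in the printed length, which is why types that print long arrays of scales are slow.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// 64-bit FNV-1a. The constants are fixed, so the result depends only on the
// input bytes, never on pointer values or a per-process seed.
inline std::uint64_t stableStringHash(const std::string &text) {
  std::uint64_t hash = 0xcbf29ce484222325ULL; // FNV offset basis
  for (char c : text) {
    hash ^= static_cast<unsigned char>(c);
    hash *= 0x100000001b3ULL; // FNV prime
  }
  return hash;
}

// Hypothetical stand-in for "print the type to a string". In a real setup
// the text would come from printing the type into a string stream.
inline std::string printTypeForHashing(unsigned bitWidth) {
  return "i" + std::to_string(bitWidth);
}

inline void exampleUsage() {
  // The same printed text always yields the same value, in any process.
  assert(stableStringHash(printTypeForHashing(32)) == stableStringHash("i32"));
}
```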

Interestingly, MLIR itself solves this problem, as types are uniqued. Yet this is hidden behind the templatized type uniquer / type storage facility. From a brief look, in order to get the type storage hash, one has to dispatch statically [1], and this requires a pretty much hand-rolled TypeSwitch or something similar.

Is there any good facility that allows accessing the “real hash” of a type (*TypeStorage::hashKey()) via some type-erasure / dynamic dispatch? (Perhaps I am missing something.)
Alternatively, does it make sense to expose the existing infrastructure in some way to allow easy access to a “stable” hash? As for the API, I envision something conceptually similar to:

mlir::Type someType = getSomeTypeFromSomewhere(...);
someType.getStableHash();

[1]: To add here, the problem doesn’t really exist when constructing a type, because at that point the real (“most-derived” if I may) C++ type is used.

It would be a useful feature to be able to produce a stable hash for the in-memory IR without printing it. It’s been requested a few times, but we always fall back to printing instead (bytecode is more comprehensive than the text, by the way).

However, the internal hash is not deterministic across process executions, so I don’t think exposing it through an API would help.

See this snippet:

/// In LLVM_ENABLE_ABI_BREAKING_CHECKS builds, the seed is non-deterministic
/// per process (address of a function in LLVMSupport) to prevent having users
/// depend on the particular hash values. On platforms without ASLR, this is
/// still likely non-deterministic per build.
inline uint64_t get_execution_seed() {
#if LLVM_ENABLE_ABI_BREAKING_CHECKS
  return static_cast<uint64_t>(
      reinterpret_cast<uintptr_t>(&install_fatal_error_handler));
#else
  return 0xff51afd7ed558ccdULL;
#endif
}
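To make the consequence concrete, here is a minimal standalone sketch of how a seed feeds into every hash value. `seededHash`, `fixedSeed`, `addressSeed` and `seedAnchor` are hypothetical names for illustration; the mixer is a splitmix64-style finalizer, not LLVM's actual hashing code. With an address-based seed, the same input may hash differently in each process when ASLR is active.

```cpp
#include <cstdint>

// Hypothetical seeded mixer (splitmix64-style finalizer, which is a
// bijection on 64-bit values). It is NOT llvm::hash_combine.
inline std::uint64_t seededHash(std::uint64_t value, std::uint64_t seed) {
  std::uint64_t h = value ^ seed;
  h ^= h >> 30;
  h *= 0xbf58476d1ce4e5b9ULL;
  h ^= h >> 27;
  h *= 0x94d049bb133111ebULL;
  h ^= h >> 31;
  return h;
}

// Fixed seed: same constant as in the snippet above, identical in every run.
inline std::uint64_t fixedSeed() { return 0xff51afd7ed558ccdULL; }

// Address-based seed: mimics the idea of using a function address, which
// can change from process to process under ASLR.
inline void seedAnchor() {}
inline std::uint64_t addressSeed() {
  return static_cast<std::uint64_t>(
      reinterpret_cast<std::uintptr_t>(&seedAnchor));
}
```

Hashing with `fixedSeed()` is reproducible across runs, while hashing with `addressSeed()` is only reproducible within one process.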

I haven’t thought of the byte-code, thanks! It might be worth checking how it performs in comparison.

Any specific reason that this is not implemented yet (apart from the lack of engineering time to spend on this perhaps) i.e. blockers / concerns / etc.? I guess the general rationale applies also to attributes (they may benefit from this also).

Thanks for the heads-up. We disable this via a CMake option (fingers crossed it’s not removed in the future).

Just lack of engineering time: especially when there is a working solution (hashing bytecode/text).

That’ll work for you, but unfortunately it makes it a deal-breaker to start building core APIs on top of it.

I see. Ok, I think we’ll discuss this within our project; if there’s enough interest, we might come up with an RFC and patches.

Could you clarify? (Not sure what “start building core APIs” means.) We have the luxury (not really) of maintaining our own fork with custom patches. I’m curious to see what problems we may get if we just blatantly disable all ABI breaking checks.

I meant we can’t offer a llvm::hash_value mlir::stableHash(mlir::Operation *op); API implemented on something that only works with a very specific CMake option.

Ah, right, somehow I missed this. I guess, in order to have the core API, there’s a larger open question of defining the stable hashing procedure given a non-stable hash seed, which sounds like a huge pain on its own… One pretty much needs to maintain a clean separation between “hashing with a random seed” and “stable hashing” somehow. Or have an MLIR-specific setup within LLVM to disable the non-deterministic seed?

From what I can see, the patch ([Hashing] Use a non-deterministic seed if LLVM_ENABLE_ABI_BREAKING_CH… · llvm/llvm-project@ce80c80 · GitHub) is a debug-aiding thing?:

LLVM_ENABLE_ABI_BREAKING_CHECKS defaults to WITH_ASSERTS and is
enabled in an assertion build.

In a non-assertion build, get_execution_seed returns the fixed value
regardless of NDEBUG. Removing a variable load yields noticeable
size/performance improvement.

So perhaps there’s a way to be “special” in MLIR but this is anyhow a wider discussion within LLVM already (so out of scope for this one). Anyhow, thanks for the prompt feedback!

An afterthought: in theory, if there is an API that just allows accessing the “internal hashing” procedure (the {attribute,type}Storage based hashing used during uniqued creation / lookup), we may avoid changing LLVM’s current policy. While still useless without the build option trickery, at least it is less intrusive so might be a way out in case changing the seed at LLVM level would be challenging (e.g. no consensus reached).

While still useless without the build option trickery, at least it is less intrusive so might be a way out

I don’t quite follow this sentence. However I read it, it seems to point to “useless”, so how is it “a way out”?

I’m thinking of something like the following:

  • Expose “*TypeStorage::getHash” for types, “*AttributeStorage::getHash” for attributes
    • On their own, they just return the same hash that’s used inside of MLIR to guarantee uniqueness of types / attributes and it is not stable (across re-compilations) due to underlying LLVM infrastructure
  • Downstream projects (in a broad sense) could switch off the randomized seed in LLVM
  • Two pieces of the puzzle come together: 1) API that provides non-pointer-based hashing (see bullet 1); 2) guaranteed LLVM seed (and thus deterministic hashing) → now one can get a stable (across re-compilations) hash for types / attributes
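The idea in the bullets above could look roughly like the following toy model. Everything here is hypothetical and standalone: `ToyTypeStorage` and its subclasses are not MLIR's storage classes, `mix` is not LLVM's hash combiner, and `kFixedSeed` simply reuses the constant from the seed snippet quoted earlier in the thread. It only shows the two pieces together: a content-based hash reached through dynamic dispatch, plus a seed that is guaranteed to be fixed.

```cpp
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Assumed fixed seed (a downstream build with the random seed switched off).
constexpr std::uint64_t kFixedSeed = 0xff51afd7ed558ccdULL;

// Simple illustrative combiner; not LLVM's hashing code.
inline std::uint64_t mix(std::uint64_t h, std::uint64_t v) {
  return h ^ (v + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2));
}

// Hypothetical type-erased storage base: the hash is computed from the
// storage contents, never from the object's address.
struct ToyTypeStorage {
  virtual ~ToyTypeStorage() = default;
  virtual std::uint64_t hashKey(std::uint64_t seed) const = 0;
};

struct ToyIntegerStorage : ToyTypeStorage {
  unsigned width;
  explicit ToyIntegerStorage(unsigned w) : width(w) {}
  std::uint64_t hashKey(std::uint64_t seed) const override {
    return mix(mix(seed, /*kind tag=*/1), width);
  }
};

struct ToyQuantizedStorage : ToyTypeStorage {
  std::vector<double> scales;
  explicit ToyQuantizedStorage(std::vector<double> s) : scales(std::move(s)) {}
  std::uint64_t hashKey(std::uint64_t seed) const override {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "64-bit double");
    std::uint64_t h = mix(seed, /*kind tag=*/2);
    for (double s : scales) {
      std::uint64_t bits;
      std::memcpy(&bits, &s, sizeof bits); // hash the raw bit pattern
      h = mix(h, bits);
    }
    return h;
  }
};

// Dynamic dispatch replaces the hand-rolled static switch over types.
inline std::uint64_t stableHash(const ToyTypeStorage &storage) {
  return storage.hashKey(kFixedSeed);
}
```

Two distinct objects with the same contents hash to the same value here, and no printing is involved, so an array of scales is hashed directly from its elements.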

P.S.: again, this is mostly a “bad case” scenario if LLVM infrastructure couldn’t be changed for some reason.