[WIP/RFC] broaden the scope of `hash` #37964

rfourquet · 2020-10-09T16:41:44Z

Currently hash(x, h::UInt)::UInt is used for hashing values (x here) to be
used as keys in a hash-container. The h::UInt parameter is used to combine
hashes of subobjects in order to compute the hash of a parent object.

This PR proposes extending the scope of h by making its generic signature
hash(x, h::H)::H, where H is any type, with accompanying methods:

hashinit(::Type{H}) = H() (e.g. hashinit(UInt) == UInt(0))
hashdigest(h::H)::Integer gives back the result of the hashing process
when it's finished (e.g. hashdigest(h::UInt) = h)

It doesn't make writing hash methods much more complicated: basically, just don't consider h to be an integer, i.e. you can't directly xor its value; this means replacing things like h = hash(x, xor(h, 0x1234567)) by ...; h = hash(x, hash(0x1234567, h)); ....
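A minimal sketch of that style, runnable on stock Julia. To avoid touching Base.hash, it uses a stand-in function ghash; the names ghash, Wrapped and Point are hypothetical, and hashinit and hashdigest are local definitions that mimic the methods proposed here, not existing Base functions.

```julia
# Local stand-ins for the methods proposed in this PR (not part of Base).
hashinit(::Type{UInt}) = UInt(0)
hashdigest(h::UInt) = h

# A second, non-integer hash state, to show that `h` need not be a UInt.
struct Wrapped
    h::UInt
end
hashinit(::Type{Wrapped}) = Wrapped(UInt(0))
hashdigest(w::Wrapped) = w.h

# Leaf methods: how each state type absorbs an integer.
ghash(x::Integer, h::UInt) = hash(x, h)
ghash(x::Integer, w::Wrapped) = Wrapped(hash(x, w.h))

struct Point
    x::Int
    y::Int
end

# Written once for any state type: `h` is opaque, so the type tag is
# mixed in through `ghash` rather than through `xor`.
function ghash(p::Point, h)
    h = ghash(0x1234567, h)
    h = ghash(p.x, h)
    ghash(p.y, h)
end

p = Point(1, 2)
d_uint    = hashdigest(ghash(p, hashinit(UInt)))
d_wrapped = hashdigest(ghash(p, hashinit(Wrapped)))
```

Both states run through the same Point method, and since Wrapped only forwards to the UInt leaf method, the two digests agree.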

Motivation 1: cryptography in non-container context

Once in a while, I would like to get a cryptographic hash of some objects, to
compare them, uniquify them or whatnot, without having to compare them all
against each other. An ad-hoc cryptohash function could be written, but it would be
great to be able to re-use the pre-existing implementations (although not all
of them are compatible with cryptographic requirements, far from it).

One concrete example is testing the reproducibility of an RNG. Instead of
having in your tests @test rand(MyRng(0), 100) == [... long list of numbers ...] for different seeds, you could do

@test hashdigest(hash(rand(MyRng(0), 1000), SHA1Hash())) == 0xfe9160330ac0a5c265517b6831a92414c6ec889f

Here is a little implementation of this idea (of course this would live in a package),
working atop this PR.

using SHA
import Base: hash
using BitIntegers

BitIntegers.@define_integers 160 SignedSHA1Sum SHA1Sum

struct SHA1Hash
    buf::Vector{UInt8}
    ctx::SHA1_CTX
end

SHA1Hash() = SHA1Hash(zeros(UInt8, 0), SHA1_CTX())

function Base.hash_uint(x::Base.BitInteger, s::SHA1Hash)
    resize!(s.buf, sizeof(x))
    reinterpret(typeof(x), s.buf)[1] = x
    SHA.update!(s.ctx, s.buf)
    s
end

Base.hashdigest(s::SHA1Hash) = reinterpret(SHA1Sum, SHA.digest!(s.ctx))[1]

Base.show(io::IO, s::SHA1Hash) = show(io, Base.hashdigest(SHA1Hash(UInt8[], copy(s.ctx))))

# can't use default method for vectors, as it doesn't read all elements
function Base.hash(a::AbstractArray, s::SHA1Hash)
    s = hash(0x85285f3d4c50ecee82ec4864a4e2a1e3, s)
    for x in a
        s = hash(x, s)
        s = hash(0x85285f3d4c50ecee82ec4864a4e2a1e3, s)
    end
    s
end

@test Base.hashdigest(hash([1, 2, 3, 4], SHA1Hash())) == 0x17a91ea956706f0383e4450a5359dc1ffa49485d
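For comparison, a rough equivalent of the RNG reproducibility check that runs on stock Julia without this PR: it feeds the raw bytes of the sample to the SHA standard library. The helper names digest_hex and sample are hypothetical, and MersenneTwister stands in for MyRng. The digest depends on the platform's byte order, and this only works for arrays of plain bits types, which is the limitation the generic hash(x, h::H) approach is meant to lift.

```julia
using SHA, Random

# SHA-1 of the raw memory of a vector of primitive integers, as a hex string.
digest_hex(v::Vector{<:Base.BitInteger}) =
    bytes2hex(sha1(collect(reinterpret(UInt8, v))))

# 1000 draws from a freshly seeded generator.
sample(seed) = rand(MersenneTwister(seed), UInt64, 1000)

a = digest_hex(sample(0))
b = digest_hex(sample(0))   # same seed, so the same stream is expected
c = digest_hex(sample(1))   # different seed
```

In a test suite, the value of a would be recorded once and compared against on later runs.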

Motivation 2: provide alternate hash/isequal couples to be used by `Dict`-like containers

I can think of three examples:

1) use faster hashing in some contexts; for example, in Dict, isequal(x, y) must imply hash(x) == hash(y),
and as all our numbers hash the same if they represent the same value, hash has to go out of its way
to enforce this invariant. This can have some performance implications.
2) use a different meaning of equality in Dict, e.g. === (would behave like
IdDict), or (x, y) -> typeof(x) == typeof(y) && isequal(x, y).
3) related to 2), use a work-around hash when the default hash fails, e.g.

hash(1:Int128(2)^65)
ERROR: InexactError: trunc(Int64, 36893488147419103232)
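As a rough illustration of the type-sensitive equality mentioned above, here is how it can be approximated on stock Julia today with a wrapper key type. TypedKey is a hypothetical name, and this is a sketch of the workaround, not of the mechanism in this PR. It also shows the cost: every key has to be wrapped.

```julia
# Wrapper whose equality also requires the wrapped types to match.
struct TypedKey{T}
    x::T
end

Base.isequal(a::TypedKey, b::TypedKey) =
    typeof(a.x) == typeof(b.x) && isequal(a.x, b.x)

# Keep hash consistent with isequal by mixing the type into the hash.
Base.hash(k::TypedKey, h::UInt) = hash(k.x, hash(typeof(k.x), h))

plain = Dict(1 => "int", 1.0 => "float")                      # isequal(1, 1.0) holds
typed = Dict(TypedKey(1) => "int", TypedKey(1.0) => "float")  # keys stay distinct
```

With default hashing, 1 and 1.0 collapse into a single entry, while the wrapped keys give two.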

As a proof of concept, this PR includes a small modification of Dict to
accommodate this: Dict{K,V} is replaced by HashDict{K,V,H} where H is a type of hash,
and Dict{K,V} is an alias for HashDict{K,V,UInt} (given how small this patch is,
I would find value in having this in Base rather than in a package, which would involve a
lot of duplication; but if this is taken seriously, this should move into another PR...)

To illustrate 1), we define a Context{T} hash type which can hash only
subtypes of T. We define corresponding hash for BigInt, UnitRange and Set
to show how this can compose.

import Base: hash, HashDict

struct Context{T}
    h::UInt
end

Context{T}() where {T} = Context{T}(UInt(0))
Base.hashdigest(s::Context) = s.h

hash(x::T, h::Context{T}) where {T<:Integer} = Context{T}(Base.hash_integer(x, h.h))

# ambiguities
hash(x::Int64, h::Context{T}) where {Int64<:T<:Integer} = Context{T}(Base.hash_integer(x, h.h))
hash(x::UInt64, h::Context{T}) where {Int64<:T<:Integer} = Context{T}(Base.hash_integer(x, h.h))

hash(x::U, h::Context{U}) where {T,U<:AbstractUnitRange{T}} =
    Context{U}(hash(last(x), hash(first(x), Context{T}(0x7222d2d5c0db9f56 ⊻ h.h))).h)

function hash(s::U, h::Context{U}) where {T,U<:AbstractSet{T}}
    hv = Base.hashs_seed
    for x in s
        hv ⊻= hash(x, Context{T}()).h
    end
    Context{U}(hash(hv, h.h))
end

Demo:

julia> n = big(2)^100; d = Dict(n => 1)
Dict{BigInt, Int64} with 1 entry:
  1267650600228229401496703205376 => 1

julia> @btime $d[$n];
  42.216 ns (0 allocations: 0 bytes)

julia> d = HashDict{BigInt, Int, Context{Integer}}(n => 1)
HashDict{BigInt, Int64, Context{Integer}} with 1 entry:
  1267650600228229401496703205376 => 1

julia> @btime $d[$n];
  19.701 ns (0 allocations: 0 bytes)

julia> d[1] # 1 isa Integer, `Int` keys can be recognized by `d`
ERROR: KeyError: key 1 not found

julia> d[1.0] # but not so for floats
ERROR: MethodError: no method matching hash_uint(::UInt64, ::Context{Integer})
[...]

julia> r = 1:big(2)^60; d = Dict(r=>1)
Dict{UnitRange{BigInt}, Int64} with 1 entry:
  1:1152921504606846976 => 1

julia> @btime $d[$r] = 1;
  372.277 ms (5552083 allocations: 105.90 MiB)

julia> d = HashDict{typeof(r), Int, Context{typeof(r)}}(r => 1)
HashDict{UnitRange{BigInt}, Int64, Context{UnitRange{BigInt}}} with 1 entry:
  1:1152921504606846976 => 1

julia> @btime $d[$r] = 1;
  24.537 ns (0 allocations: 0 bytes)
  
julia> s = Set([r]); d = Dict(s => 1); @btime $d[$s];
  374.994 ms (5552083 allocations: 105.90 MiB)

julia> d = HashDict{typeof(s), Int, Context{typeof(s)}}(s => 1); @btime $d[$s];
  41.511 ns (0 allocations: 0 bytes)

As a last example, here is how to implement an IdDict clone via HashDict:

struct IdHash
    h::UInt
end

IdHash() = IdHash(UInt(0))
Base.hashdigest(h::IdHash) = h.h

hash(x, ::IdHash) = IdHash(objectid(x))
hash(x::Int, ::IdHash) = IdHash(Base.hash_uint(x % UInt))
hash(x::UInt, ::IdHash) = IdHash(Base.hash_uint(x))
hash(x::Float64, ::IdHash) = IdHash(objectid(x))

Base.isequal(::Type{IdHash}, x, y) = x === y

# Demo
julia> d = HashDict{Any,Any,IdHash}(1 => 1, 1.0 => 2)
HashDict{Any, Any, IdHash} with 2 entries:
  1.0 => 2
  1   => 1

julia> @btime $d[1];
  11.786 ns (0 allocations: 0 bytes)

julia> f = IdDict(1 => 1, 1.0 => 2);

julia> @btime $f[1];
  20.101

JeffBezanson · 2020-10-09T20:29:51Z

> replacing things like h = hash(x, xor(h, 0x1234567)) by ...; h = hash(x, hash(0x1234567, h)); ...

I think we should have a dedicated mixing function for this. Am I right that mixing hash values can typically be much faster than computing hash values? E.g. in our case the mixing function would not need to know about isequal.

rfourquet · 2020-10-10T10:00:40Z

> I think we should have a dedicated mixing function for this. Am I right that mixing hash values can typically be much faster than computing hash values?

Yes, good catch. For example, for a random h::UInt and a fixed random mixing value a::UInt, we get

julia> @btime hash($a, $(Ref(h))[])
  4.299 ns (0 allocations: 0 bytes)
0x04b5a55ce4d88c58

julia> @btime $a ⊻ $(Ref(h))[]
  1.749 ns (0 allocations: 0 bytes)
0x5c3b8e8960ef988b

There is an example in the PR where I kept the two versions (the hx function in "float.jl"), but having a hashmix(a, h) which defaults to xor for integers would be more general.
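A minimal sketch of what such a hashmix could look like. The function does not exist in Base; the fallback method and the Celsius example type are assumptions made for illustration.

```julia
# Cheap mixing for plain integer hash states: a single xor.
hashmix(a::Integer, h::UInt) = (a % UInt) ⊻ h

# Assumed fallback for other hash states: go through the full `hash`.
hashmix(a, h) = hash(a, h)

# A hash method that mixes in a type tag via `hashmix`
# instead of calling `xor(h, 0x1234567)` directly.
struct Celsius
    deg::Float64
end
Base.hash(c::Celsius, h::UInt) = hash(c.deg, hashmix(0x1234567, h))

m = hashmix(0x01, UInt(2))
```

For a UInt state, this reproduces the xor-based method exactly, while a non-integer state type would either use the fallback or provide its own cheaper method.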

tisztamo · 2020-10-28T11:04:23Z

I would be happy to see this merged. My motivation: https://discourse.julialang.org/t/dictionary-with-custom-hash-function/49168

StefanKarpinski · 2020-11-16T19:50:35Z

> but having a hashmix(a, h) which defaults to xor for integers would be more general.

You generally want to do something asymmetrical in the arguments. Something like a - 9b or a ⊻ 9b might be a better mixing function: they're asymmetrical in the two arguments, they cascade bits a bit better than just an xor, and they're still very cheap since it's just an lea instruction and a simple binary operation. If we wanted to spread the bits from both arguments around a bit more, then something that does two lea instructions and a binary mixing operation might be better; 9a ⊻ 5b would work and should be efficient.

The criterion for a constant integer multiplication being expressible as an lea instruction seems to be that the constant is either exactly a small power of two (2, 4, 8) or one more than that (3, 5, 9). Multiplying by an even number is lossy mod 2^n, so we want to multiply by one of the odd factors, not one of the powers of two. It seems better to use the higher factors since they spread the bits around more for small integer inputs. It's unclear to me if it's better to use +, - or ⊻ for the final combining step: - has the advantage that for small positive integer inputs it causes a lot of the bits of the output to be set, whereas that's not the case for + and ⊻. People designing RNGs tend to like sticking with a well-understood ring structure and use + and * operations, but that's largely so that they can prove things about RNG periods, which we don't care about here; ⊻, on the other hand, doesn't have a nice ring structure, but that's kind of good since it makes the output harder to analyze. If I had to guess at a decent cheap mixing function, I would guess that 9a ⊻ 5b might be good.
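The candidates above can be written down directly. The names mix_sub and mix_xor are hypothetical, and the sketch only illustrates the arithmetic properties mentioned (asymmetry, wrap-around, lossy even multipliers); it says nothing about whether the compiler actually emits lea.

```julia
# Candidate mixers on UInt; arithmetic wraps modulo 2^64.
mix_sub(a::UInt, b::UInt) = a - 9b
mix_xor(a::UInt, b::UInt) = 9a ⊻ 5b

# Asymmetry: swapping the arguments changes the result.
x = mix_xor(UInt(1), UInt(2))   # 9 ⊻ 10 == 3
y = mix_xor(UInt(2), UInt(1))   # 18 ⊻ 5 == 23

# Subtraction sets many high bits for small positive inputs.
z = mix_sub(UInt(1), UInt(2))   # 1 - 18 wraps to 0xffffffffffffffef

# Even multipliers lose information mod 2^64; odd ones do not.
top  = UInt(1) << 61
even = 8 * top                  # 2^64 wraps to 0
odd  = 9 * top                  # 2^64 + 2^61 wraps to 2^61
```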

rfourquet added 2 commits October 9, 2020 18:36

broaden the scope of hash

b3e36d3

introduce HashDict, a more general Dict

4553395

rfourquet changed the title ~~broaden the scope of hash~~ [WIP/RFC] broaden the scope of hash Oct 10, 2020

rfourquet mentioned this pull request Nov 6, 2020

make fmpz <: Signed <: Integer Nemocas/Nemo.jl#700

Closed

rfourquet mentioned this pull request Dec 7, 2021

faster hash for specific types #43355

Open

rfourquet added the hashing label Oct 3, 2022
