[WIP/RFC] broaden the scope of hash #37964


Member

@rfourquet rfourquet commented Oct 9, 2020

Currently hash(x, h::UInt)::UInt is used for hashing values (x here) to be
used as keys in a hash-container. The h::UInt parameter is used to combine
hashes of subobjects in order to compute the hash of a parent object.
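As a reminder of the current protocol, here is a minimal sketch of a two-argument hash method on stock Julia. The `Person` type and its fields are hypothetical, chosen only for illustration; the pattern of threading `h` through each field is the one described above.

```julia
# Hypothetical example type, used only to illustrate the current protocol.
struct Person
    name::String
    age::Int
end

# Dict compares keys with isequal, which falls back to ==, so define it
# consistently with hash.
Base.:(==)(a::Person, b::Person) = a.name == b.name && a.age == b.age

# Thread the running hash h through each field; hashing a type-specific
# symbol first separates Person from other types with the same fields.
Base.hash(p::Person, h::UInt) = hash(p.age, hash(p.name, hash(:Person, h)))

d = Dict(Person("Ada", 36) => "found")
lookup = d[Person("Ada", 36)]   # an equal key, built separately, finds the entry
```

The one-argument `hash(x)` simply calls `hash(x, zero(UInt))`, so only the two-argument method needs to be defined.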

This PR proposes extending the scope of h by making its generic signature
hash(x, h::H)::H, where H is any type, with accompanying methods:

  • hashinit(::Type{H}) = H() (e.g. hashinit(UInt) == UInt(0))
  • hashdigest(h::H)::Integer gives back the result of the hashing process
    when it's finished (e.g. hashdigest(h::UInt) = h)

It doesn't make writing hash methods much more complicated: basically, just don't assume that h is an integer, i.e. you can't directly xor its value; this means replacing things like h = hash(x, xor(h, 0x1234567)) by ...; h = hash(x, hash(0x1234567, h)); ....
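The proposed protocol can be sketched on stock Julia without touching Base. In the sketch below, `xhash`, `xhashinit` and `xhashdigest` are hypothetical stand-ins for the PR's `hash`, `hashinit` and `hashdigest`, and `CountingState` is an invented state type that exists only to show that a method written without xor-ing into `h` works for any state.

```julia
# Hypothetical stand-ins for the PR's hash/hashinit/hashdigest, under new
# names so that nothing in Base is overwritten.
xhashinit(::Type{UInt}) = UInt(0)
xhashdigest(h::UInt) = h
xhash(x::Integer, h::UInt) = hash(x, h)      # leaf case: defer to Base.hash

# A second state type that is not an integer: it counts the leaves it was fed.
struct CountingState
    n::Int
end
xhashinit(::Type{CountingState}) = CountingState(0)
xhashdigest(s::CountingState) = s.n
xhash(x::Integer, s::CountingState) = CountingState(s.n + 1)

struct Point
    x::Int
    y::Int
end

# State-agnostic method: the seed is mixed in with xhash, never xor-ed into h.
xhash(p::Point, h) = xhash(p.y, xhash(p.x, xhash(0x1234567, h)))

p = Point(1, 2)
u = xhashdigest(xhash(p, xhashinit(UInt)))            # ordinary UInt hash
c = xhashdigest(xhash(p, xhashinit(CountingState)))   # number of leaves hashed
```

The single `Point` method serves both states; only the leaf methods need to know what the state actually is.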

Motivation 1: cryptography in non-container context

Once in a while, I would like to get a cryptographic hash of some objects, to
compare them, uniquify them or whatnot, without having to compare them all
against each other. An ad-hoc cryptohash function could be written, but it would be
great to be able to re-use the pre-existing implementations (although not all
of them are compatible with cryptographic requirements, far from it).

One concrete example is testing the reproducibility of an RNG. Instead of
having in your tests @test rand(MyRng(0), 100) == [... long list of numbers ...] for different seeds, you could do

@test hashdigest(hash(rand(MyRng(0), 1000), SHA1Hash())) == 0xfe9160330ac0a5c265517b6831a92414c6ec889f

Here is a little implementation of this idea (of course this would live in a package),
working atop this PR.

using SHA
using Test  # for the @test at the end
import Base: hash
using BitIntegers

BitIntegers.@define_integers 160 SignedSHA1Sum SHA1Sum

struct SHA1Hash
    buf::Vector{UInt8}
    ctx::SHA1_CTX
end

SHA1Hash() = SHA1Hash(zeros(UInt8, 0), SHA1_CTX())

function Base.hash_uint(x::Base.BitInteger, s::SHA1Hash)
    resize!(s.buf, sizeof(x))
    reinterpret(typeof(x), s.buf)[1] = x
    SHA.update!(s.ctx, s.buf)
    s
end

Base.hashdigest(s::SHA1Hash) = reinterpret(SHA1Sum, SHA.digest!(s.ctx))[1]

Base.show(io::IO, s::SHA1Hash) = show(io, Base.hashdigest(SHA1Hash(UInt8[], copy(s.ctx))))

# can't use default method for vectors, as it doesn't read all elements
function Base.hash(a::AbstractArray, s::SHA1Hash)
    s = hash(0x85285f3d4c50ecee82ec4864a4e2a1e3, s)
    for x in a
        s = hash(x, s)
        s = hash(0x85285f3d4c50ecee82ec4864a4e2a1e3, s)
    end
    s
end

@test Base.hashdigest(hash([1, 2, 3, 4], SHA1Hash())) == 0x17a91ea956706f0383e4450a5359dc1ffa49485d
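For comparison, a similar reproducibility check can be approximated on stock Julia, without this PR, by hashing the raw bytes of the array. This is only a sketch: `array_sha1` is a hypothetical helper, it bypasses `hash` methods entirely (so it only works for plain-bits element types), and the digest depends on byte order and on the RNG stream of the Julia version in use.

```julia
using SHA, Random

# Hypothetical helper: hex SHA-1 of the raw memory of a plain-bits vector.
function array_sha1(v::Vector{T}) where {T}
    isbitstype(T) || throw(ArgumentError("element type must be plain bits"))
    bytes2hex(sha1(collect(reinterpret(UInt8, v))))
end

d1 = array_sha1(rand(MersenneTwister(0), 1000))
d2 = array_sha1(rand(MersenneTwister(0), 1000))   # same seed, same stream
d3 = array_sha1(rand(MersenneTwister(1), 1000))   # different seed
```

In a test suite, `d1` would be compared against a digest recorded once, instead of against a long literal array.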

Motivation 2: provide alternate hash/isequal couples to be used by Dict-like containers

I can think of three examples:

  1. use faster hashing in some contexts; for example, in Dict, isequal(x, y) must imply hash(x) == hash(y),
    and as all our numbers hash the same if they represent the same value, hash has to go out of its way
    to enforce this invariant. This can have some performance implications.

  2. use a different meaning of equality in Dict, e.g. === (would behave like
    IdDict), or (x, y) -> typeof(x) == typeof(y) && isequal(x, y).

  3. related to 2), use a work-around hash when the default hash fails, e.g.

hash(1:Int128(2)^65)
ERROR: InexactError: trunc(Int64, 36893488147419103232)
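The invariant behind point 1 can be checked directly on stock Julia. The snippet below is only an illustration of current behaviour; the variable names are arbitrary.

```julia
# Numbers of different types that represent the same value are isequal...
same_small = isequal(1, 1.0)
same_big   = isequal(big(2)^100, 2.0^100)

# ...so their hashes must agree, whatever the cost for BigInt.
hash_small = hash(1) == hash(1.0)
hash_big   = hash(big(2)^100) == hash(2.0^100)

# Consequence for Dict: a Float64 key finds an entry stored under an Int key.
d = Dict(1 => "one")
via_float = d[1.0]

# IdDict uses === instead, so the same lookup misses.
id = IdDict(1 => "one")
id_hit = haskey(id, 1.0)
```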

As a proof of concept, this PR includes a small modification of Dict to
accommodate this: Dict{K,V} is replaced by HashDict{K,V,H} where H is a type of hash,
and Dict{K,V} is an alias for HashDict{K,V,UInt} (given how small this patch is,
I would find value in having this in Base rather than in a package, which would involve a
lot of duplication; but if this is taken seriously, this should move into another PR...)
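For contrast, one way to get a custom hash/isequal pair on stock Julia today is to wrap every key in a type that carries the alternative semantics. The sketch below is not the PR's HashDict; `TypedKey` is a hypothetical wrapper implementing the typeof-aware equality from point 2, and it shows the wrapping overhead that a hash-type parameter would avoid.

```julia
# Hypothetical key wrapper: equal only if both type and value match.
struct TypedKey{T}
    val::T
end

Base.isequal(a::TypedKey, b::TypedKey) =
    typeof(a.val) == typeof(b.val) && isequal(a.val, b.val)
Base.:(==)(a::TypedKey, b::TypedKey) = isequal(a, b)

# Mix the type into the hash so that 1 and 1.0 usually land in different slots.
Base.hash(k::TypedKey, h::UInt) = hash(k.val, hash(typeof(k.val), h))

plain = Dict(1 => "int", 1.0 => "float")                        # keys collide
typed = Dict(TypedKey(1) => "int", TypedKey(1.0) => "float")    # keys stay apart
```

Every insertion and lookup has to wrap its key, e.g. `typed[TypedKey(1)]`, which is the kind of friction a `HashDict{K,V,H}` parameter is meant to remove.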

To illustrate 1), we define a Context{T} hash type which can hash only
subtypes of T. We define corresponding hash methods for integers (such as BigInt),
unit ranges and sets to show how this can compose.

import Base: hash, HashDict

struct Context{T}
    h::UInt
end

Context{T}() where {T} = Context{T}(UInt(0))
Base.hashdigest(s::Context) = s.h

hash(x::T, h::Context{T}) where {T<:Integer} = Context{T}(Base.hash_integer(x, h.h))

# ambiguities
hash(x::Int64, h::Context{T}) where {Int64<:T<:Integer} = Context{T}(Base.hash_integer(x, h.h))
hash(x::UInt64, h::Context{T}) where {UInt64<:T<:Integer} = Context{T}(Base.hash_integer(x, h.h))

hash(x::U, h::Context{U}) where {T,U<:AbstractUnitRange{T}} =
    Context{U}(hash(last(x), hash(first(x), Context{T}(0x7222d2d5c0db9f56 ⊻ h.h))).h)

function hash(s::U, h::Context{U}) where {T,U<:AbstractSet{T}}
    hv = Base.hashs_seed
    for x in s
        hv ⊻= hash(x, Context{T}()).h
    end
    Context{U}(hash(hv, h.h))
end

Demo:

julia> n = big(2)^100; d = Dict(n => 1)
Dict{BigInt, Int64} with 1 entry:
  1267650600228229401496703205376 => 1

julia> @btime $d[$n];
  42.216 ns (0 allocations: 0 bytes)

julia> d = HashDict{BigInt, Int, Context{Integer}}(n => 1)
HashDict{BigInt, Int64, Context{Integer}} with 1 entry:
  1267650600228229401496703205376 => 1

julia> @btime $d[$n];
  19.701 ns (0 allocations: 0 bytes)

julia> d[1] # 1 isa Integer, `Int` keys can be recognized by `d`
ERROR: KeyError: key 1 not found

julia> d[1.0] # but not so for floats
ERROR: MethodError: no method matching hash_uint(::UInt64, ::Context{Integer})
[...]

julia> r = 1:big(2)^60; d = Dict(r=>1)
Dict{UnitRange{BigInt}, Int64} with 1 entry:
  1:1152921504606846976 => 1

julia> @btime $d[$r] = 1;
  372.277 ms (5552083 allocations: 105.90 MiB)

julia> d = HashDict{typeof(r), Int, Context{typeof(r)}}(r => 1)
HashDict{UnitRange{BigInt}, Int64, Context{UnitRange{BigInt}}} with 1 entry:
  1:1152921504606846976 => 1

julia> @btime $d[$r] = 1;
  24.537 ns (0 allocations: 0 bytes)
  
julia> s = Set([r]); d = Dict(s => 1); @btime $d[$s];
  374.994 ms (5552083 allocations: 105.90 MiB)

julia> d = HashDict{typeof(s), Int, Context{typeof(s)}}(s => 1); @btime $d[$s];
  41.511 ns (0 allocations: 0 bytes)

As a last example, here is how to implement an IdDict clone via HashDict:

struct IdHash
    h::UInt
end

IdHash() = IdHash(UInt(0))
Base.hashdigest(h::IdHash) = h.h

hash(x, ::IdHash) = IdHash(objectid(x))
hash(x::Int, ::IdHash) = IdHash(Base.hash_uint(x % UInt))
hash(x::UInt, ::IdHash) = IdHash(Base.hash_uint(x))
hash(x::Float64, ::IdHash) = IdHash(objectid(x))

Base.isequal(::Type{IdHash}, x, y) = x === y

# Demo
julia> d = HashDict{Any,Any,IdHash}(1 => 1, 1.0 => 2)
HashDict{Any, Any, IdHash} with 2 entries:
  1.0 => 2
  1   => 1

julia> @btime $d[1];
  11.786 ns (0 allocations: 0 bytes)

julia> f = IdDict(1 => 1, 1.0 => 2);

julia> @btime $f[1];
  20.101

@JeffBezanson
Member

replacing things like h = hash(x, xor(h, 0x1234567)) by ...; h = hash(x, hash(0x1234567, h)); ...

I think we should have a dedicated mixing function for this. Am I right that mixing hash values can typically be much faster than computing hash values? E.g. in our case the mixing function would not need to know about isequal.

@rfourquet changed the title from "broaden the scope of hash" to "[WIP/RFC] broaden the scope of hash" on Oct 10, 2020
@rfourquet
Member Author

I think we should have a dedicated mixing function for this. Am I right that mixing hash values can typically be much faster than computing hash values?

Yes, good catch. For example, for a random h::UInt and a fixed random mixing value a::UInt, we get

julia> @btime hash($a, $(Ref(h))[])
  4.299 ns (0 allocations: 0 bytes)
0x04b5a55ce4d88c58

julia> @btime $a ⊻ $(Ref(h))[]
  1.749 ns (0 allocations: 0 bytes)
0x5c3b8e8960ef988b

There is an example in the PR where I kept the two versions (the hx function in "float.jl"), but having a hashmix(a, h) which defaults to xor for integers would be more general.

@tisztamo
Contributor

I would be happy to see this merged. My motivation: https://discourse.julialang.org/t/dictionary-with-custom-hash-function/49168

@StefanKarpinski
Member

but having a hashmix(a, h) which defaults to xor for integers would be more general.

You generally want to do something asymmetrical in the arguments. Something like a - 9b or a ⊻ 9b might be a better mixing function: they're asymmetrical in the two arguments, they cascade bits a bit better than just an xor, and they're still very cheap, since each is just an lea instruction plus a simple binary operation. If we wanted to spread the bits from both arguments around a bit more, then something that does two lea instructions and a binary mixing operation might be better; 9a ⊻ 5b would work and should be efficient.

The criterion for a constant integer multiplication being expressible as an lea instruction seems to be that the constant is either exactly a small power of two (2, 4, 8) or one more than that (3, 5, 9). Multiplying by an even number is lossy mod 2^n, so we want to multiply by one of the odd factors, not one of the powers of two. It seems better to use the higher factors since they spread the bits around more for small integer inputs. It's unclear to me if it's better to use +, - or ⊻ for the final combining step: - has the advantage that for small positive integer inputs it causes a lot of the bits of the output to be set, whereas that's not the case for + and ⊻. People designing RNGs tend to like sticking with a well-understood ring structure and use + and * operations, but that's largely so that they can prove things about RNG periods, which we don't care about here; ⊻, on the other hand, doesn't have a nice ring structure, but that's kind of good since it makes the output harder to analyze. If I had to guess at a decent cheap mixing function, I would guess that 9a ⊻ 5b might be good.
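The properties discussed above can be checked with a few lines of Julia. The function names below are hypothetical and the sketch says nothing about hash quality; it only illustrates symmetry, the bit pattern produced by subtraction on small inputs, and why odd multipliers are preferred.

```julia
# Candidate mixers from the discussion; the names are made up for this sketch.
mix_xor(a::UInt64, b::UInt64)  = a ⊻ b        # symmetric baseline
mix_sub9(a::UInt64, b::UInt64) = a - 9b       # wraps around mod 2^64
mix_9x5(a::UInt64, b::UInt64)  = 9a ⊻ 5b

a, b = UInt64(1), UInt64(2)

symmetric   = mix_xor(a, b) == mix_xor(b, a)       # plain xor ignores argument order
forward     = mix_9x5(a, b)                        # 9 ⊻ 10
backward    = mix_9x5(b, a)                        # 18 ⊻ 5
bits_set    = count_ones(mix_sub9(a, b))           # subtraction sets most bits here

# Odd multipliers permute the residues mod 2^8; even ones lose information.
odd_images  = length(unique([0x09 * x for x in 0x00:0xff]))
even_images = length(unique([0x08 * x for x in 0x00:0xff]))
```

With these small inputs, `forward` and `backward` differ while the plain xor does not distinguish the argument order, and multiplying by 8 maps the 256 byte values onto far fewer distinct results than multiplying by 9.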
