Design of hash(::BioSequence). What to do? #243

jakobnissen · 2022-07-05T08:02:03Z

So the design of BioSequences makes it difficult to implement efficient and correct hashing.
We want efficient hashing, because hashing underlies operations like putting stuff in a Set or Dict, which users expect to be fast.

The issue is

Julia requires that isequal(a, b) === isequal(hash(a, x), hash(b, x))
isequal ought to allow objects of different types to be equal if it represents the same value. E.g. isequal(0, 0.0). Breaking this leads to lots of confusion.

So, to follow these two rules

Different subtypes of BioSequence should be isequal if they have the same content, i.e. isequal(dna"TAG", DNAKmer("TAG"))
Which implies they should hash equally,

Now, how do we get two BioSequences with arbitrary encoding to hash equivalently? As I see it, it means we can't hash the encoded data, because the encoded data may vary between subtypes. However, this is presumably the only way to avoid decoding, which is the only way to make hashing fast!

Incidentally, the current implementation of hash for LongSequence is broken:

julia> (a, b) = (LongDNA{4}("A"), LongDNA{2}("A"));

julia> a in [b] # because they are equal
true

julia> a in Set([b]) # hashes wrong
false

Here are some possible solutions:

Just implement hashing by hashing every element. This will make hashing significantly slower (in tests, ~75 times slower), but it will be simple and correct.
Make one encoding privileged, e.g. LongSequence's. When hashing any other type, we re-code it to that encoding before hashing. This will keep LongSequence hashing fast, but will make everything else both slower and much more complex. If we do this, we also need to recode LongSequence{NucleicAcidAlphabet{2}}.

There may be other solutions.

The text was updated successfully, but these errors were encountered:

kescobo · 2022-07-05T13:52:50Z

Incidentally, the current implementation of hash for LongSequence is broken

😬 This is not good. We might want a separate issue to get a bug-fix on that specifically.

Here are some possible solutions:

Regarding this, I have a question - is changing the content of a hash breaking, or only the behavior. That is, as long as we fulfill the hashing contract(s), if we change the underlying behavior to improve speed, is that breaking?

If the answer is no, I propose that we do (1) as a stop-gap to fix the current bug, since it's simple, and then we can work on further optimization separately.

isequal ought to allow objects of different types to be equal if it represents the same value. E.g. isequal(0, 0.0). Breaking this leads to lots of confusion.

Isn't there some nuance about the difference between == and isequal? Do we need to worry about that?

Another (possibly orthogonal) question - do we want a kmer of a sequence to be isequal with a LongDNA of the same sequence? What about an RNA with the same bases? (I'm not certain on the former, I think probably not on the later).

I am interested in this topic because I remain interested in MinHash and the things that it enables (there's a bunch of neat microbiome work on this by Titus Brown), but I'm not super knowledgeable about the computer science of hashing. Happy to be involved though.

jakobnissen · 2022-07-05T14:10:08Z

I'm pretty sure changing the value of a hash is not breaking. That's fine. Yeah, let's do option 1 as a stop-gap.

W.r.t the difference between == and isequal - yep, you're right, that's a different option: We could have different sequence types with semantially identical content not be isequal. isequal controls:

The behaviour of in (okay this is more complex, but it OUGHT to control the behaviour of in)
The behaviour of sets and dicts
The behaviour of sorting

So if we choose that option, it implies that we can have:

julia> DNAKmer{"ATG"} == dna"ATG"
true

julia> DNAKmer("ATG") in [dna"ATG"]
false

which is also weird.

Current hashing behaviour hashes equal biosequences with different encodings to different values, which violate the interface of hash. This is a fix until (and if) we figure out a better method of hashing. See issue BioJulia#243

Current hashing behaviour hashes equal biosequences with different encodings to different values, which violate the interface of hash. This is a fix until (and if) we figure out a better method of hashing. See issue #243

jakobnissen · 2022-07-08T09:32:16Z

Okay here are some before/after benchmarks:

length  before  after
25     20 ns    69 ns
10000   0.80 us 22.04 us

So 3x for short seqs, ~25x for long sequences. Sigh.

To speed it up, the only way I can think of is this:

For generic BioSequence, we can't do much, so just keep behaviour
LongSequence is canonical, reintroduce old behaviour. For nucleic acids, only those with 4-bit alphabets are canonical.
For 2-bit LongSequences, we can do clever bit-tricks to encode their data elements on the fly to 4-bit encodings, 32 bits at a time.
For all other biosequences (currently SubSequence and Kmer), we implement an iterator which streams the content of the sequence in a series of UInt64 which correspond to the encoding of the equivalent LongSequence. E.g. for SubSeq, this is simply the existing encoding, but bitshifted to remove the offset. For Kmer, this involves bitshifting, then reversing the elements within each 64-bit chunk.

It's doable, but quite a bit of work.

kescobo · 2022-07-08T17:10:50Z

And all of this would have to be done at the same time to maintain the contract? That is, we can't deal with point 2 first, then handle point 3 and point 4 later?

jakobnissen · 2022-07-16T13:52:46Z

Unfortunately, all have to be implemented simultaneously

ian-small · 2022-08-06T04:00:01Z

The problem with the suggestion above is that the only BioSequences I can imagine anyone wanting to hash in large numbers (and thus where the performance really matters) are Kmers. Hashing of Kmers needs to be really fast.

jakobnissen · 2022-08-06T06:27:43Z

Excellent point, you are right. We should do the iterator thing above, but prioritise kmers - that should be extremely fast.

TransGirlCodes · 2022-09-03T22:59:43Z

Current hashing behaviour of Kmers is:

# TODO: Ensure this is the right way to go.
# See https://github.com/BioJulia/BioSequences.jl/pull/121#discussion_r475234270
Base.hash(x::Kmer{A,K,N}, h::UInt) where {A,K,N} = hash(x.data, h ⊻ K)

Which so it just hashes the underlying Tuple. which is implemented like so:

hash(t::Tuple, h::UInt) = hash(t[1], hash(tail(t), h))
function hash(t::Any32, h::UInt)
    out = h + tuplehash_seed
    for i = length(t):-1:1
        out = hash(t[i], out)
    end
    return out
end

TransGirlCodes · 2022-09-03T23:28:45Z

Also maybe the descisions we make here should be added to the docs on the definition of a BioSequence, as it's really important we maintain this contract if people are going to write generic code and mix packages.

TransGirlCodes · 2022-09-03T23:40:27Z

Trying to think of why this hasnt been a big problem for me already in practical setting, and the only answer I can think of is when use a sequence type with a given alphabet I use it for my entire scripts / application, and given Set and Dict are parametric on the sequence type, any hashing has all been with one consistent type, and so one partiular hashing scheme.

kescobo · 2022-09-14T16:23:32Z

From discussion on uBioinfo Slack and @ctb:

yes, you were active on this issue! marbl/Mash#27
it’s pretty easy if you have murmurhash! sourmash and mash use the same approach (mmh, seed=42) and just select the hashes differently for FracMinHash (sourmash) and MinHash (mash and sourmash),
you can see an implementation of the key hashing code here https://github.com/sourmash-bio/sourmash/blob/latest/utils/compute-dna-mh-another-way.py
and I’ll start a new sourmash issue to track this discussion, too.

kescobo · 2022-09-14T16:54:30Z

On further consideration and re-reading the previous issue, this isn't super relevant for this issue - I don't think we want to pull in murmurhash as a dependency, and to be fully compatible we'd need to hash the ASCII, which doesn't make sense for Base.hash. It should be easy to put the compatibility functionality into a separate package.

jakobnissen · 2022-09-14T19:22:30Z

BioSequences already implement murmur3 in the hash.jl file :)

TransGirlCodes · 2022-09-17T00:02:11Z

So I've been trying to think about this for kmers, and it looks like on the fly translation of bit patterns seems sensible, and it makes sense in that promtion rules of two different alphabets - a two bit and a four bit one favour the alphabet that covers the larger set of symbols.

I'm struggling to come up with an elegant set of bit flips to do the 2bit -> 4 bit translation. The best I have so far is these two ideas:

const fourbitnucs = (UInt64(1), UInt64(2), UInt64(4), UInt64(8))

function translate_bits(bits::UInt64, from::DNAAlphabet{2}, to::DNAAlphabet{4})
    bits = bits
    tbits = zero(UInt64)
    @inbounds for i in 0:31
        element = (bits >> (i * 2)) & UInt64(3)
        newelement = foutbitnucs[element + one(UInt64)]
        tbits = (tbits << (i * 4)) | newelement
    end
end

function translate_bits2(bits::UInt64, from::DNAAlphabet{2}, to::DNAAlphabet{4})
    bits = bits
    tbits = zero(UInt64)
    @inbounds for i in 0:31
        element = bits & UInt64(3)
        newelement = fourbitnucs[element + one(UInt64)]
        tbits = (tbits << 4) | newelement
        bits = bits >> 2
    end
    return tbits
end

I'm not a fan of the loop though, I was hoping for a clever combination of shifts and masks as in the bit-parallel counting code.

EDIT: These functions are also wrong, as two UInt64's should be output for the single input UInt64, but you get where my thought processes were going with the translation.

EDIT EDIT:
There is a 3rd option using a lookup table that accepts a UInt8 of 2-bit packed elements, and returns a UIn16 of those elements packed into 4 bits. This might be thebest option as indexing can be cheap, and by using whole bytes we can unpack 4 2-bit elements at once rather than how the functions above do one at a time, whilst keeping the lookup table fairly small.

kescobo · 2022-09-19T13:34:38Z

A lookup table also has the advantage that plebs like me can understand it 😆 (bit flipping still seems like black magic to me)

jakobnissen · 2022-10-03T08:57:43Z

Here are some relevant bitpacking code:

@inline function four_to_two_bit(x::UInt64)::UInt32
    m1 = 0x1111111111111111
    m2 = m1 << 1
    m3 = m1 | m2
    y  = (x >> 3) & m1
    y |= (x >> 2) & m2
    y |= (x >> 1) & m3
    pack(y)
end

@inline function pack(x::UInt64)::UInt32
    m1 = 0x0f0f0f0f0f0f0f0f
    m2 = 0x00ff00ff00ff00ff
    m3 = 0x0000ffff0000ffff
    m4 = 0x00000000ffffffff
    x = (x & m1) | (x & ~m1) >> 2
    x = (x & m2) | (x & ~m2) >> 4
    x = (x & m3) | (x & ~m3) >> 8
    x = (x & m4) | (x & ~m4) >> 16
    x % UInt32
end

@inline function two_to_four_bits(x::UInt32)::UInt64
    m = 0x1111111111111111
    y = expand(x)
    z = (y & m) << 1 | (m & ~y)
    m2 = y & (m << 1)
    m2 = m2 | m2 >> 1
    (z & m2) << 2 | (z & ~m2)
end

@inline function expand(x::UInt32)::UInt64
    m1 = 0x000000000000ffff
    m2 = 0x000000ff000000ff
    m3 = 0x000f000f000f000f
    m4 = 0x0303030303030303
    y = x % UInt64
    y = (y & m1) | (y & ~m1) << 16
    y = (y & m2) | (y & ~m2) << 8
    y = (y & m3) | (y & ~m3) << 4
        (y & m4) | (y & ~m4) << 2
end

kescobo · 2022-10-03T18:36:50Z

Like I said... 🧙 🪄

jakobnissen · 2023-09-25T19:07:19Z

I'm thinking the only solution is to just define isequal(::Kmer, ::BioSequence) = false and then let kmers do its own hashing thing.

jakobnissen · 2024-10-21T07:44:28Z

For the next breaking change, we might want to reconsider what equality means for biosequences:

In the light of the need for fast hashing
In the light of RNA / DNA similarity (i.e. is DNA_A equal to RNA_A? What about RNA_U and DNA_T)?

Currently I lean towards erroring on == between BioSequences, unless explicitly implemented.

jakobnissen mentioned this issue Jul 5, 2022

Optimise hashing BioJulia/BioSymbols.jl#51

Merged

jakobnissen mentioned this issue Jul 5, 2022

Use naive hash fallback for BioSequence #244

Merged

jakobnissen mentioned this issue Jul 8, 2022

CenteredSequence{A} type for comparing short reads padding with AA_Gap #247

Open

ctb mentioned this issue Sep 15, 2022

documentation and support for other implementations of MinHash and FracMinHash sourmash-bio/sourmash#2284

Open

5 tasks

TransGirlCodes mentioned this issue Sep 18, 2022

Fix hashing BioJulia/Kmers.jl#28

Closed

kescobo mentioned this issue Jan 3, 2024

Breaking changes for v1 BioJulia/Kmers.jl#35

Merged

9 tasks

jakobnissen added the breaking label Oct 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Design of hash(::BioSequence). What to do? #243

Design of hash(::BioSequence). What to do? #243

jakobnissen commented Jul 5, 2022 •

edited

Loading

kescobo commented Jul 5, 2022

jakobnissen commented Jul 5, 2022

jakobnissen commented Jul 8, 2022

kescobo commented Jul 8, 2022

jakobnissen commented Jul 16, 2022

ian-small commented Aug 6, 2022

jakobnissen commented Aug 6, 2022

TransGirlCodes commented Sep 3, 2022 •

edited

Loading

TransGirlCodes commented Sep 3, 2022

TransGirlCodes commented Sep 3, 2022

kescobo commented Sep 14, 2022

kescobo commented Sep 14, 2022

jakobnissen commented Sep 14, 2022

TransGirlCodes commented Sep 17, 2022 •

edited

Loading

kescobo commented Sep 19, 2022

jakobnissen commented Oct 3, 2022

kescobo commented Oct 3, 2022

jakobnissen commented Sep 25, 2023

jakobnissen commented Oct 21, 2024

Design of hash(::BioSequence). What to do? #243

Design of hash(::BioSequence). What to do? #243

Comments

jakobnissen commented Jul 5, 2022 • edited Loading

kescobo commented Jul 5, 2022

jakobnissen commented Jul 5, 2022

jakobnissen commented Jul 8, 2022

kescobo commented Jul 8, 2022

jakobnissen commented Jul 16, 2022

ian-small commented Aug 6, 2022

jakobnissen commented Aug 6, 2022

TransGirlCodes commented Sep 3, 2022 • edited Loading

TransGirlCodes commented Sep 3, 2022

TransGirlCodes commented Sep 3, 2022

kescobo commented Sep 14, 2022

kescobo commented Sep 14, 2022

jakobnissen commented Sep 14, 2022

TransGirlCodes commented Sep 17, 2022 • edited Loading

kescobo commented Sep 19, 2022

jakobnissen commented Oct 3, 2022

kescobo commented Oct 3, 2022

jakobnissen commented Sep 25, 2023

jakobnissen commented Oct 21, 2024

jakobnissen commented Jul 5, 2022 •

edited

Loading

TransGirlCodes commented Sep 3, 2022 •

edited

Loading

TransGirlCodes commented Sep 17, 2022 •

edited

Loading