Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design of hash(::BioSequence). What to do? #243

Open
jakobnissen opened this issue Jul 5, 2022 · 19 comments
Open

Design of hash(::BioSequence). What to do? #243

jakobnissen opened this issue Jul 5, 2022 · 19 comments
Labels

Comments

@jakobnissen
Copy link
Member

jakobnissen commented Jul 5, 2022

So the design of BioSequences makes it difficult to implement efficient and correct hashing.
We want efficient hashing, because hashing underlies operations like putting stuff in a Set or Dict, which users expect to be fast.

The issue is

  • Julia requires that isequal(a, b) === isequal(hash(a, x), hash(b, x))
  • isequal ought to allow objects of different types to be equal if it represents the same value. E.g. isequal(0, 0.0). Breaking this leads to lots of confusion.

So, to follow these two rules

  • Different subtypes of BioSequence should be isequal if they have the same content, i.e. isequal(dna"TAG", DNAKmer("TAG"))
  • Which implies they should hash equally,

Now, how do we get two BioSequences with arbitrary encoding to hash equivalently? As I see it, it means we can't hash the encoded data, because the encoded data may vary between subtypes. However, this is presumably the only way to avoid decoding, which is the only way to make hashing fast!

Incidentally, the current implementation of hash for LongSequence is broken:

julia> (a, b) = (LongDNA{4}("A"), LongDNA{2}("A"));

julia> a in [b] # because they are equal
true

julia> a in Set([b]) # hashes wrong
false

Here are some possible solutions:

  1. Just implement hashing by hashing every element. This will make hashing significantly slower (in tests, ~75 times slower), but it will be simple and correct.
  2. Make one encoding privileged, e.g. LongSequence's. When hashing any other type, we re-code it to that encoding before hashing. This will keep LongSequence hashing fast, but will make everything else both slower and much more complex. If we do this, we also need to recode LongSequence{NucleicAcidAlphabet{2}}.

There may be other solutions.

@kescobo
Copy link
Member

kescobo commented Jul 5, 2022

Incidentally, the current implementation of hash for LongSequence is broken

😬 This is not good. We might want a separate issue to get a bug-fix on that specifically.

Here are some possible solutions:

Regarding this, I have a question - is changing the content of a hash breaking, or only the behavior. That is, as long as we fulfill the hashing contract(s), if we change the underlying behavior to improve speed, is that breaking?

If the answer is no, I propose that we do (1) as a stop-gap to fix the current bug, since it's simple, and then we can work on further optimization separately.

isequal ought to allow objects of different types to be equal if it represents the same value. E.g. isequal(0, 0.0). Breaking this leads to lots of confusion.

Isn't there some nuance about the difference between == and isequal? Do we need to worry about that?

Another (possibly orthogonal) question - do we want a kmer of a sequence to be isequal with a LongDNA of the same sequence? What about an RNA with the same bases? (I'm not certain on the former, I think probably not on the later).


I am interested in this topic because I remain interested in MinHash and the things that it enables (there's a bunch of neat microbiome work on this by Titus Brown), but I'm not super knowledgeable about the computer science of hashing. Happy to be involved though.

@jakobnissen
Copy link
Member Author

I'm pretty sure changing the value of a hash is not breaking. That's fine. Yeah, let's do option 1 as a stop-gap.

W.r.t the difference between == and isequal - yep, you're right, that's a different option: We could have different sequence types with semantially identical content not be isequal. isequal controls:

  • The behaviour of in (okay this is more complex, but it OUGHT to control the behaviour of in)
  • The behaviour of sets and dicts
  • The behaviour of sorting

So if we choose that option, it implies that we can have:

julia> DNAKmer{"ATG"} == dna"ATG"
true

julia> DNAKmer("ATG") in [dna"ATG"]
false

which is also weird.

jakobnissen added a commit to jakobnissen/BioSequences.jl that referenced this issue Jul 5, 2022
Current hashing behaviour hashes equal biosequences with different encodings to
different values, which violate the interface of hash.
This is a fix until (and if) we figure out a better method of hashing.

See issue BioJulia#243
jakobnissen added a commit that referenced this issue Jul 5, 2022
Current hashing behaviour hashes equal biosequences with different encodings to
different values, which violate the interface of hash.
This is a fix until (and if) we figure out a better method of hashing.

See issue #243
@jakobnissen
Copy link
Member Author

Okay here are some before/after benchmarks:

length  before  after
25     20 ns    69 ns
10000   0.80 us 22.04 us

So 3x for short seqs, ~25x for long sequences. Sigh.

To speed it up, the only way I can think of is this:

  • For generic BioSequence, we can't do much, so just keep behaviour
  • LongSequence is canonical, reintroduce old behaviour. For nucleic acids, only those with 4-bit alphabets are canonical.
  • For 2-bit LongSequences, we can do clever bit-tricks to encode their data elements on the fly to 4-bit encodings, 32 bits at a time.
  • For all other biosequences (currently SubSequence and Kmer), we implement an iterator which streams the content of the sequence in a series of UInt64 which correspond to the encoding of the equivalent LongSequence. E.g. for SubSeq, this is simply the existing encoding, but bitshifted to remove the offset. For Kmer, this involves bitshifting, then reversing the elements within each 64-bit chunk.

It's doable, but quite a bit of work.

@kescobo
Copy link
Member

kescobo commented Jul 8, 2022

And all of this would have to be done at the same time to maintain the contract? That is, we can't deal with point 2 first, then handle point 3 and point 4 later?

@jakobnissen
Copy link
Member Author

Unfortunately, all have to be implemented simultaneously

@ian-small
Copy link

The problem with the suggestion above is that the only BioSequences I can imagine anyone wanting to hash in large numbers (and thus where the performance really matters) are Kmers. Hashing of Kmers needs to be really fast.

@jakobnissen
Copy link
Member Author

Excellent point, you are right. We should do the iterator thing above, but prioritise kmers - that should be extremely fast.

@TransGirlCodes
Copy link
Member

TransGirlCodes commented Sep 3, 2022

Current hashing behaviour of Kmers is:

# TODO: Ensure this is the right way to go.
# See https://github.com/BioJulia/BioSequences.jl/pull/121#discussion_r475234270
Base.hash(x::Kmer{A,K,N}, h::UInt) where {A,K,N} = hash(x.data, h  K)

Which so it just hashes the underlying Tuple. which is implemented like so:

hash(t::Tuple, h::UInt) = hash(t[1], hash(tail(t), h))
function hash(t::Any32, h::UInt)
    out = h + tuplehash_seed
    for i = length(t):-1:1
        out = hash(t[i], out)
    end
    return out
end

@TransGirlCodes
Copy link
Member

Also maybe the descisions we make here should be added to the docs on the definition of a BioSequence, as it's really important we maintain this contract if people are going to write generic code and mix packages.

@TransGirlCodes
Copy link
Member

Trying to think of why this hasnt been a big problem for me already in practical setting, and the only answer I can think of is when use a sequence type with a given alphabet I use it for my entire scripts / application, and given Set and Dict are parametric on the sequence type, any hashing has all been with one consistent type, and so one partiular hashing scheme.

@kescobo
Copy link
Member

kescobo commented Sep 14, 2022

From discussion on uBioinfo Slack and @ctb:

yes, you were active on this issue! marbl/Mash#27
it’s pretty easy if you have murmurhash! sourmash and mash use the same approach (mmh, seed=42) and just select the hashes differently for FracMinHash (sourmash) and MinHash (mash and sourmash),
you can see an implementation of the key hashing code here https://github.com/sourmash-bio/sourmash/blob/latest/utils/compute-dna-mh-another-way.py
and I’ll start a new sourmash issue to track this discussion, too.

@kescobo
Copy link
Member

kescobo commented Sep 14, 2022

On further consideration and re-reading the previous issue, this isn't super relevant for this issue - I don't think we want to pull in murmurhash as a dependency, and to be fully compatible we'd need to hash the ASCII, which doesn't make sense for Base.hash. It should be easy to put the compatibility functionality into a separate package.

@jakobnissen
Copy link
Member Author

BioSequences already implement murmur3 in the hash.jl file :)

@TransGirlCodes
Copy link
Member

TransGirlCodes commented Sep 17, 2022

So I've been trying to think about this for kmers, and it looks like on the fly translation of bit patterns seems sensible, and it makes sense in that promtion rules of two different alphabets - a two bit and a four bit one favour the alphabet that covers the larger set of symbols.

I'm struggling to come up with an elegant set of bit flips to do the 2bit -> 4 bit translation. The best I have so far is these two ideas:

const fourbitnucs = (UInt64(1), UInt64(2), UInt64(4), UInt64(8))

function translate_bits(bits::UInt64, from::DNAAlphabet{2}, to::DNAAlphabet{4})
    bits = bits
    tbits = zero(UInt64)
    @inbounds for i in 0:31
        element = (bits >> (i * 2)) & UInt64(3)
        newelement = foutbitnucs[element + one(UInt64)]
        tbits = (tbits << (i * 4)) | newelement
    end
end

function translate_bits2(bits::UInt64, from::DNAAlphabet{2}, to::DNAAlphabet{4})
    bits = bits
    tbits = zero(UInt64)
    @inbounds for i in 0:31
        element = bits & UInt64(3)
        newelement = fourbitnucs[element + one(UInt64)]
        tbits = (tbits << 4) | newelement
        bits = bits >> 2
    end
    return tbits
end

I'm not a fan of the loop though, I was hoping for a clever combination of shifts and masks as in the bit-parallel counting code.

EDIT: These functions are also wrong, as two UInt64's should be output for the single input UInt64, but you get where my thought processes were going with the translation.

EDIT EDIT:
There is a 3rd option using a lookup table that accepts a UInt8 of 2-bit packed elements, and returns a UIn16 of those elements packed into 4 bits. This might be thebest option as indexing can be cheap, and by using whole bytes we can unpack 4 2-bit elements at once rather than how the functions above do one at a time, whilst keeping the lookup table fairly small.

@kescobo
Copy link
Member

kescobo commented Sep 19, 2022

A lookup table also has the advantage that plebs like me can understand it 😆 (bit flipping still seems like black magic to me)

@jakobnissen
Copy link
Member Author

Here are some relevant bitpacking code:

@inline function four_to_two_bit(x::UInt64)::UInt32
    m1 = 0x1111111111111111
    m2 = m1 << 1
    m3 = m1 | m2
    y  = (x >> 3) & m1
    y |= (x >> 2) & m2
    y |= (x >> 1) & m3
    pack(y)
end

@inline function pack(x::UInt64)::UInt32
    m1 = 0x0f0f0f0f0f0f0f0f
    m2 = 0x00ff00ff00ff00ff
    m3 = 0x0000ffff0000ffff
    m4 = 0x00000000ffffffff
    x = (x & m1) | (x & ~m1) >> 2
    x = (x & m2) | (x & ~m2) >> 4
    x = (x & m3) | (x & ~m3) >> 8
    x = (x & m4) | (x & ~m4) >> 16
    x % UInt32
end

@inline function two_to_four_bits(x::UInt32)::UInt64
    m = 0x1111111111111111
    y = expand(x)
    z = (y & m) << 1 | (m & ~y)
    m2 = y & (m << 1)
    m2 = m2 | m2 >> 1
    (z & m2) << 2 | (z & ~m2)
end

@inline function expand(x::UInt32)::UInt64
    m1 = 0x000000000000ffff
    m2 = 0x000000ff000000ff
    m3 = 0x000f000f000f000f
    m4 = 0x0303030303030303
    y = x % UInt64
    y = (y & m1) | (y & ~m1) << 16
    y = (y & m2) | (y & ~m2) << 8
    y = (y & m3) | (y & ~m3) << 4
        (y & m4) | (y & ~m4) << 2
end

@kescobo
Copy link
Member

kescobo commented Oct 3, 2022

Like I said... 🧙 🪄

@jakobnissen
Copy link
Member Author

I'm thinking the only solution is to just define isequal(::Kmer, ::BioSequence) = false and then let kmers do its own hashing thing.

@jakobnissen
Copy link
Member Author

For the next breaking change, we might want to reconsider what equality means for biosequences:

  • In the light of the need for fast hashing
  • In the light of RNA / DNA similarity (i.e. is DNA_A equal to RNA_A? What about RNA_U and DNA_T)?

Currently I lean towards erroring on == between BioSequences, unless explicitly implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants