Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Easy performance optimizations for MegaMash #56

Closed
CamelCaseCam opened this issue Feb 10, 2024 · 16 comments
Closed

Easy performance optimizations for MegaMash #56

CamelCaseCam opened this issue Feb 10, 2024 · 16 comments
Labels
enhancement New feature or request

Comments

@CamelCaseCam
Copy link

I'm currently working on a CUDA implementation for MegaMash, and as I'm re-implementing it I'm finding ways you could make it more efficient in GO. I'll throw them in this thread as I think of them.

@CamelCaseCam CamelCaseCam added the enhancement New feature or request label Feb 10, 2024
@Koeng101
Copy link
Owner

I was thinking maybe a single map would be most efficient of kmer -> target. That would be the single best improvement, I think.

@CamelCaseCam
Copy link
Author

Idea 1: change how you generate standardized DNA sequences. Considering the numerical value of the whole string is kind of inefficient, but if that's how you want to do it, you should approach this differently.

Before taking the reverse complement of the sequence, loop through the string and add up both the value of the base and the value of the complement to the base at that position. Since addition is commutative, it doesn't matter that you haven't reversed it at this point.

If the non-complement value is lower, return the string as-is (saving two iterations of the string - since you don't have to calculate the reverse complement or sum it). Otherwise, calculate the complement and return it (still saves one iteration).

@CamelCaseCam
Copy link
Author

I forgot to send that earlier - the map idea seems interesting. Why do you have the multiple maps, anyways?

@Koeng101
Copy link
Owner

I forgot to send that earlier - the map idea seems interesting. Why do you have the multiple maps, anyways?

Was easier to implement, didn't know it would be a limiting factor

Since addition is commutative, it doesn't matter that you haven't reversed it at this point.

I would imagine AAAT and AATA would have the same additive effect in this case - or I might be reading this wrong.

@CamelCaseCam
Copy link
Author

CamelCaseCam commented Feb 10, 2024

I would imagine AAAT and AATA would have the same additive effect in this case - or I might be reading this wrong

Yes, but isn't that already the case? (I'm assuming "alphabetically lesser string" sums up the string and returns the one with the overall lesser value)

As an aside, if this is how it does it, can't we just compare the first base instead? The way I'm reading the code, it doesn't actually matter if the string is alphabetically lesser. It just needs to be a single deterministic representation.

@Koeng101
Copy link
Owner

Yes, but isn't that already the case? (I'm assuming "alphabetically lesser string" sums up the string and returns the one with the overall lesser value)

No, that is not the case. It does not sum the string and return the one with the lesser value. It sorts alphabetically the two strings.

@CamelCaseCam
Copy link
Author

Oh, so it already does exactly what I was suggesting. Ignore my suggestion, then. The kmer -> target map seems like a good idea. I'm going to think about an efficient way to implement it in CUDA

@Koeng101
Copy link
Owner

The kmer -> target map seems like a good idea. I'm going to think about an efficient way to implement it in CUDA

Would love to learn about what you come up with! Would be great to have a fast megamash algorithm. Thinking more about the minimizers from minimap2, it might be useful to limit the quantity of possible matches for a given sequence.

@CamelCaseCam
Copy link
Author

One question: if you have say N kmers and you're checking them against a bunch of sequences, do you care about which sequence each kmer is in? If you only care if the kmer was found or not, this would be a lot easier because it'd mean I don't have to care about race conditions

@Koeng101
Copy link
Owner

if you have say N kmers and you're checking them against a bunch of sequences, do you care about which sequence each kmer is in?

Yes, you do. This is because you're trying to link a sequence's kmers to a set of unique kmers of sequences that you're searching upon. I'm curious what would start a race condition though

@CamelCaseCam
Copy link
Author

Based on your code, it looks like you're checking what fraction of the kmers are present in each sequence, right? When you compile code to increment a variable, it doesn't end up being a single instruction. Each thread needs to grab the value of the variable from memory, increment it, and write it back to memory. If multiple threads do this at the same time, they can overwrite it and you'll loose some fraction of the kmers that were actually there.

There's definitely a way I could implement this to avoid the race condition, but it'll take some thinking.

@Koeng101
Copy link
Owner

That is true. The incrementing at the end can cause some problems...

@CamelCaseCam
Copy link
Author

Hey - I've done some thinking and I've got a plan for how to implement this in CUDA. Let me know if you have any issues with it.

The plan:

  1. Implement MegaMashMap as an array of sequences and associated k-mer bloom filters. The bloom filter method introduces a (quantifiable) source of error, but it'll make it much, much easier to transfer the maps to shared memory, which we want since shared memory is on the GPU's cache.
  2. Each block in the kernel is assigned to a specific sequence from the MegaMashMap. The bloom filter is copied into shared memory and each thread compares all the k-mer hashes from some sequence in the input to the bloom filter to see if they're present, with an error rate
  3. Each thread in the block writes the fraction of found hashes for its sequence

This works well for up to 256 input sequences, and for more than that I'd just run the function multiple times. What are your thoughts? Especially on the bloom filter, since I know you avoided that when writing the algorithm.

@Koeng101
Copy link
Owner

I'd like to know your thoughts on #64 first actually - there are certain conditions where megamash simply fails.

@Koeng101
Copy link
Owner

I think the problem is still the fact that there is a need for longer kmer pair interactions. Ie, in those conditions where megamash fails.

I think there is actually a different way of thinking about the algorithm - instead of comparing the target sequences to the reference kmers, compare the reference kmers to the target sequences. Each reference kmer would be a list of hash groups (usually of size 1, but sometimes of a larger size, to compensate for longer range interactions). This complicates things a little, because I think bloom filters fit worse (you can't do the upfront computation), but also allows for the long range interactions.

Thoughts?

@Koeng101
Copy link
Owner

Cont at #64

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants