-
-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Easy performance optimizations for MegaMash #56
Comments
I was thinking maybe a single map would be most efficient of kmer -> target. That would be the single best improvement, I think. |
Idea 1: change how you generate standardized DNA sequences. Considering the numerical value of the whole string is kind of inefficient, but if that's how you want to do it, you should approach this differently. Before taking the reverse complement of the sequence, loop through the string and add up both the value of the base and the value of the complement to the base at that position. Since addition is commutative, it doesn't matter that you haven't reversed it at this point. If the non-complement value is lower, return the string as-is (saving two iterations of the string - since you don't have to calculate the reverse complement or sum it). Otherwise, calculate the complement and return it (still saves one iteration). |
I forgot to send that earlier - the map idea seems interesting. Why do you have the multiple maps, anyways? |
Was easier to implement, didn't know it would be a limiting factor
I would imagine |
Yes, but isn't that already the case? (I'm assuming "alphabetically lesser string" sums up the string and returns the one with the overall lesser value) As an aside, if this is how it does it, can't we just compare the first base instead? The way I'm reading the code, it doesn't actually matter if the string is alphabetically lesser. It just needs to be a single deterministic representation. |
No, that is not the case. It does not sum the string and return the one with the lesser value. It sorts alphabetically the two strings. |
Oh, so it already does exactly what I was suggesting. Ignore my suggestion, then. The kmer -> target map seems like a good idea. I'm going to think about an efficient way to implement it in CUDA |
Would love to learn about what you come up with! Would be great to have a fast megamash algorithm. Thinking more about the minimizers from minimap2, it might be useful to limit the quantity of possible matches for a given sequence. |
One question: if you have say N kmers and you're checking them against a bunch of sequences, do you care about which sequence each kmer is in? If you only care if the kmer was found or not, this would be a lot easier because it'd mean I don't have to care about race conditions |
Yes, you do. This is because you're trying to link a sequence's kmers to a set of unique kmers of sequences that you're searching upon. I'm curious what would start a race condition though |
Based on your code, it looks like you're checking what fraction of the kmers are present in each sequence, right? When you compile code to increment a variable, it doesn't end up being a single instruction. Each thread needs to grab the value of the variable from memory, increment it, and write it back to memory. If multiple threads do this at the same time, they can overwrite it and you'll loose some fraction of the kmers that were actually there. There's definitely a way I could implement this to avoid the race condition, but it'll take some thinking. |
That is true. The incrementing at the end can cause some problems... |
Hey - I've done some thinking and I've got a plan for how to implement this in CUDA. Let me know if you have any issues with it. The plan:
This works well for up to 256 input sequences, and for more than that I'd just run the function multiple times. What are your thoughts? Especially on the bloom filter, since I know you avoided that when writing the algorithm. |
I'd like to know your thoughts on #64 first actually - there are certain conditions where megamash simply fails. |
I think the problem is still the fact that there is a need for longer kmer pair interactions. Ie, in those conditions where megamash fails. I think there is actually a different way of thinking about the algorithm - instead of comparing the target sequences to the reference kmers, compare the reference kmers to the target sequences. Each reference kmer would be a list of hash groups (usually of size 1, but sometimes of a larger size, to compensate for longer range interactions). This complicates things a little, because I think bloom filters fit worse (you can't do the upfront computation), but also allows for the long range interactions. Thoughts? |
Cont at #64 |
I'm currently working on a CUDA implementation for MegaMash, and as I'm re-implementing it I'm finding ways you could make it more efficient in GO. I'll throw them in this thread as I think of them.
The text was updated successfully, but these errors were encountered: