Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Megamash cannot call complicated sequences #64

Open
Koeng101 opened this issue Feb 15, 2024 · 10 comments
Open

Megamash cannot call complicated sequences #64

Koeng101 opened this issue Feb 15, 2024 · 10 comments
Labels
enhancement New feature or request

Comments

@Koeng101
Copy link
Owner

seq_recovery=0.4152_0 for example, cannot be processed into a megamash table because it doesn't have any unique 16mers. This is because of a flawed assumption: ESPECIALLY in protein variant libraries, 16mer windows won't necessarily be unique - and even if a sequence is unique, a 16mer sliding window may not be able to pick it up. This is a real issue.

I'm still trying to figure out how to fix this.

templateMap.csv

@CamelCaseCam
Copy link

CamelCaseCam commented Feb 15, 2024

I'm just throwing out ideas here, but what if you design a hash that takes into account the history of the sequence, but with less and less importance the further you move further along? I'm thinking something like the following:

hash_n = hash(chars[n]) + (hash_n-1 >> 1)

I think something like this would take much larger blocks of nucleotides into account, while making it so that nucleotides that are farther back don't have as much of an effect (so the core idea of the algorithm should still work)

If you decided to go with something like this, I think an even better solution would be combining k-mers and this rolling hash. You'd just have to check twice as many hashes, which may not be a problem with a bloom filter + CUDA

@Koeng101
Copy link
Owner Author

If you decided to go with something like this, I think an even better solution would be combining k-mers and this rolling hash. You'd just have to check twice as many hashes, which may not be a problem with a bloom filter + CUDA

The thing is that the importance does not diminish as you go along - if anything, it would increase. So I don't think a rolling hash is the answer.

I think the answer might be in suffix arrays. Ie, bebop/poly#42 and bebop/poly#15 , but applied to this problem.

@CamelCaseCam
Copy link

That was my bad - I forgot to include the masking. You'd obviously need to mask it so that the least significant bit disappears. I'm actually testing this right now because I feel like it could still work

@CamelCaseCam
Copy link

Okay I tested it and it does converge back to the same hash after a certain number of characters. Here's my test script - converged after an average of 32 chars, maximum 47 chars out of 100.

from hash import hash_string
import random
import string

def test_hash():
    # Generate a random string of length 10
    s = ''.join(random.choices(string.ascii_letters, k=100))
    # Store the hashes of the original string
    orig_hashes = hash_string(s)
    # Mutate the first letter of the string
    s = (random.choice(string.ascii_letters)) + s[1:]
    # Store the hashes of the mutated string
    mut_hashes = hash_string(s)
    # Compare the hashes and count how long it takes for them to become equal
    equal_at = None
    for i, (o, m) in enumerate(zip(orig_hashes, mut_hashes)):
        if o == m:
            equal_at = i
            break

    return equal_at if equal_at is not None else len(s)

# Define n and results
n = 1000  # Number of tests to run
results = []  # List to store the results

# Run the test n times
for _ in range(n):
    equal_at = test_hash()
    results.append(equal_at)

# Calculate and print the average, maximum, and minimum amount of time
times = [result for result in results if result is not None]
print(f"Average: {sum(times) / len(times)}")
print(f"Maximum: {max(times)}")
print(f"Minimum: {min(times)}")

@CamelCaseCam
Copy link

hash_string is in another file and is just a dummy hash function:

def hash_string(s):
    hashes = [hash(s[0])]
    for n in range(1, len(s)):
        hashes.append((hash(s[n]) + ((hashes[n - 1] >> 1) & 0x7FFFFFFF))) # clear LSB when adding
    return hashes

@CamelCaseCam
Copy link

After moving to a 64-bit hash, I get an average convergence in 64 chars, maximum in 90. It seems that for a hash of k bits, this approach converges back to the same value after an average of k iterations, maximum of about 1.4k

@Koeng101
Copy link
Owner Author

Hey, the other option I just thought of - don't force the kmer to only occur once. Just make it appear as few times as possible, and any kmer that is chosen, have a system to penalize kmers that match across identifiers too often.

This lets you do the classic bloom filters like you were talking about before, and doesn't require a rolling hash, just some additional upfront computation.

@CamelCaseCam
Copy link

Would that solve the issue of distinguishing between sequences that don't have unique k-mers, though?

@Koeng101
Copy link
Owner Author

It does not.

I think the best option might just be to do two-pass. First, compile all potential templates for a given well (based off of small megamashes from the oligos, which are sufficiently unique), then either do another megamash on top with just the potential templates or just do multiple sequence alignments, and choose the best one.

In that case, you're looking for much smaller oligos, and can usually get a unique kmer. Even in the case that you cannot get a unique kmer, the composition that you're looking for (potential target templates) is sufficiently limited that it doesn't really matter.

@CamelCaseCam
Copy link

Hey - sorry for the delay, I was off on reading week and focusing on seeing family. It sounds like a CUDA implementation of megamash would still be useful, so I'll do that. I'm going to include a flag in the program to use a k-mer hash vs a rolling hash (or both), since I still think the rolling hash could work

Can you send me some test files + the output from megamash on those test files? That way once I finish the algorithm I'll know it's working

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants