#+title: [WIP] Comments on GreedyMini
#+filetags: @paper-review minimizers wip
#+OPTIONS: ^:{} num:
#+hugo_front_matter_key_replace: author>authors
#+toc: headlines 3
#+date: <2024-11-04 Mon>

These are some (biased) comments on [cite/title/b:@greedymini]
[cite:@greedymini], which introduces the =GreedyMini= and =GreedyMini+=
minimizer schemes.

TODO: Also write some comments on [cite/title/b:@random-mini-density] [cite:@random-mini-density].

* Overview

* Detailed comments
** Terminology
A number of terms in the preprint deviate from established terminology (mostly by the
papers of Zheng and Marçais):
- Usually =particular density= is used for specific sequences, and =density= is
  the expected particular density on a random string (see the sketch below this
  list). Thus, to me, =expected density= parses as the expected value of the
  expected particular density, which is redundant. Prefer just =density=.
- In [cite/t:@random-mini-density], /gamechanger/ was introduced instead of the more
common /charged window/. But now it's /charged window/ again.
- In [cite/t:@random-mini-density], /marked positions/ and /markup of $S$/ are
introduced, but in the current work those are dropped again.
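
For reference, here is how I read the usual definitions for the first point
above (following Zheng and Marçais; the exact normalization is from memory, so
treat it as a sketch): the /particular density/ of a scheme $f$ on a string $S$ is
\[
d_S(f) = \frac{\#\{\text{positions selected by } f \text{ in } S\}}{\#\{\text{windows of } S\}},
\]
and the /density/ is its expectation $d(f) = \lim_{n\to\infty} \mathbb{E}\bigl[d_S(f)\bigr]$
over uniform random strings $S$ of length $n$.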




** Abstract
(I'm in a pedantic mood --- I'm sorry.)

#+begin_quote
Minimizers is [...] . It is [...]
#+end_quote
~Minimizers~ is plural, so shouldn't this be /Minimizers are/?


#+begin_quote
Minimizers that achieve the smallest number of selected k-mers over a random
DNA sequence, termed expected density, are desired [...].
#+end_quote
- =(expected) density= is not the /smallest/ number of selected k-mers.
- 'the number of selected k-mers over a random sequence' is missing an
'expected' somewhere.
- (nit) density is not only defined for DNA sequences.

#+begin_quote
Yet, no method to date exists to generate minimizers that achieve minimum
expected density.
#+end_quote
What does =minimum= mean? Among the schemes that already exist, some scheme
necessarily achieves the lowest density. Probably you mean =optimal= (equal to
the lower bound) instead? But then you should qualify it
with =for all $k, w, \sigma$= or so, since the mod-minimizer is near-optimal
when e.g.~$k=w+1$ and $\sigma\to\infty$.
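
For context, the lower bound I have in mind here is the one for forward
schemes, roughly
\[
d(f) \;\ge\; \frac{\lceil (w+k)/w \rceil}{w+k}
\]
(quoted from memory, ignoring lower-order $\sigma$-dependent corrections), which
the mod-minimizer gets close to in that regime.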

#+begin_quote
Moreover, existing selection schemes fail to achieve low density for values of
$k$ and $w$ that are most practical for high-throughput sequencing algorithms
and datastructures.
#+end_quote
You seem to focus mostly on $k \sim w$. In this regime, (double) decycling minimizers
are pretty good? And for $k>w$, the mod-minimizer is pretty good.
(In general, I'd argue all existing schemes are already pretty close to the
density lower bound, and hence that they all achieve 'low density'.)


#+begin_quote
Moreover, we present innovative techniques to transform minimizers [...] to
larger $k$ values, [...].
#+end_quote
This was done before by [cite/t:@asymptotic-optimal-minimizers], where it is
called the /naive extension/ (see 3.1.1).


#+begin_quote
practical values of $k$ and $w$
#+end_quote
Can you be more specific? Since you are hiding exponential running times, what
about e.g.~$(k, w) \sim (22, 42)$?

Generally, it seems this method cannot go much beyond $k=15$ since it needs
$\sigma^k$ memory?
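
As a quick back-of-the-envelope (assuming the order is stored as one small
entry per k-mer, which may not match the actual implementation): for $\sigma=4$
and $k=15$ that is already $4^{15} = 2^{30} \approx 1.1\cdot 10^9$ entries,
i.e. gigabytes of memory at a few bytes per entry.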


#+begin_quote
both expected and particular densities
#+end_quote
So far you were using =density= to mean what previous work calls =particular
density=, but now you also use =particular density=. Be consistent.

#+begin_quote
densities much lower compared to existing selection schemes
#+end_quote
Please quantify.

#+begin_quote
We expect =GreedyMini+= to improve the performance of many high-throughput
sequencing algorithms and data structures
#+end_quote
One drawback of =GreedyMini= seems to be that it uses memory exponential in $k$. For
$k>11$ or so, the order will likely not fit in cache, and each processed k-mer
requires a read from main memory. Even in the best case, this will limit
(single-threaded) throughput to around $10\,\mathrm{ns}$ per k-mer, over $10\times$ slower
than my fast minimizer implementation.
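
To make that concrete (rough numbers; assuming one or two bytes per k-mer in
the order table): for $k=12$ the table has $4^{12} \approx 1.7\cdot 10^7$
entries, so tens of megabytes, beyond L1/L2 and a large fraction of L3. A
random DRAM access has a latency on the order of $100\,\mathrm{ns}$; even with
memory-level parallelism hiding most of it, an effective $\sim 10\,\mathrm{ns}$
per lookup seems an optimistic best case.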


#+print_bibliography:
