greedymini comments

RagnarGrootKoerkamp · Nov 4, 2024 · 6531cd1 · 6531cd1
1 parent 97a6f64
commit 6531cd1
Showing 1 changed file with 110 additions and 0 deletions.
diff --git a/posts/greedymini.org b/posts/greedymini.org
@@ -0,0 +1,110 @@
+#+title: [WIP] Comments on GreedyMini
+#+filetags: @paper-review minimizers wip
+#+OPTIONS: ^:{} num: num:
+#+hugo_front_matter_key_replace: author>authors
+#+toc: headlines 3
+#+date: <2024-11-04 Mon>
+
+These are some (biased) comments on [cite/title/b:@greedymini]
+[cite:@greedymini], which introduces the =GreedyMini= and =GreedyMini+=
+minimizer schemes.
+
+TODO: Also write some comments on [cite/title/b:@random-mini-density] [cite:@random-mini-density].
+
+* Overview
+
+* Detailed comments
+** Terminology
+A number of terms in the preprint deviate from established terminology (mostly by the
+papers of Zheng and Marçais):
+- Usually =particular density= is used for specific sequences, and =density= is
+  the expected particular density on a random string. Thus, to me, =expected
+  density= parses as the expected value of the expected particular density,
+  which is redundant. Prefer just =density=.
+- In [cite/t:@random-mini-density], /gamechanger/ was introduced instead of the more
+  common /charged window/. But now it's /charged window/ again.
+- In [cite/t:@random-mini-density], /marked positions/ and /markup of $S$/ are
+  introduced, but in the current work those are dropped again.
+
+
+
+
+** Abstract
+(I'm in a pedantic mood --- I'm sorry.)
+
+   #+begin_quote
+   Minimizers is [...] . It is [...]
+   #+end_quote
+   ~Minimizers~ is plural?
+
+
+    #+begin_quote
+    Minimizers that achieve the smallest number of selected k-mers over a random
+    DNA sequence, termed expected density, are desired [...].
+    #+end_quote
+  - =(expected) density= is not the /smallest/ number of selected k-mers.
+  - 'the number of selected k-mers over a random sequence' is missing an
+    'expected' somewhere.
+  - (nit) density is not only defined for DNA sequences.
+
+  #+begin_quote
+  Yet, no method to date exists to generate minimizers that achieve minimum
+  expected density.
+  #+end_quote
+  What does =minimum= mean? Some existing scheme is best. Probably you mean
+  =optimal= (equal to the lower bound) instead? But then you should qualify it
+  with =for all $k, w, \sigma$= or so, since the mod-minimizer is near-optimal
+  when e.g.~$k=w+1$ and $\sigma\to\infty$.
+
+  #+begin_quote
+  Moreover, existing selection schemes fail to achieve low density for values of
+  $k$ and $w$ that are most practical for high-throughput sequencing algorithms
+  and datastructures.
+  #+end_quote
+  You seem to focus mostly on $k \sim w$. In this regime, (double) decycling minimizers,
+  are pretty good? And for $k>w$, the mod-minimizer is pretty good.
+  (In general, I'd argue all existing schemes are already pretty close to the
+  density lower bound, and hence that they all achieve 'low density'.)
+
+
+  #+begin_quote
+  Moreover, we present innovative techniques to transform minimizers [...] to
+  larger $k$ values, [...].
+  #+end_quote
+  This was done before by [cite/t:@asymptotic-optimal-minimizers],
+  called the /naive extension/ (see 3.1.1).
+
+
+  #+begin_quote
+  practical values of $k$ and $w$
+  #+end_quote
+  Can you be more specific? Since you are hiding exponential running times, what
+  about e.g.~$(k, w) \sim (22, 42)$?
+
+  Generally, it seems this method cannot go much beyond $k=15$ since it needs
+  $\sigma^k$ memory?
+
+
+  #+begin_quote
+  both expected and particular densities
+  #+end_quote
+  So far you were using =density= to mean what previous work calls =particular
+  density=, but now you also use =particular density=. Be consistent.
+
+  #+begin_quote
+  densities much lower compared to existing selection schemes
+  #+end_quote
+  Please quantify.
+
+  #+begin_quote
+  We expect =GreedyMini+= to improve the performance of many high-throughput
+  sequencing algorithms and data structures
+  #+end_quote
+  One drawback of =GreedyMini= seems that it uses memory exponential in $k$. For
+  $k>11$ or so, the order will likely not fit in cache, and each processed k-mer
+  requires a read from main memory. Even in the best case, this will limit
+  (single threaded) throughput to around $10ns$ per kmer, over $10\times$ slower
+  than my fast minimizer implementation.
+
+
+#+print_bibliography: