From 6531cd1716323d628eae00f2caa932ec40cf2533 Mon Sep 17 00:00:00 2001
From: Ragnar Groot Koerkamp
Date: Mon, 4 Nov 2024 03:24:32 +0100
Subject: [PATCH] greedymini comments

---
 posts/greedymini.org | 110 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 110 insertions(+)
 create mode 100644 posts/greedymini.org

diff --git a/posts/greedymini.org b/posts/greedymini.org
new file mode 100644
index 0000000..ebaa49b
--- /dev/null
+++ b/posts/greedymini.org
@@ -0,0 +1,110 @@
+#+title: [WIP] Comments on GreedyMini
+#+filetags: @paper-review minimizers wip
+#+OPTIONS: ^:{} num:
+#+hugo_front_matter_key_replace: author>authors
+#+toc: headlines 3
+#+date: <2024-11-04 Mon>
+
+These are some (biased) comments on [cite/title/b:@greedymini]
+[cite:@greedymini], which introduces the =GreedyMini= and =GreedyMini+=
+minimizer schemes.
+
+TODO: Also write some comments on [cite/title/b:@random-mini-density]
+[cite:@random-mini-density].
+
+* Overview
+
+* Detailed comments
+** Terminology
+A number of terms in the preprint deviate from established terminology (mostly
+set by the papers of Zheng and Marçais):
+- Usually =particular density= is used for specific sequences, and =density= is
+  the expected particular density on a random string. Thus, to me, =expected
+  density= parses as the expected value of the expected particular density,
+  which is redundant. Prefer just =density= (see the sketch below this list).
+- In [cite/t:@random-mini-density], /gamechanger/ was introduced instead of the
+  more common /charged window/. But now it's /charged window/ again.
+- In [cite/t:@random-mini-density], /marked positions/ and /markup of $S$/ are
+  introduced, but in the current work those are dropped again.
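+
+To make the first point concrete, here is a small sketch (mine, not from the
+paper): the /particular density/ of a scheme on a string $S$ is the fraction of
+k-mer positions of $S$ that are selected in some window (conventions differ
+slightly in the denominator), and the /density/ is its expectation over a
+random string, which we can estimate by averaging over a few sampled strings.
+The scheme below is a plain random minimizer given by a hash-based order, and
+the $2/(w+1)$ printed at the end is the classic approximation of its density,
+as a sanity check.
+
+#+begin_src rust
+use std::collections::hash_map::DefaultHasher;
+use std::collections::HashSet;
+use std::hash::{Hash, Hasher};
+
+/// A "random" order on k-mers: simply hash the k-mer bytes.
+fn order(kmer: &[u8]) -> u64 {
+    let mut h = DefaultHasher::new();
+    kmer.hash(&mut h);
+    h.finish()
+}
+
+/// Particular density of the random minimizer on `s`: the fraction of k-mer
+/// positions of `s` that are the minimum of some window of w k-mers.
+fn particular_density(s: &[u8], k: usize, w: usize) -> f64 {
+    let mut selected = HashSet::new();
+    // A window covers w consecutive k-mers, i.e. w + k - 1 characters.
+    for i in 0..=s.len() - (w + k - 1) {
+        // Position of the minimal k-mer in window i (leftmost on ties).
+        let p = (i..i + w).min_by_key(|&j| order(&s[j..j + k])).unwrap();
+        selected.insert(p);
+    }
+    selected.len() as f64 / (s.len() - k + 1) as f64
+}
+
+fn main() {
+    let (k, w) = (8, 12);
+    let n = 100_000; // length of each random string
+    let reps = 10; // density = expected particular density; average a few samples
+    let mut rng: u64 = 0x243F6A8885A308D3; // fixed seed for a simple xorshift
+    let mut sum = 0.0;
+    for _ in 0..reps {
+        let s: Vec<u8> = (0..n)
+            .map(|_| {
+                rng ^= rng << 13;
+                rng ^= rng >> 7;
+                rng ^= rng << 17;
+                b"ACGT"[(rng % 4) as usize]
+            })
+            .collect();
+        sum += particular_density(&s, k, w);
+    }
+    println!(
+        "estimated density: {:.4}  (2/(w+1) = {:.4})",
+        sum / reps as f64,
+        2.0 / (w + 1) as f64
+    );
+}
+#+end_src
+
+On random DNA this should land close to the $2/(w+1)$ reference value, since
+$k=8$ is large enough that duplicate k-mers within a window are rare.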
+
+** Abstract
+(I'm in a pedantic mood --- I'm sorry.)
+
+#+begin_quote
+Minimizers is [...] . It is [...]
+#+end_quote
+~Minimizers~ is plural?
+
+#+begin_quote
+Minimizers that achieve the smallest number of selected k-mers over a random
+DNA sequence, termed expected density, are desired [...].
+#+end_quote
+- =(expected) density= is not the /smallest/ number of selected k-mers.
+- 'the number of selected k-mers over a random sequence' is missing an
+  'expected' somewhere.
+- (nit) density is not only defined for DNA sequences.
+
+#+begin_quote
+Yet, no method to date exists to generate minimizers that achieve minimum
+expected density.
+#+end_quote
+What does =minimum= mean? Some existing scheme is best. Probably you mean
+=optimal= (equal to the lower bound) instead? But then you should qualify it
+with =for all $k, w, \sigma$= or so, since the mod-minimizer is near-optimal
+when e.g.~$k=w+1$ and $\sigma\to\infty$.
+
+#+begin_quote
+Moreover, existing selection schemes fail to achieve low density for values of
+$k$ and $w$ that are most practical for high-throughput sequencing algorithms
+and datastructures.
+#+end_quote
+You seem to focus mostly on $k \sim w$. In this regime, (double) decycling
+minimizers are pretty good? And for $k>w$, the mod-minimizer is pretty good.
+(In general, I'd argue all existing schemes are already pretty close to the
+density lower bound, and hence that they all achieve 'low density'.)
+
+#+begin_quote
+Moreover, we present innovative techniques to transform minimizers [...] to
+larger $k$ values, [...].
+#+end_quote
+This was done before by [cite/t:@asymptotic-optimal-minimizers], where it is
+called the /naive extension/ (see 3.1.1).
+
+#+begin_quote
+practical values of $k$ and $w$
+#+end_quote
+Can you be more specific? Since you are hiding exponential running times, what
+about e.g.~$(k, w) \sim (22, 42)$?
+
+Generally, it seems this method cannot go much beyond $k=15$, since it needs
+$\sigma^k$ memory?
+
+#+begin_quote
+both expected and particular densities
+#+end_quote
+So far you were using =density= to mean what previous work calls =particular
+density=, but now you also use =particular density=. Be consistent.
+
+#+begin_quote
+densities much lower compared to existing selection schemes
+#+end_quote
+Please quantify.
+
+#+begin_quote
+We expect =GreedyMini+= to improve the performance of many high-throughput
+sequencing algorithms and data structures
+#+end_quote
+One drawback of =GreedyMini= seems to be that it uses memory exponential in
+$k$. For $k>11$ or so, the order will likely not fit in cache, and each
+processed k-mer requires a read from main memory. Even in the best case, this
+will limit (single-threaded) throughput to around $10$ ns per k-mer, over
+$10\times$ slower than my fast minimizer implementation.
+
+#+print_bibliography: