RagnarGrootKoerkamp committed Nov 29, 2024 (posts/brisk.org)

#+title: Comments on Brisk
#+filetags: @paper-review minimizers
#+OPTIONS: ^:{} num:
#+hugo_front_matter_key_replace: author>authors
#+toc: headlines 3
#+date: <2024-11-29 Fri>

These are some (biased) comments on Brisk,
a dynamic k-mer dictionary [cite:@brisk].

* Overview
As is common these days, Brisk builds a dynamic k-mer dictionary using
super-kmers, as does e.g. SSHash [cite:@sshash].
1. It uses double-decycling minimizers for their low density.
2. Super-kmers are clustered on their minimizer.
3. A /bucket/ is the set of super-kmers with the same minimizer.
4. New: To store a bucket, super-kmers are written in 'interleaved' form:
=CBA___XYZ= (with =___= being the minimizer) is stored as =___AXBYCZ=.
5. New: Even better, the minimizer itself is omitted, so only =AXBYCZ= is stored.
6. For super-kmers with non-maximal length, like =A___XY=, =N= characters
   (I'll use =*= for clarity) are used to fill the gaps: =**A___XY*=,
   which is stored as =AX*Y**=.
7. To search a bucket, we can usually narrow the linear scan over super-kmers to
only those sharing a prefix up to the same =N=.
8. Minimizers are hashed using a bijective hash function, and clustered into
   superbuckets. While minimizer counts have a very skewed distribution, these
   superbuckets have more uniform sizes.
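
A minimal sketch of the interleaving in steps 4-6 (my own toy code, not
Brisk's; =*= again stands for the =N= padding):

#+begin_src python
def interleave(superkmer: str, mini_start: int, k: int, m: int) -> str:
    """Interleave the characters around the minimizer of a super-kmer.

    The prefix before the minimizer is read inside-out and zipped with the
    suffix; both sides are first padded to their maximal length (k - m)
    with '*' (standing in for N). The minimizer itself is omitted.
    """
    max_side = k - m                       # max chars on each side of the minimizer
    prefix = superkmer[:mini_start][::-1]  # 'CBA' -> 'ABC' (inside-out)
    suffix = superkmer[mini_start + m:]    # 'XYZ'
    prefix += '*' * (max_side - len(prefix))
    suffix += '*' * (max_side - len(suffix))
    return ''.join(p + s for p, s in zip(prefix, suffix))

print(interleave("CBA___XYZ", 3, 6, 3))  # AXBYCZ
print(interleave("A___XY", 1, 6, 3))     # AX*Y**
#+end_src
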

* Detailed comments
** General
- A number of paragraphs are missing the final =s= for nearly all verbs and/or
have multiple ungrammatical sentences.
- All citations are currently shown as =\citet=, but instead, most ought to be =\citep=.
- Nothing is said about (not) supporting deletion queries.
- (nit: I generally don't see the point in uploading preprints with line numbers enabled.)
- No DOIs in references :(
** Abstract
- What's the $\mathfrak N$-like mark in the author list?
- exceptional throughput: This claim cannot be made without being more precise:
- construction times are not faster than any of the competitors
- query throughput is not at all compared to other methods
- /drop in replacement/: but deletions are not supported, and it's called a proof of concept.
** 1. Introduction
- Footnote 1 runs off the page.
- $O(N.K)$: use $k$ instead, and $\cdot$ instead of the dot.
- What /exactly/ does it mean to use a k-mer dictionary in a 'streaming' way?
- It is mentioned that CBL can use a lot of memory in the worst case because it
  does not build an SPSS, but neither does Brisk, in my understanding.
- /k-mer-level dynamism/: it is unclear to me what exactly is meant by this. To
me it would imply that individual k-mers can also be removed, but that is not
the case.
** 2. Methods
*** 2.1 Outline
- /can be built in streaming/: missing word?
- /only encoding the prefixes and suffixes before and after the minimizer
[cite:@fractional-hitting-sets]./: After a brief skim, I did not find this idea mentioned in the
cited paper.
- $O(S)$ should be $O(|S|)$
- What is /large minimizers/ referring to?
*** 2.2 Indexing super-k-mers
- /A simple scheme based on hashed $m$-mers achieves a near-optimal density of
$2/(k-m+1)$ [cite:@modmini]/.
- The sentence seems to refer to the random minimizer, which has a density of
$2/(k-m+2)$ instead, which is not near-optimal.
- The cited paper introduces the mod-minimizer, which is near-optimal in
specific cases, but has a completely different density.
- It should be $(k-m+2)/2$ k-mers per super-kmer. Probably from here onward,
every $k-m+1$ should be a $k-m+2$ instead.
- /In figure 1a, we show the mean super-k-mer size that can be obtained for
standard values of $k$ and $m$, and observe that practical results closely
match this approximation./
- The figure does not show the theoretical prediction.
- $2(3k-m-1)/(k-m+1)$: it is unclear to me where this comes from.
- In my experience, decycling minimizers are slower to compute than
  other minimizer schemes, but indeed, they do have low density.
- Figure 1b confuses me:
- in my experience, (double) decycling is never worse than random minimizers,
while in the figure decycling sometimes /is/ worse.
- the caption writes /Mean difference in ... and hashed minimizer strategies./
Should probably be singular /strategy/?
- While comparing to random minimizers is nice, really it would be better to
compare to some more of the schemes that are mentioned.
- it would be much clearer to simply show a plot of the density of the two
  schemes directly, rather than just the difference.
- /Since these minimizers can be selected in streaming with minimal
computational overhead/: the original decycling set paper
[cite:@minimum-decycling-set] does not provide code for streaming computation
of the minimizers. It can be done, but it should probably be remarked how.
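
As a sanity check on the density numbers above: the $2/(k-m+2)$ density of the
random minimizer (and hence the $(k-m+2)/2$ k-mers per super-kmer) is easy to
verify empirically. A quick sketch of mine, using Python's built-in string hash
as the random order:

#+begin_src python
import random

def minimizer_pos(window: str, m: int) -> int:
    """Start of the m-mer with the smallest hash (the random minimizer)."""
    return min(range(len(window) - m + 1), key=lambda i: hash(window[i:i + m]))

def density(seq: str, k: int, m: int) -> float:
    """Distinct selected minimizer positions per k-mer window."""
    selected = {i + minimizer_pos(seq[i:i + k], m)
                for i in range(len(seq) - k + 1)}
    return len(selected) / (len(seq) - k + 1)

random.seed(42)
seq = ''.join(random.choice('ACGT') for _ in range(100_000))
d = density(seq, k=31, m=13)
print(d)      # close to 2/(31-13+2) = 0.1
print(1 / d)  # mean k-mers per super-kmer, close to (31-13+2)/2 = 10
#+end_src
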
*** 2.3 Lazy encoding
- A brief quantitative comparison between the various bits-per-k-mer ratios in
the paper would be beneficial to understand the tradeoff between the bits
saved by not having to encode the minimizer, and the bits lost by encoding
maximal k-mers only.
- the reverse-complement of a minimizer is not necessarily a minimizer of the
reverse-complement sequence, unless special care is taken. This seems to be
assumed though.
- How does 'only consider canonical m-mers' interact with decycling minimizers?
I could see this requirement causing the decycling minimizers to behave much
worse than expected.
- A detailed worked example of this process would be beneficial, as many
papers skim details on reverse complements, and so a proposed solution should
be very precise.
- Are there issues when the minimizer of consecutive k-mers is different due to
the minimizer changing strands? Is the scheme still /forward/?
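
The strand-symmetry caveat above is easy to demonstrate with a toy
lexicographic minimizer (just an illustration; Brisk uses hashed m-mers, but
the same issue applies to any non-canonical selection):

#+begin_src python
def revcomp(s: str) -> str:
    comp = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    return ''.join(comp[c] for c in reversed(s))

def lex_minimizer(window: str, m: int) -> str:
    """Lexicographically smallest m-mer; a stand-in for any non-canonical scheme."""
    return min(window[i:i + m] for i in range(len(window) - m + 1))

w = "ACGTTGCA"
fwd = lex_minimizer(w, 3)          # 'ACG'
rc = lex_minimizer(revcomp(w), 3)  # 'AAC'
# If selection were strand-symmetric, rc would equal revcomp(fwd) = 'CGT':
print(rc == revcomp(fwd))  # False
#+end_src
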
*** 2.4 Probing
- 'given a given k-mer'
- 'A set of k-merS'
- 'in not trivial' => 'is not trivial'
- 'super-k-mer' => 'super-k-merS' a few times?
- 'sorting super-k-mers as it is not a good idea': ungrammatical
- It feels like all trailing =s='s were dropped here (/it give most importance/,
/this seem irrelevant/).
- The text first extends the minimizer on the left, then right, and then
alternates. Figure 3 and 4 do the opposite and first extend right and then left.
- I don't think anything is said about how =N= characters are encoded/compared
in practice? Using additional (masking) bits for this seems space inefficient?
- /we chose the base that base that are the less likely to be a N./ This is
unclearly worded, but it seems to imply that a single fixed character of
=ACTG= is used as =N=? How is it chosen? In the incremental setting, the
least-occurring character in the dataset may change over time. Probably it
does not depend on the super-kmer?
- 'this property do not grant'
- Figure 4 shows /the result of/ a binary search, not the steps of a binary
  search itself. (I'm assuming the boundaries of each shown block are found
  using an individual binary search.)
- 'and end it .' => 'and ends it.'
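
Regarding the narrowing of the linear scan (point 7 in the overview): with the
bucket kept sorted, two binary searches bound the run of super-kmers sharing a
prefix. A generic sketch of mine, not necessarily Brisk's exact procedure (note
that ='*'= sorts before the letters in ASCII):

#+begin_src python
import bisect

def prefix_range(bucket: list[str], prefix: str) -> tuple[int, int]:
    """Bound the contiguous run of entries starting with `prefix` in a sorted
    bucket; the linear scan is then restricted to bucket[lo:hi]."""
    lo = bisect.bisect_left(bucket, prefix)
    hi = bisect.bisect_left(bucket, prefix + '\x7f')  # '\x7f' > any base or '*'
    return lo, hi

# Hypothetical bucket of interleaved encodings, kept sorted:
bucket = sorted(["AXBYCZ", "AX*Y**", "AXBY**", "GT*C**"])
lo, hi = prefix_range(bucket, "AX")
print(bucket[lo:hi])  # ['AX*Y**', 'AXBY**', 'AXBYCZ']
#+end_src
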
*** 2.5 Superbuckets
- Figure 6 is terrible:
- The legend is confusing. It should be sorted.
- The y-axis is most likely logarithmic, but this is never mentioned.
- The y-axis label should be 'bucket count', not 'bucket number'.
- The x-axis appears to be labelled as $\log_4(\text{bucket size})$??? Is bucket
  size $0$ really $0$, or $1$? Are buckets with sizes $[2^i, 2^{i+1})$ batched
  together?
- Caption is missing spaces around 11.
- The text refers to $2^7$ and $2^{10}$, it appears that this is for $m=13$, but
this is not mentioned.
- 'bucketsPibiri'
- /C. elegans generates more very small buckets than teh random sequence./ To me
it seems to be the opposite, although the difference is small anyway.
- /Since the problem lies in the non-uniform distribution of minimizers, a
simple solution is to use a hash function to achieve a uniform distribution./
- It is unclear to me what this achieves. Permuting the minimizer buckets
using a bijective hash keeps the distribution of sizes the same.
- Since non-lexicographic minimizers are used, there should be little/no
correlation between bucket sizes of lexicographically-close minimizers.
So hashing the minimizers shouldn't be needed anyway?
- /using a surjective function would .. allow hash collisions/
- This should say /non-injective/ or so. The hash function being
surjective or not is really not important here.
- Figure 5: the =AA:= is weirdly line wrapped.
- Figure 6 is referenced again, but this seems to have nothing to do with the
current text. It appears the correct figure is missing.
- Unfortunately, it is hard to appreciate the impact/usefulness of the superbuckets
without this missing figure.
- It is unclear what the benefit of a smoother superbucket size distribution
  is. Isn't it more beneficial to simply have $4^m$ smaller buckets directly?
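
The objection that a bijective hash cannot change the size distribution can be
checked directly (toy sketch with hypothetical constants; any invertible mixer
works):

#+begin_src python
import random
from collections import Counter

def bijective_hash(x: int, bits: int = 16) -> int:
    """A toy invertible hash on bits-bit ints: odd multiplier mod 2^bits,
    then a xorshift (its own inverse when the shift is >= bits/2)."""
    mask = (1 << bits) - 1
    x = (x * 0x9E37) & mask  # odd multiplier => invertible mod 2^bits
    x ^= x >> (bits // 2)
    return x

random.seed(1)
# Heavily skewed bucket sizes, keyed by minimizer:
sizes = {mini: random.choice([1, 1, 1, 2, 1000]) for mini in range(1 << 16)}
hashed = {bijective_hash(mini): s for mini, s in sizes.items()}
# The bijection only relabels buckets; the multiset of sizes is unchanged:
print(Counter(sizes.values()) == Counter(hashed.values()))  # True
#+end_src
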

*** 2.6 Implementation details
- It is unclear how minimizers are mapped to their bucket.
- How are duplicate k-mers dealt with? What if a k-mer occurs in multiple
super-kmers? How is a canonical location chosen, especially after new
super-kmers containing the same k-mer are inserted?
- Some (pseudo)-code would go a long way to explain what is going on at a high level.
- Does every sort trigger inserting the buffered super-kmers into the sorted
  list? Doesn't that move the entire list, triggering linear-time behaviour on
  every insert/sort?

** 3. Results
*** 3.1 Parameters
- Figure 7:
- what is $m$?
- For $b=17$, memory usage goes up to 500GB, but the benchmark machine only has 128GB.
  This really needs a remark on virtual memory pages; better, the real
  memory usage should be shown instead of the virtual memory usage.
*** 3.2 Multicore
- 'the dictionary': it was never explained what the main dictionary is
- 'substructures': too vague to understand
*** 3.4 Comparison
- What are $b$ and $m$?
*** 3.5 Query times
- nit: Random queries are not exactly the same as negative queries.
- Why is query throughput not compared to other methods?
- Fig 11: axes labels are too small.
** 4. Conclusion
- Either Brisk is a /proof of concept/, or it's a directly usable replacement
  for existing k-mer dictionaries. Not both.
- /state-of-the-art throughput/: again, query throughput was not compared.
- Why can't $k$ be $64$? It's not required to be odd.
- /any empty position with [a super-k-mer] are never filled/: I do not
understand what the empty positions refer to.



#+print_bibliography:
@Article{hyper-kmer,
author = {Martayan, Igor and Robidou, Lucas and Shibuya, Yoshihiro and
Limasset, Antoine},
  title = {Hyper-k-mers: efficient streaming k-mers representation},
year = 2024,
month = nov,
doi = {10.1101/2024.11.06.620789},
url = {http://dx.doi.org/10.1101/2024.11.06.620789},
publisher = {Cold Spring Harbor Laboratory}
}

@InProceedings{fractional-hitting-sets,
author = {Rouzé, Timothé and Martayan, Igor and Marchet, Camille and
Limasset, Antoine},
title = {Fractional Hitting Sets for Efficient and Lightweight Genomic
Data Sketching},
year = 2023,
language = {en},
publisher = {Schloss Dagstuhl – Leibniz-Zentrum für Informatik},
doi = {10.4230/LIPICS.WABI.2023.15},
  url = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2023.15},
keywords = {k-mer, subsampling, sketching, Jaccard, containment,
metagenomics, Applied computing → Bioinformatics},
copyright = {Creative Commons Attribution 4.0 International license}
}


@Article{brisk,
  author = {Smith, Caleb and Martayan, Igor and Limasset, Antoine and Dufresne, Yoann},
  title = {Brisk: Exact resource-efficient dictionary for k-mers},
  elocation-id = {2024.11.26.625346},
  year = 2024,
  doi = {10.1101/2024.11.26.625346},
  url = {https://www.biorxiv.org/content/early/2024/11/28/2024.11.26.625346},
  publisher = {Cold Spring Harbor Laboratory},
  journal = {bioRxiv}
}
