
Projects Being Developed

cheng-chang edited this page Dec 30, 2020 · 36 revisions

This page lists major projects that are being actively developed or have been planned for future development.

RocksDB on Remote Storage

API between RocksDB and underlying storage

We recently completed a major refactoring of the rocksdb::Env class by separating the storage-related interfaces into a class of their own, called rocksdb::FileSystem. In the long term, the storage interfaces in Env will be deprecated, and the main purpose of Env will be to abstract the core OS functionality that RocksDB needs. The relevant PRs are https://github.com/facebook/rocksdb/pull/5761 and https://github.com/facebook/rocksdb/pull/6552.

Over time, we will implement new functionality enabled by this separation -

  1. Richer error handling - A compliant FileSystem implementation can return information about an IO error in IOStatus, such as whether it is transient/retryable, whether it indicates permanent data loss, and whether its scope is a single file or the entire file system. This will allow RocksDB to do more intelligent error handling.
  2. Fail fast - For file systems that allow callers to provide a timeout for an IO, RocksDB can provide better SLAs for user reads by offering an option to specify a deadline and failing a Get/MultiGet as soon as the deadline is exceeded. This is an ongoing project.
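
To make the two items above concrete, here is a minimal, self-contained sketch of how error detail and a deadline could drive decisions. All type and function names (IoError, ErrorScope, Action, DecideAction, DeadlineExceeded) are hypothetical and chosen for illustration only; this is not the rocksdb::IOStatus or rocksdb::FileSystem API, and the policy shown is just one plausible choice.

```cpp
#include <chrono>

// Hypothetical stand-in for the detail an IO error could carry.
// NOT the rocksdb::IOStatus API; names are illustrative only.
enum class ErrorScope { kFile, kFileSystem };

struct IoError {
  bool ok = true;
  bool retryable = false;               // transient, worth retrying
  bool data_loss = false;               // permanent data loss
  ErrorScope scope = ErrorScope::kFile; // how much is affected
};

enum class Action { kContinue, kRetry, kFailFile, kStopWrites };

// One possible policy: retry transient errors, otherwise react
// according to how much of the storage is affected.
inline Action DecideAction(const IoError& e) {
  if (e.ok) return Action::kContinue;
  if (e.retryable && !e.data_loss) return Action::kRetry;
  if (e.scope == ErrorScope::kFileSystem) return Action::kStopWrites;
  return Action::kFailFile;
}

using Clock = std::chrono::steady_clock;

// Fail-fast check: true once the caller's deadline has been reached.
inline bool DeadlineExceeded(Clock::time_point deadline,
                             Clock::time_point now) {
  return now >= deadline;
}
```

A read path following this idea would call DeadlineExceeded before each IO and return a timeout error to the caller instead of issuing further reads.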

User Defined Timestamps

https://docs.google.com/document/d/1FcDjOM8-pJzCajCa9waQkox6DKJIWZAHm36buK4wfqs/edit#heading=h.uxub5284i1ti

BlobDB

BlobDB is RocksDB's implementation of key-value separation, originally inspired by the WiscKey paper. Large values (blobs) are stored in separate blob files, and only references to them are stored in RocksDB's LSM tree. By separating value storage from the LSM tree, BlobDB provides an alternative way of reducing write amplification, instead of tuning compactions. BlobDB is used in production at Facebook.
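
The idea of key-value separation can be illustrated with a toy in-memory model. This is only a sketch: ToyKvSeparatedStore, Entry, and the min_blob_size threshold are hypothetical names, a std::map stands in for the LSM tree, and a std::vector stands in for blob files. It does not reflect BlobDB's actual API or on-disk format.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy model of key-value separation. Not BlobDB's real layout or API.
struct Entry {
  bool is_blob_ref = false;
  std::string inline_value;    // used when !is_blob_ref
  std::size_t blob_index = 0;  // used when is_blob_ref
};

class ToyKvSeparatedStore {
 public:
  explicit ToyKvSeparatedStore(std::size_t min_blob_size)
      : min_blob_size_(min_blob_size) {}

  void Put(const std::string& key, const std::string& value) {
    Entry e;
    if (value.size() >= min_blob_size_) {
      // Large value: append to blob storage, keep only a reference.
      blobs_.push_back(value);
      e.is_blob_ref = true;
      e.blob_index = blobs_.size() - 1;
    } else {
      // Small value: store inline next to the key.
      e.inline_value = value;
    }
    index_[key] = e;
  }

  bool Get(const std::string& key, std::string* value) const {
    auto it = index_.find(key);
    if (it == index_.end()) return false;
    *value = it->second.is_blob_ref ? blobs_[it->second.blob_index]
                                    : it->second.inline_value;
    return true;
  }

  std::size_t BlobCount() const { return blobs_.size(); }

 private:
  std::size_t min_blob_size_;
  std::map<std::string, Entry> index_;  // stands in for the LSM tree
  std::vector<std::string> blobs_;      // stands in for blob files
};
```

In this model, anything that rewrites index_ (the analogue of compaction) moves only keys, small values, and references, which is the intuition behind the reduced write amplification.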

File Checksums

See https://github.com/facebook/rocksdb/wiki/Full-File-Checksum#the-next-step

Per Key/Value Checksum

Encryption at Rest

MultiGet()

See MultiGet Performance for background. We have the following related projects in various stages of planning and implementation -

  • Support partitioned filter and index - The first phase of MultiGet provided significant performance improvement for full filter block and index, through various techniques such as reusing blocks, reusing index iterators, prefetching CPU cachelines etc. We plan to extend these to partitioned filters and indexes.
  • Parallelize file reads in a single level - Currently MultiGet can parallelize reads to the same SST file. We plan to enhance this by parallelizing reads across all files in a single LSM level, thus benefiting more workloads.
  • Deadline/timeouts - Users will be able to specify a deadline for a MultiGet request, and RocksDB will abort the request if the deadline is exceeded.
  • Limit cumulative value size - Users will be able to specify an upper limit on the total size of values read by MultiGet, in order to control memory overhead.
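
The last two items can be sketched with a toy batched lookup. This is a minimal illustration under stated assumptions: ToyMultiGet, LookupStatus, LookupResult, and the parameter names are hypothetical, a std::map stands in for the database, and the exact semantics (for example, which status is reported for keys skipped after the limit is hit) are a guess rather than RocksDB's behavior.

```cpp
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

enum class LookupStatus { kOk, kNotFound, kAborted, kTimedOut };

struct LookupResult {
  LookupStatus status;
  std::string value;
};

using Clock = std::chrono::steady_clock;

// Toy batched lookup with a deadline and a soft cap on total value bytes.
// Illustrative only; this is not the RocksDB MultiGet implementation.
inline std::vector<LookupResult> ToyMultiGet(
    const std::map<std::string, std::string>& store,
    const std::vector<std::string>& keys,
    std::size_t value_size_soft_limit,
    Clock::time_point deadline) {
  std::vector<LookupResult> results;
  results.reserve(keys.size());
  std::size_t bytes_read = 0;
  for (const auto& key : keys) {
    if (Clock::now() >= deadline) {
      // Deadline reached: stop doing work for the remaining keys.
      results.push_back({LookupStatus::kTimedOut, ""});
      continue;
    }
    if (bytes_read >= value_size_soft_limit) {
      // Cumulative size cap reached: skip remaining keys to bound memory.
      results.push_back({LookupStatus::kAborted, ""});
      continue;
    }
    auto it = store.find(key);
    if (it == store.end()) {
      results.push_back({LookupStatus::kNotFound, ""});
      continue;
    }
    bytes_read += it->second.size();
    results.push_back({LookupStatus::kOk, it->second});
  }
  return results;
}
```

The limit is "soft" in this sketch because it is checked before each read, so the value that crosses the threshold is still returned.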

Bloom Filter Improvements

First phase complete, including

Planned:

  • Minimize memory internal fragmentation on generated filters (https://github.com/facebook/rocksdb/pull/6427)
  • Investigate use of different bits/key for different levels (as in Monkey)
  • Investigate use of alternative data structures, most likely based on perfect hashing static functions. See Xor filter, modified with "fuse graph" construction. Or even sgauss. We don't expect much difference in query times, but we expect the primary trade-off to be between construction time and memory footprint for a given false positive rate. It's likely that L0 will continue to construct Bloom filters (fast memtable flushes) while compaction will spend more time to generate more compact structures.
  • Re-vamp how filters are configured (based on above developments), probably moving away from bits/key as a proxy for accuracy.
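
For background on why bits/key is used as a proxy for accuracy, the textbook estimate for a standard Bloom filter with b bits per key and k probes is a false positive rate of roughly (1 - e^(-k/b))^k, minimized near k = b ln 2. The sketch below computes this estimate; the function names are hypothetical, and RocksDB's actual filter implementations may deviate from the textbook formula.

```cpp
#include <cassert>
#include <cmath>

// Textbook false positive estimate for a standard Bloom filter.
// An approximation only; not RocksDB's filter code.
inline double StandardBloomFpRate(double bits_per_key, int num_probes) {
  assert(bits_per_key > 0 && num_probes > 0);
  return std::pow(1.0 - std::exp(-num_probes / bits_per_key), num_probes);
}

// Number of probes that approximately minimizes the estimate above.
inline int OptimalNumProbes(double bits_per_key) {
  int k = static_cast<int>(std::lround(bits_per_key * std::log(2.0)));
  return k < 1 ? 1 : k;
}
```

Under this estimate, 10 bits/key with 7 probes gives a false positive rate of about 0.8%. Different filter structures reach a given rate with different memory, which is one reason a bits/key setting alone may not describe accuracy well.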

Improving Testing

Fuzz Testing

https://github.com/facebook/rocksdb/tree/master/fuzz

Adaptive Compaction

Improving RocksDB Backups

Improving Memory Efficiency

DRAM usage has been identified as an area where RocksDB can become more efficient. It is increasingly attractive for hosts to use denser SSDs, which forces a lower DRAM/SSD size ratio. For example, users may find it more cost-effective to run on "Storage Optimized" EC2 hosts, whose DRAM/SSD ratio is usually 1:31 (as of Dec 2020). While RocksDB can functionally operate at such a ratio, it should push the performance limits on those setups.

There are some ongoing or planned projects in this area:

  • Bloom Filter Improvements, discussed above
  • Track and strictly cap all memory usage by RocksDB with one single limit
  • Seek a more compact index format
  • Make RocksDB more friendly to jemalloc to reduce fragmentation
  • Improve partitioned index to reduce its performance issues
  • Compress data in block cache. To make better use of DRAM for block cache, compressing its contents is a straightforward idea. A compressed cache or reliance on the OS page cache can achieve some of this, but neither works well when the DRAM/SSD ratio is low. We need to look for a better way to compress some of the data there.

Including https://github.com/facebook/rocksdb/issues/6521
