Partitioned Index Filters

As DB/mem ratio gets larger, the memory footprint of filter/index blocks becomes non-trivial. Although cache_index_and_filter_blocks allows storing only a subset of them in block cache, their relatively large size negatively affects performance by i) occupying the block cache space that could otherwise be used for caching data, ii) increasing load on the disk storage for loading them into the cache after a miss. Here we illustrate these problems in more detail and explain how partitioning index/filters alleviates the overhead.

How large are the index/filter blocks?

RocksDB has by default one index/filter block per SST file. The size of index/filter varies based on the configuration but for a SST of size 256MB the index/filter block of 0.5/5MB is typical, which is much larger than the typical data block size of 4-32KB. That is fine when all index/filters fit into memory and hence are read once per SST lifetime, not so much when they compete with data blocks for the block cache space and are also likely to be re-read many times from the disk.

What is the big deal with large index/filter blocks?

When index/filter blocks are stored in block cache they are effectively competing with data blocks (as well as with each other) on this scarce resource. A filter of size 5MB is occupying the space that could otherwise be used to cache 1000s of data blocks (of size 4KB). This would result in more cache misses for data blocks. The large index/filters also kick each other out of the block cache more often and increase their own cache miss too. This is while only a small part of the index/filter block might have been actually used during its lifetime in the cache.

After a index/filter cache miss, it has to be reloaded from the disk and its large size is not helping in reducing the IO cost. While a simple point lookup might need at most a couple of data block reads (of size 4KB) one from each layer of LSM, it might end up also loading multiple megabytes of index/filter blocks. If that happens often then the disk is spending more time serving index/filters rather than the actual data blocks.

What is partitioned index/filters?

With partitioning, the index/filter of a SST file is partitioned into smaller blocks with an additional top-level index on them. When reading an index/filter, only top-level index is loaded into memory. The partitioned index/filter then uses the top-level index to lazily load into the block cache the partitions that are required to perform the index/filter query. The top-level index, which has much smaller memory footprint, can be stored in heap or block cache depending on the cache_index_and_filter_blocks setting.

Pros:

Higher cache hit rate: Instead of polluting the cache space with large index/blocks, partitioning allows loading index/filters with much finer granularity and hence making effective use of the cache space.
Less IO util: Upon a cache miss for an index/filter partition, only one partition requires to be loaded from the disk, which results into much lighter load on disk compared to reading the entire index/filter of the SST file.
No compromise on index/filters: Without partitioning the alternative approach to reduce memory footprint of index/filters is to sacrifice their accuracy by for example larger data blocks or lower bloom bits to have smaller index and filters respectively.

Cons:

Additional space for the top-level index: its quite small 0.1-1% of index/filter size.
More disk IO: if the top-level index is not already in cache it would result to one additional IO. To avoid that they can be either stored in heap or stored in cache with hi priority (TODO work)
Losing spatial locality: if a workload requires frequent, yet random reads from the same SST file it would result into loading a separate index/filter partition upon each read, which is less efficient than reading the index/filter at once. Although we did not observe this pattern in our benchmarks, it is only likely to happen for L0/L1 layers of LSM, for which partitioning can be disabled (TODO work)

How to use it?

index_type = IndexType::kTwoLevelIndexSearch
- This is to enable partitioned indexes
metadata_block_size = 4096
- This is the block size for index partitions.
cache_index_and_filter_blocks = false
- The partitions are stored in the cache anyway. This to control the location of top-level indexes, which easily fit into memory. Having them stored in block cache is less experimented with.
cache_index_and_filter_blocks_with_high_priority = true
- Recommended setting
block cache size: if you used to store the filter/index into heap, do not forget to increase the block cache size with the amount of memory that you are saving from the heap.

Current limitations

Partitioned filters cannot be enabled without having partitioned index enabled as well.
We have the same number of filter and index partitions. In other words, whenever an index block is cut, the filter block is cut as well. We might change it in future if it shows to be causing deficiencies.
The filter block size is determined by when the index block is cut. We will soon extend metadata_block_size to be enforce the maximum size on both filter and index blocks, i.e., a filter block is cut either when an index block is cut or when its size is about to exceed metadata_block_size (TODO).

Under the hood

Here we present the implementation details for the developers.

BlockBasedTable Format

You can study the BlockBasedTable format here. With partitioning the difference would be that the index block

[index block]

is stored as

[index block - partition 1]
[index block - partition 2]
...
[index block - partition N]
[index block - top-level index]

and the footer of SST points to the top-level index block (which by itself is an index on index partition blocks). Each individual index block partition conforms the same format as kBinarySearch. The top-level index format also conforms with that of kBinarySearch and hence can be read using the normal data block readers.

Similar structure is used for partitioning the filter blocks. The format of each individual filter block conforms with that of kFullFilter. The top-level index format conforms with that of kBinarySearch, similar to top-level index of index blocks.

Note that with partitioning the SST inspection tools such sst_dump report the size of top-level index on index/filters rather than the collective size of index/filter blocks.

Contents

RocksDB Wiki
Overview
RocksDB FAQ
Terminology
Requirements
Contributors' Guide
Release Methodology
RocksDB Users and Use Cases
RocksDB Public Communication and Information Channels
Basic Operations
- Iterator
- Prefix seek
- SeekForPrev
- Tailing Iterator
- Compaction Filter
- Multi Column Family Iterator
- Read-Modify-Write (Merge) Operator
- Column Families
- Creating and Ingesting SST files
- Single Delete
- Low Priority Write
- Time to Live (TTL) Support
- Transactions
- Snapshot
- DeleteRange
- Atomic flush
- Read-only and Secondary instances
- Approximate Size
- User-defined Timestamp
- Wide Columns
- BlobDB
- Online Verification
Options
- Setup Options and Basic Tuning
- Option String and Option Map
- RocksDB Options File
MemTable
Journal
- Write Ahead Log (WAL)
- MANIFEST
- Track WAL in MANIFEST
Cache
- Block Cache
- SecondaryCache (Experimental)
Write Buffer Manager
Compaction
- Leveled Compaction
- Universal compaction style
- FIFO compaction style
- Manual Compaction
- Subcompaction
- Choose Level Compaction Files
- Managing Disk Space Utilization
- Trivial Move Compaction
- Remote Compaction (Experimental)
SST File Formats
- Block-based Table Format
- PlainTable Format
- CuckooTable Format
- Index Block Format
- Bloom Filter
- Data Block Hash Index
IO
- Rate Limiter
- SST File Manager
- Direct I/O
Compression
- Dictionary Compression
Full File Checksum and Checksum Handoff
Background Error Handling
Huge Page TLB Support
Tiered Storage (Experimental)
Logging and Monitoring
- Logger
- Statistics
- Compaction Stats and DB Status
- Perf Context and IO Stats Context
- EventListener
Known Issues
Troubleshooting Guide
Tests
- Stress Test
- Fuzzing
- Benchmarking
Tools / Utilities
- Administration and Data Access Tool
- How to Backup RocksDB?
- Replication Helpers
- Checkpoints
- How to persist in-memory RocksDB database
- Third-party language bindings
- RocksDB Trace, Replay, Analyzer, and Workload Generation
- Block cache analysis and simulation tools
- IO Tracer and Parser
Implementation Details
- Delete Stale Files
- Partitioned Index/Filters
- WritePrepared-Transactions
- WriteUnprepared-Transactions
- How we keep track of live SST files
- How we index SST
- Merge Operator Implementation
- RocksDB Repairer
- Write Batch With Index
- Two Phase Commit
- Iterator's Implementation
- Simulation Cache
- [To Be Deprecated] Persistent Read Cache
- DeleteRange Implementation
- unordered_write
Extending RocksDB
- RocksDB Configurable Objects
- The Customizable Class
- Object Registry
RocksJava
- RocksJava Basics
- Logging in RocksJava
- JNI Debugging
- RocksJava API TODO
- RocksJava Performance on Flash Storage
- Tuning RocksDB from Java
Lua
- Lua CompactionFilter
Performance
- Performance Benchmarks
- In Memory Workload Performance
- Read-Modify-Write (Merge) Performance
- Delete A Range Of Keys
- Write Stalls
- Pipelined Write
- MultiGet Performance
- Tuning Guide
- Memory usage in RocksDB
- Speed-Up DB Open
- Implement Queue Service Using RocksDB
- Asynchronous IO
- Off-peak in RocksDB
Projects Being Developed
Misc
- Building on Windows
- Developing with an IDE
- Open Projects
- Talks
- Publication
- Features Not in LevelDB
- How to ask a performance-related question?
- Articles about Rocks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly