chore(dataobj): add bitmap encoding #15629
Merged
+886
−10
This commit adds bitmap encoding, the third and final type of encoding needed for the data object prototype. Bitmap encoding will be used to record NULLs within a page.
Bitmap encoding efficiently stores sequences of uint64 values as a combination of RLE runs and bitpacked runs. RLE runs are long sequences of the same value, while bitpacked runs consist of sets of 8 values packed together at the smallest possible bit width.
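To make "smallest possible bit width" concrete, here is a minimal sketch of packing one set of 8 values. It is not the code from this PR: the helper names `minWidth` and `pack8` are hypothetical, and both the LSB-first bit order and the minimum width of 1 are assumptions made for illustration.

```go
package main

import "math/bits"

// minWidth returns the smallest bit width that can hold every value in vals.
// OR-ing all values together preserves the highest set bit of the largest one.
// Assumption: an all-zero set still uses a width of 1.
func minWidth(vals [8]uint64) int {
	var combined uint64
	for _, v := range vals {
		combined |= v
	}
	if w := bits.Len64(combined); w > 0 {
		return w
	}
	return 1
}

// pack8 packs 8 values into exactly width bytes (8 values * width bits).
// Assumption: bits are written least significant bit first.
func pack8(vals [8]uint64, width int) []byte {
	out := make([]byte, width)
	bit := 0
	for _, v := range vals {
		for i := 0; i < width; i++ {
			if (v>>uint(i))&1 == 1 {
				out[bit/8] |= 1 << uint(bit%8)
			}
			bit++
		}
	}
	return out
}
```

For a NULL bitmap, where every value is 0 or 1, `minWidth` yields 1 and each set of 8 values fits in a single byte.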
Bitmap encoding is based on the RLE encoding format used by Parquet, with some notable changes to facilitate streaming writes:
- Our bitmap encoding doesn't use a fixed width for values. Instead, the width is determined when a bitpacked set is flushed. Bitpacked sets of the same width are then combined into a single run.
  - This comes at the cost of an extra byte per bitpacked run.
- Because values are streamed, the final length of the bitmap isn't included in the bitmap header. Callers can choose to prepend the length by writing the bitmap into a separate buffer and then writing a custom header. Without this, readers must know the exact number of encoded values so that they don't read past the end of the RLE sequence.
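The custom-header approach for the length could look roughly like the sketch below. This is an illustration only, not part of this PR: `prependCount` and `splitCount` are hypothetical helpers, and storing the number of encoded values as a uvarint is an assumed header format.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// prependCount returns a new buffer holding a uvarint value count followed by
// the already-encoded bitmap bytes. The caller is assumed to have written the
// bitmap into a separate buffer first, so the count is known by this point.
func prependCount(count uint64, encoded []byte) []byte {
	out := make([]byte, 0, binary.MaxVarintLen64+len(encoded))
	out = binary.AppendUvarint(out, count)
	return append(out, encoded...)
}

// splitCount reads the count header back and returns it along with the
// remaining bitmap bytes, so a reader knows how many values to decode.
func splitCount(framed []byte) (uint64, []byte, error) {
	count, n := binary.Uvarint(framed)
	if n <= 0 {
		return 0, nil, errors.New("invalid count header")
	}
	return count, framed[n:], nil
}
```

With a header like this, a reader can stop after exactly `count` values instead of relying on out-of-band knowledge.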
This code is unfortunately quite complex. I've tried to add comments for as much as I could, but if there's an easier way to do the bitpacking, I would love to move over to that.