chore(dataobj): add bitmap encoding #15629

Merged: 1 commit, Jan 7, 2025

@rfratto (Member) commented Jan 7, 2025

This commit adds bitmap encoding, the third and final type of encoding needed for the data object prototype. Bitmap encoding will be used to record NULLs within a page.

Bitmap encoding efficiently stores sequences of uint64 values as a combination of RLE runs and bitpacked runs. RLE runs are long sequences of the same value, while bitpacked runs hold sets of 8 values packed together into the smallest possible bit width.
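
To make the two run types concrete, here is a minimal sketch of how one set of 8 values could be classified and bitpacked. It is not the code from this PR: the helper names (`allEqual`, `minWidth`, `packSet`, `unpackSet`) are hypothetical, and the LSB-first bit order is an assumption borrowed from Parquet-style bitpacking.

```go
package main

import (
	"fmt"
	"math/bits"
)

// allEqual reports whether a set is a candidate for an RLE run
// (every value identical). Hypothetical helper, not from the PR.
func allEqual(set [8]uint64) bool {
	for _, v := range set[1:] {
		if v != set[0] {
			return false
		}
	}
	return true
}

// minWidth returns the smallest bit width that can hold every value
// in the set. It returns 0 when every value is zero.
func minWidth(set [8]uint64) int {
	var largest uint64
	for _, v := range set {
		if v > largest {
			largest = v
		}
	}
	return bits.Len64(largest)
}

// packSet packs 8 values at the given width, least significant bit
// first (assumed order). 8 values * width bits is exactly width bytes.
func packSet(set [8]uint64, width int) []byte {
	out := make([]byte, width)
	bit := 0
	for _, v := range set {
		for i := 0; i < width; i++ {
			if (v>>uint(i))&1 == 1 {
				out[bit/8] |= 1 << uint(bit%8)
			}
			bit++
		}
	}
	return out
}

// unpackSet reverses packSet for a set packed at the given width.
func unpackSet(data []byte, width int) [8]uint64 {
	var set [8]uint64
	bit := 0
	for n := range set {
		for i := 0; i < width; i++ {
			if (data[bit/8]>>uint(bit%8))&1 == 1 {
				set[n] |= 1 << uint(i)
			}
			bit++
		}
	}
	return set
}

func main() {
	// A NULL bitmap only needs values 0 and 1, so the width is 1 bit
	// and the whole set fits in a single byte.
	set := [8]uint64{1, 0, 1, 1, 0, 0, 1, 0}
	width := minWidth(set)
	packed := packSet(set, width)
	fmt.Printf("rle=%v width=%d packed=%#x\n", allEqual(set), width, packed)
	fmt.Println(unpackSet(packed, width) == set)
}
```

Because a set always holds exactly 8 values, the packed size in bytes equals the bit width, which keeps the sketch free of padding logic.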

Bitmap encoding is based on the RLE encoding format used by Parquet, with some notable changes to facilitate streaming writes:

  • Our bitmap encoding doesn't use a fixed width for values. Instead, the width is determined upon flushing a bitpacked set. Bitpacked sets of the same width are then combined into a single run.

    This comes at the cost of an extra byte per bitpacked run.

  • As values are streamed, the final length of the bitmap isn't included in the bitmap header. Callers can choose to prepend the length by writing the bitmap into a separate buffer and then writing a custom header. Without this, readers must know the exact number of encoded values so they don't read past the end of the RLE sequence.

This code is unfortunately quite complex. I've tried to add comments for as much as I could, but if there's an easier way to do the bitpacking, I would love to move over to that.

@rfratto marked this pull request as ready for review January 7, 2025 15:30
@rfratto requested a review from a team as a code owner January 7, 2025 15:30

@cyriltovena (Contributor) left a comment

LGTM

@rfratto merged commit 6042d07 into grafana:main Jan 7, 2025
59 checks passed
@rfratto deleted the dataset-bitmap branch January 7, 2025 16:43