
feat: cli arg to specify max parquet fanout #25714

Merged
merged 1 commit into main from hiltontj/parquet-fan-out on Dec 27, 2024

Conversation

hiltontj
Contributor

This is needed for https://github.com/influxdata/influxdb_pro/issues/308

This allows the `max_parquet_fanout` to be specified in the CLI for the `influxdb3 serve` command. Previously, this could be done via the `--datafusion-config` CLI argument, but that had two drawbacks:

  1. it is a fairly advanced option, given that the available key/value pairs are not well documented
  2. if `iox.max_parquet_fanout` was not provided to that argument, the default would be set to 40

This PR maintains the existing `--datafusion-config` CLI argument (with one caveat, see below), which allows users to provide a set of key/value pairs that will be used to build the internal DataFusion config, but in addition provides the `--datafusion-max-parquet-fanout` argument:

    --datafusion-max-parquet-fanout <MAX_PARQUET_FANOUT>
          When multiple parquet files are required in a sorted way (e.g. for de-duplication), we have two options:

          1. **In-mem sorting:** Put them into `datafusion.target_partitions` DataFusion partitions. This limits the fan-out, but requires that we potentially chain multiple parquet files into a single DataFusion partition. Since chaining sorted data does NOT automatically result in sorted data (e.g. AB-AB is not sorted), we need to perform an in-memory sort using `SortExec` afterwards. This is expensive.
          2. **Fan-out:** Instead of chaining files within DataFusion partitions, we can accept a fan-out beyond `target_partitions`. This prevents in-memory sorting but may result in OOMs (out-of-memory) if the fan-out is too large.

          We try to pick option 2 up to a certain number of files, which is configured by this setting.

          [env: INFLUXDB3_DATAFUSION_MAX_PARQUET_FANOUT=]
          [default: 1000]

with a default value of 1000, which overrides the core `iox_query` default of 40.
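For example, either of the following would set the fanout at startup (the flag and environment variable are taken from the help text above; other `serve` arguments are elided):

    influxdb3 serve --datafusion-max-parquet-fanout 800 ...
    INFLUXDB3_DATAFUSION_MAX_PARQUET_FANOUT=800 influxdb3 serve ...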

A test was added to check that this is propagated down to the `IOxSessionContext` that is used during queries.

The only change to the `--datafusion-config` CLI argument was to rename the `INFLUXDB_IOX` prefix in the environment variable to `INFLUXDB3`:

    --datafusion-config <DATAFUSION_CONFIG>
          Provide custom configuration to DataFusion as a comma-separated list of key:value pairs.

          # Example
          ```text
          --datafusion-config "datafusion.key1:value1, datafusion.key2:value2"
          ```

          [env: INFLUXDB3_DATAFUSION_CONFIG=]
          [default: ]
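For comparison, the previous route for setting the fanout through this argument would have looked like the following (using the `iox.max_parquet_fanout` key mentioned above; other arguments elided):

    influxdb3 serve --datafusion-config "iox.max_parquet_fanout:1000" ...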

@hiltontj hiltontj added the v3 label Dec 27, 2024
@hiltontj hiltontj self-assigned this Dec 27, 2024
@hiltontj hiltontj merged commit 03ea565 into main Dec 27, 2024
13 checks passed
@hiltontj hiltontj deleted the hiltontj/parquet-fan-out branch December 27, 2024 17:42
    #[clap(
        long = "datafusion-max-parquet-fanout",
        env = "INFLUXDB3_DATAFUSION_MAX_PARQUET_FANOUT",
        default_value = "1000",
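For context, a minimal self-contained sketch of how an argument declared with that attribute might look, assuming clap with the `derive` and `env` features; the surrounding struct, field name, and doc comment are assumptions for illustration, since the excerpt above is truncated and only the attribute contents come from the diff:

```rust
use clap::Parser;

/// Hypothetical config struct; only the clap attribute contents below
/// are taken from the PR diff.
#[derive(Debug, Parser)]
struct ServeConfig {
    /// Maximum number of parquet files allowed in a query's fan-out
    #[clap(
        long = "datafusion-max-parquet-fanout",
        env = "INFLUXDB3_DATAFUSION_MAX_PARQUET_FANOUT",
        default_value = "1000"
    )]
    datafusion_max_parquet_fanout: usize,
}

fn main() {
    let config = ServeConfig::parse();
    println!("max parquet fanout: {}", config.datafusion_max_parquet_fanout);
}
```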
Contributor

Just checking whether 1000 is a good default value. I understand this depends on the size of the files, but given that it can result in OOM, I wanted to double-check that 1000 is still good.

Contributor Author

Good to call this out. I copied the comment from IOx/core to preserve the context it provided. I think we may need to tune this a bit, or it could be possible to base the default on the system memory and on how we allocate memory in different modes in pro.

As it stands, with the low default of 40, we are getting OOMs with the fallback, i.e., non-fanout, query plan, so we should know soon whether increasing this much makes the problem worse or not. Based on https://github.com/influxdata/influxdb_pro/issues/308#issuecomment-2562955195, this default may be a bit low/outdated (perhaps the way the DataFusion plan handles fanout differs from when the default was decided). Some distributed clusters in IOx set this to 800, as per https://github.com/influxdata/influxdb_pro/issues/308#issuecomment-2563245404.

We'll see how this goes - at the minimum, I got the env vars switched from `INFLUXDB_IOX_` to `INFLUXDB3_` 😄

Contributor

I might have misunderstood the docs for this setting. I interpreted it as: the higher this number, the more files it tries to fan out, which leads to OOMs. If we don't fan out, it instead does expensive in-memory sorting (guessing without running into OOMs?).

Contributor Author
@hiltontj hiltontj Dec 30, 2024

> (guessing without running into OOMs?)

Unfortunately, though, it is OOM'ing without the fanout, while not OOM'ing with the fanout, so we may need to update this doc comment (see https://github.com/influxdata/influxdb_pro/issues/205#issuecomment-2565377397).

Member

I think the memory sort is going to OOM, unless you set a memory limit on DF, but in that case it just means that the query will get killed and return a resource exhaustion error. The only way around that I can think of is if spill to disk is enabled, but that's not really much better either.
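As a hedged illustration of setting such a limit with the upstream datafusion crate (API names vary somewhat across DataFusion versions; this is generic DataFusion usage, not influxdb3 code):

```rust
use std::sync::Arc;
use datafusion::execution::runtime_env::{RuntimeConfig, RuntimeEnv};
use datafusion::prelude::{SessionConfig, SessionContext};

fn main() -> datafusion::error::Result<()> {
    // Cap DataFusion's memory pool at 1 GiB (all of it usable by queries);
    // queries exceeding the pool fail with a resources-exhausted error
    // instead of taking down the whole process.
    let runtime = RuntimeEnv::new(RuntimeConfig::new().with_memory_limit(1 << 30, 1.0))?;
    let ctx = SessionContext::new_with_config_rt(SessionConfig::new(), Arc::new(runtime));
    let _ = ctx; // the context would then be used to plan and run queries
    Ok(())
}
```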

I think the fanout setting should effectively be ignored (i.e. set to whatever the max of the type is). Resorting the data is always going to be more expensive and completely unnecessary in our case.

If DF allocates an arrow buffer for each input file, then you'd have that buffer size times the number of files. The Arrow buffer could be quite large if there are very wide tables. I think one way to counter this would be to make sure that the pre-allocated buffer is limited in size, or scaled down depending on the number of input files, as in the sketch below.
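To make that concrete, here is a back-of-the-envelope sketch of the scaling idea (all names and numbers are hypothetical illustrations, not influxdb3 or DataFusion code):

```rust
/// Hypothetical helper: divide a fixed total read-buffer budget across the
/// input files, capped at a per-file maximum, so the per-file allocation
/// shrinks as the fan-out grows instead of multiplying with it.
fn per_file_buffer_size(total_budget: usize, num_files: usize, max_per_file: usize) -> usize {
    if num_files == 0 {
        return max_per_file; // nothing to scale; fall back to the cap
    }
    (total_budget / num_files).min(max_per_file)
}

fn main() {
    // With a 1 GiB budget, a fan-out of 1000 files yields ~1 MiB per file,
    // while 40 files would each get the full 8 MiB cap.
    assert_eq!(per_file_buffer_size(1 << 30, 1000, 8 << 20), (1 << 30) / 1000);
    assert_eq!(per_file_buffer_size(1 << 30, 40, 8 << 20), 8 << 20);
}
```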
