feat: cli arg to specify max parquet fanout #25714
#[clap(
    long = "datafusion-max-parquet-fanout",
    env = "INFLUXDB3_DATAFUSION_MAX_PARQUET_FANOUT",
    default_value = "1000",
Just checking whether 1000 is a good default value. I understand this depends on the size of the files, but given that it can result in OOM, I wanted to double-check that 1000 is still good.
Good to call this out. I copied the comment from IOx/core to preserve the context it provided. I think we may need to tune this a bit, or it could be possible to base the default on the system memory, and how we allocate memory in different modes in pro.
As it stands, with the low default of 40, we are getting OOMs with the fallback (i.e., non-fanout) query plan, so we should know soon whether increasing it this much makes the problem worse. Based on https://github.com/influxdata/influxdb_pro/issues/308#issuecomment-2562955195, this default may be a bit low or outdated (perhaps the way the DataFusion plan handles fanout is different from when the default was decided). Some distributed clusters in IOx set this to 800, as per https://github.com/influxdata/influxdb_pro/issues/308#issuecomment-2563245404.
We'll see how this goes. At a minimum, I got the env vars switched from INFLUXDB_IOX_ to INFLUXDB3_ 😄
I might have misunderstood the docs for this setting. I interpreted it as: the higher this number, the more files it tries to fan out, which leads to OOMs. If we don't fan out, then it leads to expensive in-memory sorting (guessing without running into OOMs?).
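To make the trade-off described in this comment concrete, here is a minimal sketch of how a planner might pick between the two strategies. The `ScanPlan` enum, the `choose_plan` function, and the `<=` comparison are all hypothetical names and assumptions for illustration; they are not taken from `iox_query` or DataFusion.

```rust
/// Hypothetical plan kinds; the names are illustrative only.
#[derive(Debug, PartialEq, Eq)]
enum ScanPlan {
    /// One sorted stream per file, combined by an order-preserving merge.
    FanOut { streams: usize },
    /// Too many files to fan out: read them together and re-sort in memory.
    ResortInMemory,
}

/// Assumption: fan out only while the file count stays within the limit.
fn choose_plan(num_files: usize, max_parquet_fanout: usize) -> ScanPlan {
    if num_files <= max_parquet_fanout {
        ScanPlan::FanOut { streams: num_files }
    } else {
        ScanPlan::ResortInMemory
    }
}
```

Under this sketch, a query over 100 files falls back to the re-sort plan with a limit of 40, and fans out with a limit of 1000.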
(guessing without running into OOMs?)
Unfortunately, though, it is OOM'ing without the fanout, while not OOM'ing with the fanout, so we may need to update this doc comment (see https://github.com/influxdata/influxdb_pro/issues/205#issuecomment-2565377397)
I think the memory sort is going to OOM, unless you set a memory limit on DF, but in that case it just means that the query will get killed and return a resource exhaustion error. The only way around that I can think of is if spill to disk is enabled, but that's not really much better either.
I think the fanout setting should effectively be ignored (i.e. set to whatever the max of the type is). Re-sorting the data is always going to be more expensive and is completely unnecessary in our case.
If DF allocates an arrow buffer for each input file, then you'd have that size * num of files. The Arrow buffer could be quite large if there are very wide tables and depending on the size of that buffer. I think one way to counter this would be to make sure that the pre-allocated buffer is limited in size or scaled down depending on the number of input files.
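The arithmetic in the last paragraph can be sketched roughly as follows. This assumes one fixed-size buffer per input file, which the comment itself only poses as a possibility; the function names and the idea of a shared byte budget are hypothetical and do not describe how DataFusion actually allocates buffers.

```rust
/// Worst-case bytes held if every input file gets its own buffer
/// (assumption: one buffer of equal size per file).
fn fanout_buffer_bytes(num_files: u64, buffer_bytes_per_file: u64) -> u64 {
    num_files.saturating_mul(buffer_bytes_per_file)
}

/// Hypothetical mitigation: split a fixed budget across the files,
/// never exceeding the usual per-file buffer size.
fn scaled_buffer_bytes(num_files: u64, total_budget_bytes: u64, max_buffer_bytes: u64) -> u64 {
    if num_files == 0 {
        return max_buffer_bytes;
    }
    (total_budget_bytes / num_files).min(max_buffer_bytes)
}
```

With illustrative numbers, 1000 files at 8 MiB each would hold about 8.4 GB in buffers, while scaling against a 1 GiB budget would shrink each buffer to roughly 1 MB.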
This is needed for https://github.com/influxdata/influxdb_pro/issues/308
This allows the max_parquet_fanout to be specified in the CLI for the influxdb3 serve command. This could be done previously via the --datafusion-config CLI argument, but the drawbacks to that were:
- if iox.max_parquet_fanout was not provided to that argument, the default would be set to 40

This PR maintains the existing --datafusion-config CLI argument (with one caveat, see below), which allows users to provide a set of key/value pairs that will be used to build the internal DataFusion config, but in addition provides the --datafusion-max-parquet-fanout argument, with the default value of 1000, which will override the core iox_query default of 40.

A test was added to check that this is propagated down to the IOxSessionContext that is used during queries.

The only change to the datafusion-config CLI argument was to rename INFLUXDB_IOX in the environment variable to INFLUXDB3.
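As a rough illustration of how the value of the new argument ends up being chosen, the sketch below assumes the usual clap precedence (explicit flag, then environment variable, then `default_value`). The function `resolve_max_parquet_fanout` and the constant name are hypothetical; the real code relies on the `#[clap(...)]` attribute rather than hand-written resolution.

```rust
use std::num::ParseIntError;

/// Default declared on the CLI argument in this PR (replaces the core default of 40).
const CLI_DEFAULT_MAX_PARQUET_FANOUT: usize = 1000;

/// Hypothetical helper: a flag value wins over the environment value,
/// and the CLI default applies when neither is present.
fn resolve_max_parquet_fanout(
    cli_flag: Option<&str>,
    env_value: Option<&str>,
) -> Result<usize, ParseIntError> {
    match cli_flag.or(env_value) {
        Some(raw) => raw.trim().parse::<usize>(),
        None => Ok(CLI_DEFAULT_MAX_PARQUET_FANOUT),
    }
}
```

Here `cli_flag` stands in for `--datafusion-max-parquet-fanout` and `env_value` for `INFLUXDB3_DATAFUSION_MAX_PARQUET_FANOUT`; passing neither yields 1000.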