Add batch size tuning docs #341

Open · wants to merge 2 commits into master
Conversation

Infernaught
Collaborator

Add docs for clarity on batch size tuning and the differences between ECD and LLM batch size tuning. DO NOT MERGE YET. Waiting on LLM batch size tuning PR to land.

@@ -0,0 +1,10 @@
To maximize efficiency, Ludwig performs automatic batch size tuning when the `batch_size` parameter is not set in the configuration, in order to best saturate the GPU. Batch size tuning does not occur during CPU training due to the lack of effective parallelization; Ludwig instead sets the batch size to a fixed value.
Collaborator


Suggestion for a minor rewrite:

"In Ludwig, users have the option to set batch_size to a fixed value as part of the training config.

trainer:
  batch_size: 128

If the batch size is unspecified Ludwig sets batch_size=auto.

trainer:
  batch_size: auto

auto enables Ludwig to select an efficient batch size automatically. The actual value of the batch size can be found in training logs and in the model output directory.

Batch size tuning is supported in single-node and multi-node CPU and GPU settings.

ECD Models

Batch size tuning for ECD models follows this procedure, starting from batch size 1:

  1. Perform a small number of forward passes through the model using a sample from the dataset, observing whether the model hits a memory error and measuring overall throughput (examples/sec).
  2. If the model hits a memory error or throughput decreases, use the last valid batch size. Otherwise, double the batch size and repeat step 1. (A sketch of this search appears below.)
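
For illustration, a minimal sketch of this doubling search, assuming a hypothetical `run_forward_passes(batch_size)` helper that performs the trial forward passes and returns throughput in examples/sec, raising `MemoryError` when the batch does not fit; this is not Ludwig's actual API:

```python
def tune_batch_size(run_forward_passes, max_batch_size=2 ** 16):
    """Double the batch size until an OOM error or a throughput drop.

    `run_forward_passes(batch_size)` is a hypothetical helper that runs a
    few forward passes and returns throughput (examples/sec), raising
    MemoryError if the batch does not fit in device memory.
    """
    best_batch_size = 1
    best_throughput = 0.0
    batch_size = 1
    while batch_size <= max_batch_size:
        try:
            throughput = run_forward_passes(batch_size)
        except MemoryError:
            break  # Batch no longer fits: fall back to the last valid size.
        if throughput <= best_throughput:
            break  # Throughput decreased: doubling no longer helps.
        best_batch_size, best_throughput = batch_size, throughput
        batch_size *= 2
    return best_batch_size
```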

LLMs

The main element that separates LLM batch size tuning from its ECD counterpart is the sequence length. LLMs thus undergo the same batch size tuning process as ECD models, with the exception that, instead of using a random sample from the dataset, the forward passes use a synthetic data sample with a sequence length equal to the specified max sequence length (or the longest sequence length in the provided dataset if the max sequence length is unspecified).
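
As a rough illustration, here is how such a full-length synthetic sample might be built; the vocabulary size and the use of random token IDs are assumptions for the sketch, not Ludwig's actual implementation:

```python
import torch

def make_synthetic_batch(batch_size, max_sequence_length, vocab_size=32000):
    """Build a worst-case synthetic batch for LLM batch size tuning.

    Every row is a full-length sequence of random token IDs, so memory
    usage during the trial forward passes reflects the longest sequences
    the model can see. (vocab_size=32000 is an illustrative assumption.)
    """
    input_ids = torch.randint(
        low=0,
        high=vocab_size,
        size=(batch_size, max_sequence_length),
        dtype=torch.long,
    )
    attention_mask = torch.ones_like(input_ids)  # All positions are real tokens.
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```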
