
deepspeed-chat: filter stage3 too long prompts #782

Merged

Conversation

mosheisland
Contributor

When stage3 prompts are too long, they are still used but are arbitrarily sliced at the start to fit the configured max prompt length. This arbitrary slicing sometimes makes the prompts less meaningful, which in turn causes the generator to generate garbage. This phenomenon was observed to destabilize RLHF stage3. To fix it, we filter out prompts that are too long.

In addition, the dataset rebuild flag is propagated to the other consumers that require it. Note that since the generated datasets are cached on disk, this commit takes effect only if we clean up the cached step3 datasets.

Change-Id: I440f09decf0784e4c2c8167a893006dff312281b
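As a rough illustration of the fix, here is a minimal sketch (not the actual DeepSpeed-Chat code; the function name, the HF-style tokenizer call, and the max_prompt_len parameter are assumptions for illustration):

```python
def build_prompt_dataset(raw_prompts, tokenizer, max_prompt_len):
    """Keep only prompts whose tokenized length fits the configured budget.

    Previously, over-long prompts were kept and sliced at the start,
    which could strip the meaningful part of the prompt; here they
    are filtered out instead.
    """
    kept = []
    for prompt in raw_prompts:
        tokens = tokenizer(prompt)["input_ids"]
        if len(tokens) <= max_prompt_len:
            kept.append(tokens)
        # else: drop the prompt rather than slicing tokens[-max_prompt_len:]
    return kept
```

Because the processed datasets are cached on disk, the cached step3 datasets must be deleted for the filtering to take effect.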

@lekurile
Contributor

lekurile commented Nov 8, 2023

Hi @mosheisland,

Thank you for the contribution!

The reasoning behind this PR makes sense to me, but I'd be curious about the implications for training convergence of the OPT model. We've previously done a full OPT Step 3 sweep across various configurations and found training to converge in all tested cases:
https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-chat/ds-chat-release-8-31/README.md#4-stability-bug-fixes-

We have a sweep script to make characterization simple:
https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/opt/single_node/sweep

I'd be very curious about the impact of this prompt change on convergence. I don't expect it to necessarily affect convergence negatively, but it's probably worth running nonetheless.

Thanks,
Lev

Signed-off-by: Moshe Island <[email protected]>
@mosheisland mosheisland force-pushed the 16_filter_stage3_too_long_prompts branch from 06a94f8 to ad55abc Compare November 20, 2023 07:38
@mosheisland
Contributor Author

Hi @lekurile,

I have run the sweep with and without this PR.
In almost all configurations, this PR improves results.
The details are below.

@lekurile, please note that the sweep test is currently broken due to commit 0855679 "Fix typo (#749)".
This commit fixes the typo of self.rwtranrsformer to self.rwtransformer.
However, since the sweep script uses step1 and step2 models trained before this commit, their weights are no longer loaded correctly.
In order to run the sweep, I reverted this commit locally.
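For context, the failure mode is the standard state_dict key mismatch after an attribute rename; the snippet below is a hypothetical illustration, not the actual reward model code:

```python
import torch.nn as nn

class RewardModelOld(nn.Module):  # as saved before commit 0855679
    def __init__(self):
        super().__init__()
        self.rwtranrsformer = nn.Linear(4, 4)  # misspelled attribute

class RewardModelNew(nn.Module):  # after the typo fix
    def __init__(self):
        super().__init__()
        self.rwtransformer = nn.Linear(4, 4)  # corrected attribute

# A checkpoint saved before the fix stores keys like "rwtranrsformer.weight",
# so loading it into the fixed model leaves the pretrained weights unused.
ckpt = RewardModelOld().state_dict()
missing, unexpected = RewardModelNew().load_state_dict(ckpt, strict=False)
print(missing)     # ['rwtransformer.weight', 'rwtransformer.bias']
print(unexpected)  # ['rwtranrsformer.weight', 'rwtranrsformer.bias']
```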

Please also note that the number of training steps with this PR is smaller than without, due to the filtering.

DeepSpeed versions used for testing:
- GENERIC sweep: DeepSpeed f036f00c, same as was used in stability-bug-fixes
- MPL sweep: DeepSpeed 6ea44d02

Instead of 16xV100, I used 8xA100-80G.
Therefore, I increased both --per_device_generation_batch_size and --per_device_training_batch_size from 4 to 8.

RESULTS

Following are GENERIC sweep EMA reward results:

CONFIGURATION                           BEFORE                   WITH_PR
z2_he_true_offload_true_lora_true       0.4753948346725629       0.6010128015626672
z2_he_true_offload_false_lora_true      0.4793704747405848       0.6044063562798943
z2_he_true_offload_false_lora_false     0.44937469644538547      0.5898120794270199
z2_he_false_offload_true_lora_true      0.47848488088442004      0.6145153250498651
z2_he_false_offload_true_lora_false     0.47772121644609994      0.4705303483370994
z2_he_false_offload_false_lora_true     0.4942085498792294       0.625479032328376
z2_he_false_offload_false_lora_false    -0.049893910540181136    0.6374905284590799
z3_he_true_offload_true_lora_true       0.4692788802627764       0.6190144034178486
z3_he_true_offload_true_lora_false      0.5394484735740004       -0.050979439449979125
z3_he_true_offload_false_lora_true      0.4813882254539342       0.5539103174328391
z3_he_true_offload_false_lora_false     0.43964012828155885      0.6193358555058606
z3_he_false_offload_true_lora_true      0.5034960575985696       0.5916298786973844
z3_he_false_offload_true_lora_false     0.2312764145956392       0.2544757543793474
z3_he_false_offload_false_lora_true     0.4756843619519737       0.5746860958220634
z3_he_false_offload_false_lora_false    0.11386992699245523      0.5343188981333544
AVERAGE                                 0.4039162140826          0.5226425490255

Following are MPL sweep EMA reward results:

CONFIGURATION                                  BEFORE                WITH_PR
z3_he_true_offload_false_lora_true_mpl_true    0.425907800430223     0.6460371105349498
z3_he_false_offload_false_lora_true_mpl_true   0.45612089316449245   0.6015554197083343
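For reference, the EMA reward above is an exponential moving average of the per-step reward; a minimal sketch, where the smoothing factor alpha=0.9 is an assumption for illustration, not necessarily the sweep's setting:

```python
def ema_reward(rewards, alpha=0.9):
    """Exponentially smoothed reward trace.

    alpha weights history; 0.9 is assumed for illustration only.
    """
    avg = rewards[0]
    trace = [avg]
    for r in rewards[1:]:
        avg = alpha * avg + (1 - alpha) * r
        trace.append(avg)
    return trace

# A noisy but improving trace smooths toward its trend:
print(ema_reward([0.1, 0.5, 0.2, 0.6, 0.4, 0.7]))
```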

Following are TensorBoard images of the GENERIC sweep runs before this PR and with this PR:

BEFORE: [TensorBoard screenshot: opt-sweep-]

WITH this PR: [TensorBoard screenshot: opt-sweep-with-PR]

Following are TensorBoard images of the MPL sweep runs before this PR and with this PR:

BEFORE: [TensorBoard screenshot: opt-sweep-mp]

WITH this PR: [TensorBoard screenshot: opt-sweep-mp-with-pr]

@lekurile
Contributor

Thanks for the amazing work @mosheisland! I appreciate the thoroughness. The data looks good and shows that training is still very stable.

I'll approve the PR and run the tests.

Thanks,
Lev

@tjruwase tjruwase merged commit 8c551d2 into microsoft:master Nov 21, 2023
2 checks passed
@mosheisland mosheisland deleted the 16_filter_stage3_too_long_prompts branch November 22, 2023 07:52