
deepspeed-chat: filter stage3 too long prompts #782

Merged

Conversation

mosheisland
Contributor

When stage3 prompts are too long, they are still used but are arbitrarily sliced at the start to fit the configured max prompt length. This arbitrary slicing sometimes makes the prompts less meaningful, which in turn causes the generator to generate garbage. This phenomenon was observed to destabilize RLHF stage3. To fix it, we filter out prompts that are too long.

In addition, the dataset rebuild flag is propagated to the other consumers that require it. Note that since the generated datasets are cached on disk, this commit takes effect only if we clean up the cached step3 datasets.

Change-Id: I440f09decf0784e4c2c8167a893006dff312281b
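As a rough illustration of the fix, here is a minimal sketch (not the actual DeepSpeed-Chat code; the function name, the HF-style tokenizer call, and the max_prompt_len parameter are assumptions for illustration):

```python
def build_prompt_dataset(raw_prompts, tokenizer, max_prompt_len):
    """Keep only prompts whose tokenized length fits the configured budget.

    Previously, over-long prompts were kept and sliced at the start,
    which could strip the meaningful part of the prompt; here they
    are filtered out instead.
    """
    kept = []
    for prompt in raw_prompts:
        tokens = tokenizer(prompt)["input_ids"]
        if len(tokens) <= max_prompt_len:
            kept.append(tokens)
        # else: drop the prompt rather than slicing tokens[-max_prompt_len:]
    return kept
```

Because the processed datasets are cached on disk, the cached step3 datasets must be deleted for the filtering to take effect.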

@lekurile
Contributor

lekurile commented Nov 8, 2023

Hi @mosheisland,

Thank you for the contribution!

The reasoning behind this PR makes sense to me, but I'd be curious about the implications for training convergence of the OPT model. We've previously done a full OPT Step 3 sweep across various configurations and found training to converge in all tested cases:
https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-chat/ds-chat-release-8-31/README.md#4-stability-bug-fixes-

We have a sweep script to make characterization simple:
https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/opt/single_node/sweep

I'd be very curious about the impact of this prompt change on convergence. I don't expect it to necessarily affect convergence negatively, but it's probably worth running nonetheless.

Thanks,
Lev

Signed-off-by: Moshe Island <[email protected]>
@mosheisland mosheisland force-pushed the 16_filter_stage3_too_long_prompts branch from 06a94f8 to ad55abc Compare November 20, 2023 07:38
@mosheisland
Contributor Author

Hi @lekurile,

I have run the sweep with and without this PR.
In almost all configurations, this PR improves results.
The details are below.

@lekurile, please note that the sweep test is currently broken due to commit 0855679 "Fix typo (#749)".
This commit fixes the typo of self.rwtranrsformer to self.rwtransformer.
However, since the sweep script uses step1 and step2 models trained before this commit, their weights are no longer loaded correctly.
In order to run the sweep, I reverted this commit locally.
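For context, the failure mode is the standard state_dict key mismatch after an attribute rename; the snippet below is a hypothetical illustration, not the actual reward model code:

```python
import torch.nn as nn

class RewardModelOld(nn.Module):  # as saved before commit 0855679
    def __init__(self):
        super().__init__()
        self.rwtranrsformer = nn.Linear(4, 4)  # misspelled attribute

class RewardModelNew(nn.Module):  # after the typo fix
    def __init__(self):
        super().__init__()
        self.rwtransformer = nn.Linear(4, 4)  # corrected attribute

# A checkpoint saved before the fix stores keys like "rwtranrsformer.weight",
# so loading it into the fixed model leaves the pretrained weights unused.
ckpt = RewardModelOld().state_dict()
missing, unexpected = RewardModelNew().load_state_dict(ckpt, strict=False)
print(missing)     # ['rwtransformer.weight', 'rwtransformer.bias']
print(unexpected)  # ['rwtranrsformer.weight', 'rwtranrsformer.bias']
```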

Please also note that the number of training steps with this PR is smaller than without, due to the filtering.

DeepSpeed versions used for testing:
- GENERIC sweep: DeepSpeed f036f00c, same as was used in stability-bug-fixes
- MPL sweep: DeepSpeed 6ea44d02

Instead of 16xV100, I used 8xA100-80G.
Therefore, I increased both --per_device_generation_batch_size and --per_device_training_batch_size from 4 to 8.

RESULTS

Following are GENERIC sweep EMA reward results:

CONFIGURATION                           BEFORE                   WITH_PR
z2_he_true_offload_true_lora_true       0.4753948346725629       0.6010128015626672
z2_he_true_offload_false_lora_true      0.4793704747405848       0.6044063562798943
z2_he_true_offload_false_lora_false     0.44937469644538547      0.5898120794270199
z2_he_false_offload_true_lora_true      0.47848488088442004      0.6145153250498651
z2_he_false_offload_true_lora_false     0.47772121644609994      0.4705303483370994
z2_he_false_offload_false_lora_true     0.4942085498792294       0.625479032328376
z2_he_false_offload_false_lora_false    -0.049893910540181136    0.6374905284590799
z3_he_true_offload_true_lora_true       0.4692788802627764       0.6190144034178486
z3_he_true_offload_true_lora_false      0.5394484735740004       -0.050979439449979125
z3_he_true_offload_false_lora_true      0.4813882254539342       0.5539103174328391
z3_he_true_offload_false_lora_false     0.43964012828155885      0.6193358555058606
z3_he_false_offload_true_lora_true      0.5034960575985696       0.5916298786973844
z3_he_false_offload_true_lora_false     0.2312764145956392       0.2544757543793474
z3_he_false_offload_false_lora_true     0.4756843619519737       0.5746860958220634
z3_he_false_offload_false_lora_false    0.11386992699245523      0.5343188981333544
AVERAGE                                 0.4039162140826          0.5226425490255

Following are MPL sweep EMA reward results:

CONFIGURATION                                  BEFORE                WITH_PR
z3_he_true_offload_false_lora_true_mpl_true    0.425907800430223     0.6460371105349498
z3_he_false_offload_false_lora_true_mpl_true   0.45612089316449245   0.6015554197083343
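For reference, the EMA reward above is an exponential moving average of the per-step reward; a minimal sketch, where the smoothing factor alpha=0.9 is an assumption for illustration, not necessarily the sweep's setting:

```python
def ema_reward(rewards, alpha=0.9):
    """Exponentially smoothed reward trace.

    alpha weights history; 0.9 is assumed for illustration only.
    """
    avg = rewards[0]
    trace = [avg]
    for r in rewards[1:]:
        avg = alpha * avg + (1 - alpha) * r
        trace.append(avg)
    return trace

# A noisy but improving trace smooths toward its trend:
print(ema_reward([0.1, 0.5, 0.2, 0.6, 0.4, 0.7]))
```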

Following are TensorBoard images of the GENERIC sweep runs before this PR and with this PR:

BEFORE: [TensorBoard screenshot: opt-sweep-]

WITH this PR: [TensorBoard screenshot: opt-sweep-with-PR]

Following are TensorBoard images of the MPL sweep runs before this PR and with this PR:

BEFORE: [TensorBoard screenshot: opt-sweep-mp]

WITH this PR: [TensorBoard screenshot: opt-sweep-mp-with-pr]

@lekurile
Contributor

Thanks for the amazing work @mosheisland! I appreciate the thoroughness. The data looks good and shows that training is still very stable.

I'll approve the PR and run the tests.

Thanks,
Lev

@tjruwase tjruwase merged commit 8c551d2 into microsoft:master Nov 21, 2023
2 checks passed
@mosheisland mosheisland deleted the 16_filter_stage3_too_long_prompts branch November 22, 2023 07:52