Enable context parallelism in SFT #190

slimfrkha · 2024-12-24T12:39:14Z

Solves #189.

In the fixed branch, the LM loss with CP 2 is very close to the LM loss with CP 1, which is not the case on the main branch.

CLAassistant · 2024-12-24T12:39:20Z

All committers have signed the CLA.

haolin-nju

Good first contribution! Code looks good to me in general. However, I can't find lm_loss on fixed_cp1. Could you please attach it to the description?

BTW, could you please provide MT-Bench results for supervised fine-tuning of the Llama2-7B base model with CP enabled and disabled? We believe that the MT-Bench results will help confirm the robustness of the PR. (You can find related instructions in https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/README.md. You can also refer to https://code.alibaba-inc.com/torchx/rlhf/blob/master/docs/en/tutorial/data.md for data preparation and preprocessing. Ideally, the MT-Bench results should align with those in https://github.com/alibaba/ChatLearn/blob/main/docs/en/tutorial/tutorial_llama2.md#evaluation.) If you have any other questions, please feel free to contact us. We are always glad to help ;)

haolin-nju · 2024-12-25T02:05:38Z

examples/megatron/entry/train_sft.py

@@ -81,6 +87,10 @@ def model_provider(pre_process=True, post_process=True):

 def get_batch(data_iterator):
    """Generate a batch"""
+
+    if (not mpu.is_pipeline_first_stage()) and (not mpu.is_pipeline_last_stage()):


This line of code can be simplified.
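The comment does not say which simplification is meant. One plausible reading, offered here only as an assumption, is to fold the two negations into one with De Morgan's law. The sketch below uses plain booleans in place of the `mpu.is_pipeline_first_stage()` and `mpu.is_pipeline_last_stage()` calls, and the function names are hypothetical:

```python
from itertools import product


def is_middle_stage_original(is_first: bool, is_last: bool) -> bool:
    # Form used in the diff: two separate negations joined by `and`.
    return (not is_first) and (not is_last)


def is_middle_stage_simplified(is_first: bool, is_last: bool) -> bool:
    # De Morgan's law: (not a) and (not b) == not (a or b).
    return not (is_first or is_last)


def forms_agree() -> bool:
    # Check all four combinations of the two flags.
    return all(
        is_middle_stage_original(a, b) == is_middle_stage_simplified(a, b)
        for a, b in product([False, True], repeat=2)
    )
```

Under that reading, the condition in the diff could be written as `if not (mpu.is_pipeline_first_stage() or mpu.is_pipeline_last_stage()):` without changing behavior.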

slimfrkha · 2024-12-25T10:50:24Z

Regarding fixed_cp1: it is there, but it is hidden behind buggy_cp1.

BTW, the code is roughly copied from pretrain_gpt in Megatron-LM.
There is nothing fancy about it.
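For readers unfamiliar with context parallelism, here is a minimal sketch of the kind of per-rank sequence slicing that a CP-aware `get_batch` needs. It is an illustration only, not the code in this PR. It assumes a load-balanced scheme in which the sequence is split into `2 * cp_size` chunks and rank `r` keeps chunk `r` together with chunk `2 * cp_size - 1 - r`. The name `slice_for_cp_rank` is hypothetical, and plain Python lists stand in for tensors:

```python
def slice_for_cp_rank(tokens, cp_size, cp_rank):
    """Return the part of `tokens` that one context-parallel rank would keep.

    Assumed scheme (illustrative): split into 2 * cp_size equal chunks and
    pair chunk `cp_rank` with its mirror chunk from the end of the sequence.
    """
    if cp_size <= 1:
        # Without context parallelism every rank sees the full sequence.
        return list(tokens)
    if not 0 <= cp_rank < cp_size:
        raise ValueError("cp_rank must be in [0, cp_size)")

    num_chunks = 2 * cp_size
    if len(tokens) % num_chunks != 0:
        raise ValueError("sequence length must be divisible by 2 * cp_size")

    chunk_len = len(tokens) // num_chunks
    chunks = [
        tokens[i * chunk_len:(i + 1) * chunk_len] for i in range(num_chunks)
    ]
    # Pairing an early chunk with a late one balances causal-attention work.
    return list(chunks[cp_rank]) + list(chunks[num_chunks - 1 - cp_rank])
```

With 8 tokens and `cp_size=2`, rank 0 would keep positions 0, 1, 6, 7 and rank 1 would keep positions 2, 3, 4, 5, so the two ranks together cover the whole sequence exactly once.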

haolin-nju · 2024-12-26T05:51:17Z

On one hand, we have to review the related license and ensure that everything is OK if the code is copied from another open-source repo. On the other hand, evaluating on MT-Bench is necessary (or at least one of the necessary TODOs) to guarantee performance and reproducibility in ChatLearn. Therefore, it will take some time for us to go through all the processes before merging this PR. We would appreciate it if you could provide the MT-Bench results for this PR, so that we can double-check them in our regression tests. Again, thanks a lot for the contribution!

fix(train_sft.py): enable cp in get_batch

ebdba19

style(train_sft.py): pylint

ab7c957

haolin-nju reviewed Dec 25, 2024
