Issues: NVIDIA/Megatron-LM
#1339 [QUESTION] How to convert the weight file format of the Mamba model from .pt to safetensors format? (opened Dec 26, 2024 by fxnie)
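As a side note on the question above: for a plain PyTorch checkpoint that is just a flat state dict of tensors, the conversion is usually a load-and-resave. The sketch below assumes exactly that and uses placeholder file names; Megatron's distributed checkpoints (optimizer state, model-parallel shards) would need to be merged first, which this does not cover.

```python
import torch
from safetensors.torch import save_file

# Hypothetical input/output paths; adjust to the actual checkpoint.
ckpt = torch.load("model_weights.pt", map_location="cpu")

# Many .pt checkpoints wrap the weights, e.g. under a "model" key;
# unwrap until we have a flat {name: tensor} mapping.
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

# safetensors requires contiguous, non-shared storage, so clone each
# tensor into its own buffer and keep only actual tensors.
tensors = {
    name: t.contiguous().clone()
    for name, t in state_dict.items()
    if isinstance(t, torch.Tensor)
}

save_file(tensors, "model_weights.safetensors")
```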
#1333 [QUESTION] How can I load a checkpoint trained by Megatron-LM 0.5 into Megatron-LM 0.7 to resume pretraining? (opened Dec 22, 2024 by IgorZan)
#1330 [BUG] MoE load balancing loss is accumulated twice when using activation checkpointing (opened Dec 20, 2024 by thuwzt)
#1329 [BUG] Megatron-LM with torch.compile: "The provided qkv memory layout is not supported!" (opened Dec 20, 2024 by qingshanxwx)
#1328 [QUESTION] Why doesn't GPTDataset build a global shuffle index? (opened Dec 20, 2024 by dynamicheart)
#1327 [BUG] Precision issue caused by different token dispatchers in MoE training (opened Dec 17, 2024 by qi7kuo)
#1322 [BUG] FSDP requires the torch optimizer, not transformer_engine or apex (opened Dec 15, 2024 by prrathi)
#1315 [QUESTION] Does Megatron support tracing computation graphs with torch.fx? (opened Dec 7, 2024 by fy-j)
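Independent of whether Megatron's own models trace cleanly, the general torch.fx workflow the question refers to looks roughly like this on a toy module (the module here is invented for illustration; Megatron's real models contain dynamic control flow and collectives that may not trace as easily):

```python
import torch
import torch.fx

# A toy module standing in for a model.
class ToyBlock(torch.nn.Module):
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.linear = torch.nn.Linear(hidden, hidden)

    def forward(self, x):
        return torch.relu(self.linear(x))

# symbolic_trace records the ops into a GraphModule whose graph
# can be inspected or transformed.
traced = torch.fx.symbolic_trace(ToyBlock())
print(traced.graph)

# The traced GraphModule is still callable like the original module.
out = traced(torch.randn(2, 16))
print(out.shape)
```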
#1314 [BUG] When using LLaVA with freeze-LM, training on text-only samples raises an error (opened Dec 6, 2024 by liveseongho)
#1313 [QUESTION] How to specify the implementation of Attention? (opened Dec 6, 2024 by renyinCheng001)
#1311 [QUESTION] UnboundLocalError: local variable 'output_tensor' referenced before assignment (opened Dec 5, 2024 by zmtttt)
#1304 [BUG] Problem splitting transformer layers when they cannot be evenly divided across pipeline-parallel stages (opened Nov 27, 2024 by Baibaifan)
#1303 [QUESTION] How to split the transformer layers when the pipeline is uneven? (opened Nov 27, 2024 by renyinCheng001)
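The two issues above concern how layers are divided when the layer count is not a multiple of the pipeline-parallel size. Setting Megatron's own behavior aside, one common way to reason about an uneven split is to give the first (num_layers % pp_size) stages one extra layer; the helper below is only an illustrative sketch of that policy, not Megatron's implementation:

```python
def split_layers(num_layers: int, pp_size: int) -> list[int]:
    """Distribute num_layers across pp_size pipeline stages as evenly as
    possible; earlier stages absorb the remainder (one possible policy)."""
    base, remainder = divmod(num_layers, pp_size)
    return [base + (1 if stage < remainder else 0) for stage in range(pp_size)]

# Example: 30 transformer layers on 4 pipeline stages -> [8, 8, 7, 7]
print(split_layers(30, 4))
```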
#1302 [QUESTION] Why is the initialization of the router and experts different in the MoE part? (opened Nov 27, 2024 by mxymxy77)
#1301 [BUG] An illegal memory access was encountered in the MoE MLP (GroupGemm) (opened Nov 26, 2024 by hgdhrt)