Describe the bug
For MoE GPT with the GroupedGEMM MLP, training intermittently fails in the MLP path with:
RuntimeError: CUDA error: an illegal memory access was encountered
The failure is nondeterministic: with exactly the same input data, the error is sometimes raised and sometimes not.
To Reproduce
The failure is nondeterministic: the same input data sometimes triggers the error and sometimes does not, but training always crashes at some point.
Expected behavior
Training should run without the illegal memory access; the root cause should be identified and fixed.
Stack trace/logs
Bug:
    losses_reduced = forward_backward_func(
  File "/megatron/core/pipeline_parallel/schedules.py", line 1384, in forward_backward_pipelining_without_interleaving
    output_tensor, num_tokens = forward_step(
  File "/megatron/core/pipeline_parallel/schedules.py", line 219, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
  File "/pretrain_yuanvl.py", line 292, in forward_step
    output_tensor = model(tokens, position_ids, attention_mask,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/megatron/core/distributed/distributed_data_parallel.py", line 204, in forward
    return self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/megatron/legacy/model/module.py", line 189, in forward
    outputs = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/megatron/core/models/gpt/gpt_model.py", line 314, in forward
    hidden_states = self.decoder(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/megatron/core/transformer/transformer_block.py", line 428, in forward
    hidden_states = self._checkpointed_forward(
  File "/megatron/core/transformer/transformer_block.py", line 316, in _checkpointed_forward
    hidden_states, context = checkpoint_handler(
  File "/megatron/core/transformer/transformer_block.py", line 299, in checkpoint_handler
    return tensor_parallel.checkpoint(
  File "/megatron/core/tensor_parallel/random.py", line 301, in checkpoint
    return CheckpointFunction.apply(function, distribute_saved_activations, *args)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 569, in apply
    return super().apply(*args, **kwargs) # type: ignore[misc]
  File "/megatron/core/tensor_parallel/random.py", line 240, in forward
    outputs = run_function(*args)
  File "/megatron/core/transformer/transformer_block.py", line 270, in custom_forward
    hidden_states, context = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/megatron/core/transformer/transformer_layer.py", line 259, in forward
    mlp_output_with_bias = self.mlp(pre_mlp_layernorm_output)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/megatron/core/transformer/moe/moe_layer.py", line 153, in forward
    output, mlp_bias = custom_forward(hidden_states)
  File "/megatron/core/transformer/moe/moe_layer.py", line 143, in custom_forward
    (dispatched_input, tokens_per_expert) = self.token_dispatcher.token_permutation(
  File "/megatron/core/transformer/moe/token_dispatcher.py", line 471, in token_permutation
    tokens_per_expert = self.preprocess(indices)
  File "/megatron/core/transformer/moe/token_dispatcher.py", line 440, in preprocess
    self.global_input_tokens_local_experts_indices = torch.repeat_interleave(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
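The failure surfaces in torch.repeat_interleave inside the token dispatcher's preprocess, so one way to narrow it down is to rerun with CUDA_LAUNCH_BLOCKING=1 (so the asynchronous error is raised at the real call site) and to temporarily validate the routing tensors just before that call. The snippet below is only a debugging sketch, not a fix: check_dispatch_inputs is a hypothetical helper, and the indices / repeats / num_experts arguments stand in for whatever the corresponding local variables are named in this Megatron-core version.

import torch

def check_dispatch_inputs(indices, repeats, num_experts):
    # Hypothetical debugging helper: call it right before the failing
    # torch.repeat_interleave in token_dispatcher.preprocess.
    # Routing indices must stay inside [0, num_experts); out-of-range values
    # are a typical cause of illegal memory accesses in the permutation path.
    assert indices.min().item() >= 0, f"negative expert index {indices.min().item()}"
    assert indices.max().item() < num_experts, (
        f"expert index {indices.max().item()} >= num_experts {num_experts}"
    )
    # torch.repeat_interleave expects non-negative integer repeat counts.
    assert repeats.dtype in (torch.int32, torch.int64), f"unexpected repeats dtype {repeats.dtype}"
    assert repeats.min().item() >= 0, f"negative repeat count {repeats.min().item()}"
    # Synchronize so any error from an earlier kernel is reported here, not later.
    torch.cuda.synchronize()

If these checks pass and the crash still appears at the repeat_interleave call, the corruption more likely originates in an earlier kernel (for example in the GroupedGEMM MLP itself) and is only being reported here because CUDA errors are raised asynchronously.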
Environment (please complete the following information):
Megatron-LM commit ID
PyTorch version: 2.4.0
CUDA version: 12.4
NCCL version: 2.21.5-1
Proposed fix
None.
Additional context
None.