
[BUG] an illegal memory access was encountered in MOE-MLP(GroupGemm) #1301

Open

hgdhrt opened this issue Nov 26, 2024 · 0 comments

hgdhrt commented Nov 26, 2024

Describe the bug
When training an MoE GPT model, the MLP part (GroupGemm) fails with:

RuntimeError: CUDA error: an illegal memory access was encountered

The failure is non-deterministic: even with the same input data, the error is sometimes raised and sometimes not.
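
Because CUDA reports illegal memory accesses asynchronously, the kernel that actually faults may not be the one shown in the Python traceback, which can also make the failure look non-deterministic. A minimal sketch of how a run could be forced into synchronous launches to pin the error to the real call site (the checkpoint helper below is only an illustration, not part of Megatron-LM):

    import os

    # Must be set before the first CUDA context is created, i.e. before any
    # tensor touches the GPU, otherwise it has no effect.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch


    def sync_checkpoint(tag: str) -> None:
        """Force a device sync so any pending kernel error is raised here,
        at a known point, instead of at a later unrelated API call."""
        torch.cuda.synchronize()
        print(f"[sync ok] {tag}", flush=True)

With launch blocking enabled, the traceback should point at the kernel that actually faults, at the cost of slower training.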

To Reproduce
The failure is non-deterministic: with the same input data the error is sometimes raised and sometimes not, but it always appears at some point during training.
Expected behavior
Training completes without the illegal memory access; the root cause is identified and fixed.

Stack trace/logs
Bug:

     losses_reduced = forward_backward_func(
   File "/megatron/core/pipeline_parallel/schedules.py", line 1384, in forward_backward_pipelining_without_interleaving
     output_tensor, num_tokens = forward_step(
   File "/megatron/core/pipeline_parallel/schedules.py", line 219, in forward_step
     output_tensor, loss_func = forward_step_func(data_iterator, model)
   File "/pretrain_yuanvl.py", line 292, in forward_step
     output_tensor = model(tokens, position_ids, attention_mask,
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
     return forward_call(*args, **kwargs)
   File "/megatron/core/distributed/distributed_data_parallel.py", line 204, in forward
     return self.module(*inputs, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
     return forward_call(*args, **kwargs)
   File "/megatron/legacy/model/module.py", line 189, in forward
     outputs = self.module(*inputs, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
     return forward_call(*args, **kwargs)
   File "/megatron/core/models/gpt/gpt_model.py", line 314, in forward
     hidden_states = self.decoder(
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
     return forward_call(*args, **kwargs)
   File "/megatron/core/transformer/transformer_block.py", line 428, in forward
     hidden_states = self._checkpointed_forward(
   File "/megatron/core/transformer/transformer_block.py", line 316, in _checkpointed_forward
     hidden_states, context = checkpoint_handler(
   File "/megatron/core/transformer/transformer_block.py", line 299, in checkpoint_handler
     return tensor_parallel.checkpoint(
   File "/megatron/core/tensor_parallel/random.py", line 301, in checkpoint
     return CheckpointFunction.apply(function, distribute_saved_activations, *args)
   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 569, in apply
     return super().apply(*args, **kwargs)  # type: ignore[misc]
   File "/megatron/core/tensor_parallel/random.py", line 240, in forward
     outputs = run_function(*args)
   File "/megatron/core/transformer/transformer_block.py", line 270, in custom_forward
     hidden_states, context = layer(
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
     return forward_call(*args, **kwargs)
   File "/megatron/core/transformer/transformer_layer.py", line 259, in forward
     mlp_output_with_bias = self.mlp(pre_mlp_layernorm_output)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
     return forward_call(*args, **kwargs)
   File "/megatron/core/transformer/moe/moe_layer.py", line 153, in forward
     output, mlp_bias = custom_forward(hidden_states)
   File "/megatron/core/transformer/moe/moe_layer.py", line 143, in custom_forward
     (dispatched_input, tokens_per_expert) = self.token_dispatcher.token_permutation(
   File "/megatron/core/transformer/moe/token_dispatcher.py", line 471, in token_permutation
     tokens_per_expert = self.preprocess(indices)
   File "/megatron/core/transformer/moe/token_dispatcher.py", line 440, in preprocess
     self.global_input_tokens_local_experts_indices = torch.repeat_interleave(
 RuntimeError: CUDA error: an illegal memory access was encountered
 CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
 For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
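
Since kernel errors are reported asynchronously, the repeat_interleave frame at the bottom of the trace is where the error was noticed, not necessarily where the faulting kernel was launched; plausible earlier suspects are the permutation gather or the grouped GEMM itself. One way to tell corrupted router output apart from an earlier kernel fault is a host-side bounds check on the expert indices just before dispatch. A minimal sketch, assuming a tensor of top-k expert indices and the configured number of experts (the helper name and arguments are hypothetical, not the exact Megatron-LM variables):

    import torch

    def check_routing_indices(indices: torch.Tensor, num_experts: int) -> None:
        # Copying to the CPU synchronizes the stream, so any pending kernel
        # error from earlier work is raised at this line instead of later.
        idx = indices.detach().cpu()
        assert idx.numel() > 0, "empty routing indices"
        assert int(idx.min()) >= 0 and int(idx.max()) < num_experts, (
            f"expert index out of range: min={int(idx.min())}, "
            f"max={int(idx.max())}, num_experts={num_experts}"
        )

If the assert fires, the routing tensor itself is already corrupted (e.g. from non-finite router logits); if it passes but the illegal access still occurs later, the fault is more likely in a downstream kernel such as the grouped GEMM.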

Environment (please complete the following information):

  • Megatron-LM commit ID:
  • PyTorch version: 2.4.0
  • CUDA version: 12.4
  • NCCL version: 2.21.5-1

Proposed fix
None.

Additional context
None.
