
[BUG] an illegal memory access was encountered in MOE-MLP(GroupGemm) #1301

Open

hgdhrt opened this issue Nov 26, 2024 · 0 comments

hgdhrt commented Nov 26, 2024

Describe the bug
When training an MoE GPT model, the MLP part (GroupGemm) fails with:

RuntimeError: CUDA error: an illegal memory access was encountered

The failure is non-deterministic: even with the same input data, the error is sometimes raised and sometimes not.
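
Because CUDA reports illegal memory accesses asynchronously, the kernel that actually faults may not be the one shown in the Python traceback, which can also make the failure look non-deterministic. A minimal sketch of how a run could be forced into synchronous launches to pin the error to the real call site (the checkpoint helper below is only an illustration, not part of Megatron-LM):

    import os

    # Must be set before the first CUDA context is created, i.e. before any
    # tensor touches the GPU, otherwise it has no effect.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch


    def sync_checkpoint(tag: str) -> None:
        """Force a device sync so any pending kernel error is raised here,
        at a known point, instead of at a later unrelated API call."""
        torch.cuda.synchronize()
        print(f"[sync ok] {tag}", flush=True)

With launch blocking enabled, the traceback should point at the kernel that actually faults, at the cost of slower training.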

To Reproduce
The failure is non-deterministic: with the same input data the error is sometimes raised and sometimes not, but it always appears at some point during training.
Expected behavior
Training completes without the illegal memory access; the root cause is identified and fixed.

Stack trace/logs
Bug:

     losses_reduced = forward_backward_func(
   File "/megatron/core/pipeline_parallel/schedules.py", line 1384, in forward_backward_pipelining_without_interleaving
     output_tensor, num_tokens = forward_step(
   File "/megatron/core/pipeline_parallel/schedules.py", line 219, in forward_step
     output_tensor, loss_func = forward_step_func(data_iterator, model)
   File "/pretrain_yuanvl.py", line 292, in forward_step
     output_tensor = model(tokens, position_ids, attention_mask,
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
     return forward_call(*args, **kwargs)
   File "/megatron/core/distributed/distributed_data_parallel.py", line 204, in forward
     return self.module(*inputs, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
     return forward_call(*args, **kwargs)
   File "/megatron/legacy/model/module.py", line 189, in forward
     outputs = self.module(*inputs, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
     return forward_call(*args, **kwargs)
   File "/megatron/core/models/gpt/gpt_model.py", line 314, in forward
     hidden_states = self.decoder(
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
     return forward_call(*args, **kwargs)
   File "/megatron/core/transformer/transformer_block.py", line 428, in forward
     hidden_states = self._checkpointed_forward(
   File "/megatron/core/transformer/transformer_block.py", line 316, in _checkpointed_forward
     hidden_states, context = checkpoint_handler(
   File "/megatron/core/transformer/transformer_block.py", line 299, in checkpoint_handler
     return tensor_parallel.checkpoint(
   File "/megatron/core/tensor_parallel/random.py", line 301, in checkpoint
     return CheckpointFunction.apply(function, distribute_saved_activations, *args)
   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 569, in apply
     return super().apply(*args, **kwargs)  # type: ignore[misc]
   File "/megatron/core/tensor_parallel/random.py", line 240, in forward
     outputs = run_function(*args)
   File "/megatron/core/transformer/transformer_block.py", line 270, in custom_forward
     hidden_states, context = layer(
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
     return forward_call(*args, **kwargs)
   File "/megatron/core/transformer/transformer_layer.py", line 259, in forward
     mlp_output_with_bias = self.mlp(pre_mlp_layernorm_output)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
     return forward_call(*args, **kwargs)
   File "/megatron/core/transformer/moe/moe_layer.py", line 153, in forward
     output, mlp_bias = custom_forward(hidden_states)
   File "/megatron/core/transformer/moe/moe_layer.py", line 143, in custom_forward
     (dispatched_input, tokens_per_expert) = self.token_dispatcher.token_permutation(
   File "/megatron/core/transformer/moe/token_dispatcher.py", line 471, in token_permutation
     tokens_per_expert = self.preprocess(indices)
   File "/megatron/core/transformer/moe/token_dispatcher.py", line 440, in preprocess
     self.global_input_tokens_local_experts_indices = torch.repeat_interleave(
 RuntimeError: CUDA error: an illegal memory access was encountered
 CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
 For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
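
Since kernel errors are reported asynchronously, the repeat_interleave frame at the bottom of the trace is where the error was noticed, not necessarily where the faulting kernel was launched; plausible earlier suspects are the permutation gather or the grouped GEMM itself. One way to tell corrupted router output apart from an earlier kernel fault is a host-side bounds check on the expert indices just before dispatch. A minimal sketch, assuming a tensor of top-k expert indices and the configured number of experts (the helper name and arguments are hypothetical, not the exact Megatron-LM variables):

    import torch

    def check_routing_indices(indices: torch.Tensor, num_experts: int) -> None:
        # Copying to the CPU synchronizes the stream, so any pending kernel
        # error from earlier work is raised at this line instead of later.
        idx = indices.detach().cpu()
        assert idx.numel() > 0, "empty routing indices"
        assert int(idx.min()) >= 0 and int(idx.max()) < num_experts, (
            f"expert index out of range: min={int(idx.min())}, "
            f"max={int(idx.max())}, num_experts={num_experts}"
        )

If the assert fires, the routing tensor itself is already corrupted (e.g. from non-finite router logits); if it passes but the illegal access still occurs later, the fault is more likely in a downstream kernel such as the grouped GEMM.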

Environment (please complete the following information):

  • Megatron-LM commit ID:
  • PyTorch version: 2.4.0
  • CUDA version: 12.4
  • NCCL version: 2.21.5-1

Proposed fix
None.

Additional context
None.
