Hi, All~
There are currently four ways to calculate attention (ref: https://pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend):
MATH (the original attention implementation)
FLASH_ATTENTION
EFFICIENT_ATTENTION
CUDNN_ATTENTION
At present, it looks like the flash-attention backend is used by default.
Is it possible to specify a different attention backend, such as EFFICIENT_ATTENTION? (See the sketch below for what I mean at the PyTorch level.)
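For context, this is a minimal sketch (plain PyTorch, not Megatron-specific code) of pinning scaled_dot_product_attention to one backend with torch.nn.attention.sdpa_kernel; tensor shapes are arbitrary placeholders:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Toy Q/K/V tensors: (batch, heads, seq_len, head_dim).
q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)

# Restrict scaled_dot_product_attention to the memory-efficient backend
# inside this context; outside it, PyTorch picks a backend automatically.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```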
My environment is as follows:
Driver version: 535.183.01
CUDA version: 12.4.0rc7+3.ge75c8a9.dirty
Python version: 3.10.12
PyTorch version: 2.3.0a0+6ddf5cf85e.nv24.4
Thanks!
You can refer to https://github.com/NVIDIA/TransformerEngine/blob/main/docs/examples/attention/attention.ipynb when using Transformer Engine; it shows how the original (unfused), flash, and cuDNN attention backends can be selected with environment variables.
Megatron doesn't use torch.nn.attention. If you're specifying --transformer-impl transformer_engine, you can set the env vars
export NVTE_DEBUG=1
export NVTE_DEBUG_LEVEL=2
to log which attention backend is selected.
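For example, a sketch based on the environment variables documented in the Transformer Engine attention notebook linked above (they must be set before Transformer Engine is imported, e.g. in the launch shell or at the very top of the training script):

```python
import os

# Log which attention backend Transformer Engine selects at runtime.
os.environ["NVTE_DEBUG"] = "1"
os.environ["NVTE_DEBUG_LEVEL"] = "2"   # 2 = verbose backend-selection details

# Assumption from the notebook: disabling flash attention makes TE fall back
# to the fused (cuDNN) attention backend; disabling both flash and fused
# attention falls back to the unfused framework-native implementation.
os.environ["NVTE_FLASH_ATTN"] = "0"
os.environ["NVTE_FUSED_ATTN"] = "1"
```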