[QUESTION] How to specify the implementation of Attention? #1313

Open
renyinCheng001 opened this issue Dec 6, 2024 · 2 comments

@renyinCheng001

Hi, All~

There are currently four ways to compute attention (ref: https://pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend):

  • MATH (the original attention implementation)

  • FLASH_ATTENTION

  • EFFICIENT_ATTENTION

  • CUDNN_ATTENTION

At present, the FLASH_ATTENTION backend seems to be used by default.

Is it possible to specify a different attention backend, such as EFFICIENT_ATTENTION?
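
For reference, this is the kind of control I mean, as a minimal sketch in plain PyTorch (not Megatron code), assuming torch >= 2.3 where torch.nn.attention.sdpa_kernel is available:

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Dummy half-precision tensors in [batch, heads, seq, head_dim] layout.
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

# Restrict scaled_dot_product_attention to the memory-efficient backend.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)

I'm looking for the equivalent control when training with Megatron.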

My environment is as follows:

Driver Version : 535.183.01
CUDA Version : 12.4.0rc7+3.ge75c8a9.dirty
Python version : 3.10.12
PyTorch version : 2.3.0a0+6ddf5cf85e.nv24.4

Thanks!

@LeoAtlanto

https://github.com/NVIDIA/TransformerEngine/blob/main/docs/examples/attention/attention.ipynb
You can refer to this Transformer Engine notebook, where the unfused (original), flash, and cuDNN attention backends can be selected via environment variables.
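
As a rough sketch of what that looks like in practice (the exact switch names are documented in the notebook; which backend to enable below is only an example), the variables can be set before any Transformer Engine attention module runs:

import os

# Assumed switches from the attention.ipynb notebook: "0" disables a backend,
# "1" (the default) allows it. Set these before Transformer Engine attention runs,
# or export them in the launch environment instead.
os.environ["NVTE_FLASH_ATTN"] = "0"  # turn off flash attention
os.environ["NVTE_FUSED_ATTN"] = "1"  # keep cuDNN fused attention available
# With both disabled, Transformer Engine falls back to the unfused attention path.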

@yaox12
Contributor

yaox12 commented Dec 16, 2024

Megatron doesn't use torch.nn.attention. If you're specifying --transformer-impl transformer_engine, you can set the env vars

export NVTE_DEBUG=1
export NVTE_DEBUG_LEVEL=2

to log which attention backend is selected.
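
To see that log outside a full Megatron run, a minimal sketch (assuming transformer_engine.pytorch.DotProductAttention behaves as in the notebook linked above; argument names may vary slightly across TE versions) would be:

import os

# Enable Transformer Engine's backend-selection logging before importing it.
os.environ["NVTE_DEBUG"] = "1"
os.environ["NVTE_DEBUG_LEVEL"] = "2"

import torch
from transformer_engine.pytorch import DotProductAttention

attn = DotProductAttention(num_attention_heads=16, kv_channels=64)

# Default "sbhd" layout: [seq, batch, heads, head_dim], half precision on GPU.
q = torch.randn(128, 2, 16, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(128, 2, 16, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(128, 2, 16, 64, device="cuda", dtype=torch.bfloat16)

out = attn(q, k, v)  # the chosen backend is reported in the debug log during this call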
