[QUESTION] How to specify the implementation of Attention? #1313

Open
renyinCheng001 opened this issue Dec 6, 2024 · 2 comments

@renyinCheng001

Hi, All~

There are currently four ways to compute attention (ref: https://pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend):

  • MATH (the original attention implementation)

  • FLASH_ATTENTION

  • EFFICIENT_ATTENTION

  • CUDNN_ATTENTION

At present, the FLASH_ATTENTION backend seems to be used by default.

Is it possible to specify a different attention backend, such as EFFICIENT_ATTENTION?
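
For reference, this is the kind of control I mean, as a minimal sketch in plain PyTorch (not Megatron code), assuming torch >= 2.3 where torch.nn.attention.sdpa_kernel is available:

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Dummy half-precision tensors in [batch, heads, seq, head_dim] layout.
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

# Restrict scaled_dot_product_attention to the memory-efficient backend.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)

I'm looking for the equivalent control when training with Megatron.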

My environment is as follows:

Driver Version : 535.183.01
CUDA Version : 12.4.0rc7+3.ge75c8a9.dirty
Python version : 3.10.12
PyTorch version : 2.3.0a0+6ddf5cf85e.nv24.4

Thanks!

@LeoAtlanto

https://github.com/NVIDIA/TransformerEngine/blob/main/docs/examples/attention/attention.ipynb
You can refer to this Transformer Engine notebook, where the unfused (original), flash, and cuDNN attention backends can be selected via environment variables.
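
As a rough sketch of what that looks like in practice (the exact switch names are documented in the notebook; which backend to enable below is only an example), the variables can be set before any Transformer Engine attention module runs:

import os

# Assumed switches from the attention.ipynb notebook: "0" disables a backend,
# "1" (the default) allows it. Set these before Transformer Engine attention runs,
# or export them in the launch environment instead.
os.environ["NVTE_FLASH_ATTN"] = "0"  # turn off flash attention
os.environ["NVTE_FUSED_ATTN"] = "1"  # keep cuDNN fused attention available
# With both disabled, Transformer Engine falls back to the unfused attention path.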

@yaox12
Contributor

yaox12 commented Dec 16, 2024

Megatron doesn't use torch.nn.attention. If you're specifying --transformer-impl transformer_engine, you can set the env vars

export NVTE_DEBUG=1
export NVTE_DEBUG_LEVEL=2

to log which attention backend is selected.
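
To see that log outside a full Megatron run, a minimal sketch (assuming transformer_engine.pytorch.DotProductAttention behaves as in the notebook linked above; argument names may vary slightly across TE versions) would be:

import os

# Enable Transformer Engine's backend-selection logging before importing it.
os.environ["NVTE_DEBUG"] = "1"
os.environ["NVTE_DEBUG_LEVEL"] = "2"

import torch
from transformer_engine.pytorch import DotProductAttention

attn = DotProductAttention(num_attention_heads=16, kv_channels=64)

# Default "sbhd" layout: [seq, batch, heads, head_dim], half precision on GPU.
q = torch.randn(128, 2, 16, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(128, 2, 16, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(128, 2, 16, 64, device="cuda", dtype=torch.bfloat16)

out = attn(q, k, v)  # the chosen backend is reported in the debug log during this call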
