Compute reduction workspace size with accumulation type (#944)
Summary:
Pull Request resolved: #944

The reduction ops that rely on the CUTLASS reduction back-end pre-compute the workspace size in the front-end. Previously, the workspace size was computed in terms of the input type. This, however, is not consistent with how CUTLASS computes the workspace size: see the `ElementCompute` [here](https://github.com/NVIDIA/cutlass/blob/ff02da266713bd3365aed65c552412e126c040cb/include/cutlass/reduction/device/tensor_reduce_affine_strided.h#L223), which is actually the accumulation type. As a result, when `float32` accumulation was used with a `float16` or `bfloat16` input type, the pre-computed workspace size was only half the required size.

In this diff, the workspace size pre-computation is changed to be done in terms of the accumulation type. Because the `use_fp16_acc` flag is set on the backend `Target`, if the target has not been set by the time the workspace size must be pre-computed, the `float32` accumulation type is used conservatively.

Reviewed By: ipiszy, chenyang78

Differential Revision: D50060329

fbshipit-source-id: 554d29a37cc9e15a72b2d1c36b781b12ec6313e3
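A minimal sketch of the idea described above (not the actual AITemplate code; the function name, the dtype-size table, and the way the accumulation flag is passed in are assumptions for illustration): the workspace size is derived from the accumulation dtype rather than the input dtype, and falls back to `float32` accumulation when the backend `Target` is not yet available.

```python
from typing import Optional

# Bytes per element for the dtypes mentioned in the commit message (assumed table).
DTYPE_SIZE = {"float16": 2, "bfloat16": 2, "float32": 4}


def reduction_workspace_bytes(
    num_workspace_elems: int,
    input_dtype: str,
    use_fp16_acc: Optional[bool],
) -> int:
    """Pre-compute a CUTLASS reduction workspace size in bytes.

    `use_fp16_acc` is None when the backend Target has not been set yet;
    in that case we conservatively assume float32 accumulation, matching
    CUTLASS's ElementCompute (the accumulation type).
    """
    if use_fp16_acc and input_dtype == "float16":
        acc_dtype = "float16"
    else:
        acc_dtype = "float32"  # conservative default when the target is unknown
    return num_workspace_elems * DTYPE_SIZE[acc_dtype]


# Example: with float16 input but float32 accumulation, the workspace is
# twice as large as a computation based on the input type would suggest.
assert reduction_workspace_bytes(1024, "float16", use_fp16_acc=None) == 4096
assert reduction_workspace_bytes(1024, "float16", use_fp16_acc=True) == 2048
```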