Hi team,
I wonder whether it would be possible to release some benchmarks on the expected quantization speedup for different model architectures (beyond ResNet).
In particular, I am interested in using TensorRT ModelOpt + TensorRT to quantize a model similar to ViT (mostly I am exploring quantizing MHA). However, after performing quantization with ONNX quantization (based on this issue), the acceleration I see is less than I expected: FP16 runs at 0.13 ms and INT8 at 0.10 ms, whereas I would expect roughly 0.13 / 2 ≈ 0.07 ms.
What speedup over FP16 should we expect?
I used the following code to generate the quantized ONNX graph for ViT:
import os

import numpy as np

import modelopt.onnx.quantization as moq

# Random data is used here only for illustration; real calibration should use
# representative samples from the model's input distribution.
calib_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

onnx_path = "vit_b_16.onnx"  # exported via torch.onnx.export
os.makedirs("onnx_quantized_vit", exist_ok=True)

moq.quantize(
    onnx_path=onnx_path,
    calibration_data=calib_data,
    output_path="onnx_quantized_vit/quant.onnx",
    quantize_mode="int8",
)
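For context, the FP16/INT8 latencies above were measured along these lines (a minimal sketch, assuming onnxruntime-gpu built with the TensorRT execution provider; the actual benchmarking harness may differ, e.g. trtexec):

import time

import numpy as np
import onnxruntime as ort

def benchmark(path, provider_options, n_iters=100):
    """Return mean latency in ms for one inference on the given ONNX model."""
    sess = ort.InferenceSession(
        path,
        providers=[("TensorrtExecutionProvider", provider_options),
                   "CUDAExecutionProvider"],
    )
    x = np.random.randn(1, 3, 224, 224).astype(np.float32)
    input_name = sess.get_inputs()[0].name
    # Warm up so TensorRT engine build time is excluded from the measurement.
    for _ in range(10):
        sess.run(None, {input_name: x})
    start = time.perf_counter()
    for _ in range(n_iters):
        sess.run(None, {input_name: x})
    return (time.perf_counter() - start) / n_iters * 1e3

fp16_ms = benchmark("vit_b_16.onnx", {"trt_fp16_enable": True})
# The Q/DQ graph produced by moq.quantize uses explicit quantization;
# INT8 must still be enabled on the TensorRT side.
int8_ms = benchmark("onnx_quantized_vit/quant.onnx",
                    {"trt_int8_enable": True, "trt_fp16_enable": True})
print(f"FP16: {fp16_ms:.3f} ms, INT8: {int8_ms:.3f} ms")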