Quantization Benchmark on different model architectures -- particularly MHA #120

Open
YixuanSeanZhou opened this issue Jan 3, 2025 · 0 comments

Hi team,

I wonder whether it would be possible to release some benchmarks on the expected quantization speedup for different model architectures (beyond ResNet).

In particular, I am interested in using TensorRT Model Optimizer + TensorRT to quantize a model similar to ViT (mostly I am exploring quantizing MHA). However, after performing quantization with ONNX quantization (based on this issue), the acceleration I see is less than I expected (observed: FP16 0.13 ms, INT8 0.1 ms; I would expect roughly 0.13/2 ≈ 0.065 ms).

What is the expected speedup we should achieve compared to FP16?

I used the following code to generate the quantized ONNX graph for ViT.

import os

import modelopt.onnx.quantization as moq
import numpy as np

# Single random calibration batch (NCHW, matching the ViT-B/16 input shape).
calib_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

onnx_path = "vit_b_16.onnx"  # exported via torch.onnx.export

os.makedirs("onnx_quantized_vit", exist_ok=True)

# Insert Q/DQ nodes, calibrate, and write out the INT8-quantized graph.
moq.quantize(
    onnx_path=onnx_path,
    calibration_data=calib_data,
    output_path="onnx_quantized_vit/quant.onnx",
    quantize_mode="int8",
)
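
For reference, a minimal sketch of the export step mentioned in the code comment above, assuming the torchvision vit_b_16 model and mostly default torch.onnx.export arguments (the exact export call is not shown in the issue):

import torch
import torchvision

# Load the pretrained ViT-B/16 and switch to inference mode.
model = torchvision.models.vit_b_16(weights="IMAGENET1K_V1").eval()
dummy_input = torch.randn(1, 3, 224, 224)  # same shape as the calibration data

torch.onnx.export(
    model,
    dummy_input,
    "vit_b_16.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=17,  # assumed; any opset covering the ViT ops should work
)

With an explicitly quantized (Q/DQ) graph like the one produced above, the FP16 baseline and INT8 engines can then be built and timed with trtexec, e.g. trtexec --onnx=vit_b_16.onnx --fp16 versus trtexec --onnx=onnx_quantized_vit/quant.onnx --int8 --fp16 (assumed; the issue does not say how the latencies were measured).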
YixuanSeanZhou changed the title from "Quantization Benchmark on MHA" to "Quantization Benchmark on different model architectures -- particularly MHA" on Jan 3, 2025