Quantization Benchmark on different model architectures -- particularly MHA #120

Open
YixuanSeanZhou opened this issue Jan 3, 2025 · 0 comments

Hi team,

I wonder whether it would be possible to release some benchmarks on the expected quantization speedup for different model architectures (beyond ResNet).

In particular, I am interested in using TensorRT Model Optimizer + TensorRT to quantize a model similar to ViT (mostly I am exploring quantizing MHA). However, after performing quantization with ONNX quantization (based on this issue), the acceleration I see is less than I expected (observed: FP16 0.13 ms, INT8 0.1 ms; I would expect roughly 0.13/2 ≈ 0.065 ms).

What is the expected speedup we should achieve compared to FP16?

I used the following code to generate the quantized ONNX graph for ViT.

import os

import modelopt.onnx.quantization as moq
import numpy as np

# Single random calibration batch (NCHW, matching the ViT-B/16 input shape).
calib_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

onnx_path = "vit_b_16.onnx"  # exported via torch.onnx.export

os.makedirs("onnx_quantized_vit", exist_ok=True)

# Insert Q/DQ nodes, calibrate, and write out the INT8-quantized graph.
moq.quantize(
    onnx_path=onnx_path,
    calibration_data=calib_data,
    output_path="onnx_quantized_vit/quant.onnx",
    quantize_mode="int8",
)
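
For reference, a minimal sketch of the export step mentioned in the code comment above, assuming the torchvision vit_b_16 model and mostly default torch.onnx.export arguments (the exact export call is not shown in the issue):

import torch
import torchvision

# Load the pretrained ViT-B/16 and switch to inference mode.
model = torchvision.models.vit_b_16(weights="IMAGENET1K_V1").eval()
dummy_input = torch.randn(1, 3, 224, 224)  # same shape as the calibration data

torch.onnx.export(
    model,
    dummy_input,
    "vit_b_16.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=17,  # assumed; any opset covering the ViT ops should work
)

With an explicitly quantized (Q/DQ) graph like the one produced above, the FP16 baseline and INT8 engines can then be built and timed with trtexec, e.g. trtexec --onnx=vit_b_16.onnx --fp16 versus trtexec --onnx=onnx_quantized_vit/quant.onnx --int8 --fp16 (assumed; the issue does not say how the latencies were measured).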
YixuanSeanZhou changed the title from "Quantization Benchmark on MHA" to "Quantization Benchmark on different model architectures -- particularly MHA" on Jan 3, 2025