Post-training quantization (PTQ)

What's This Example Folder About?

This example demonstrates how Model Optimizer applies post-training quantization (PTQ) to the LLM part of a VLM (vision-language model) and deploys the VLM with TensorRT and TensorRT-LLM.

Model Quantization and TensorRT-LLM Conversion

All-in-one Scripts for Quantization and Building

Please refer to llm_ptq/README.md for details on model quantization.

The following scripts provide an all-in-one, step-by-step model quantization example for the Llava, VILA, and Phi-3-vision models. The quantization format and the number of GPUs are supplied as inputs to these scripts. By default, the engine is built for the fp8 format on 1 GPU.

cd <this example folder>

For the Hugging Face models:

For Llava:

git clone https://huggingface.co/llava-hf/llava-1.5-7b-hf
scripts/huggingface_example.sh --type llava --model llava-1.5-7b-hf --quant [fp8|int8_sq|int4_awq|w4a8_awq] --tp [1|2|4|8]
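
For example, a concrete invocation using the defaults noted above (fp8 format, 1 GPU) would be:

scripts/huggingface_example.sh --type llava --model llava-1.5-7b-hf --quant fp8 --tp 1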

For VILA models like VILA1.5-3b:

git clone https://huggingface.co/Efficient-Large-Model/VILA1.5-3b vila1.5-3b
scripts/huggingface_example.sh --type vila --model vila1.5-3b --quant [fp8|int8_sq|int4_awq|w4a8_awq] --tp [1|2|4|8]

For Phi-3-vision:

git clone https://huggingface.co/microsoft/Phi-3-vision-128k-instruct
scripts/huggingface_example.sh --type phi --model Phi-3-vision-128k-instruct --quant [fp8|int8_sq|int4_awq|w4a8_awq]

Model Support List

| Model | type | fp8 | int8_sq | int4_awq | w4a8_awq¹ |
| --- | --- | --- | --- | --- | --- |
| Llava | llava | Yes | Yes | Yes | Yes |
| VILA | vila | Yes | Yes | Yes | Yes |
| Phi-3-vision | phi | Yes | Yes | Yes | Yes |

¹ w4a8_awq is an experimental quantization scheme that may incur a higher accuracy penalty. It is only available on SM90 GPUs.

The accuracy loss after PTQ varies with the model and the quantization method, and it is usually more significant when the base model is small. If the accuracy after PTQ does not meet your requirements, try modifying hf_ptq.py to disable KV cache quantization, or use QAT instead.
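
As a starting point, the snippet below sketches what disabling KV cache quantization could look like with ModelOpt's modelopt.torch.quantization API. The *output_quantizer wildcard key and the calibration loop are illustrative assumptions, so check hf_ptq.py for the exact configuration it builds.

import copy

import modelopt.torch.quantization as mtq

# Minimal sketch: run PTQ without KV cache quantization (assumed config keys).
quant_cfg = copy.deepcopy(mtq.FP8_DEFAULT_CFG)  # base fp8 PTQ recipe

# KV cache quantization is typically driven by the output quantizers of the
# attention projections; keeping them disabled leaves the KV cache in the
# original precision at some cost in memory/throughput. (assumed key)
quant_cfg["quant_cfg"]["*output_quantizer"] = {"enable": False}

calib_dataloader = ...  # your calibration batches (user-supplied)

def calibrate(model):
    # Run calibration data through the model so the remaining quantizers
    # can collect activation statistics.
    for batch in calib_dataloader:
        model(**batch)

model = ...  # the LLM part of the VLM, loaded with Hugging Face transformers
model = mtq.quantize(model, quant_cfg, forward_loop=calibrate)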