Releases: HabanaAI/vllm-fork

v0.6.4.post2+Gaudi-1.19.0

19 Dec 13:34
f48f9a5

vLLM with Intel® Gaudi® AI Accelerators - Gaudi Software Suite 1.19.0

Requirements and Installation

Please follow the instructions provided in the Gaudi Installation Guide to set up the execution environment. To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform Guide.

Requirements

  • Ubuntu 22.04 LTS OS
  • Python 3.10
  • Intel Gaudi accelerator
  • Intel Gaudi software version 1.19.0 or above

Quick Start Using Dockerfile

Set up the container with the latest release of the Gaudi Software Suite using the Dockerfile:

$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env

Tip

If you encounter the error "docker: Error response from daemon: Unknown runtime specified habana.", refer to the "Install Optional Packages" section of Install Driver and Software and the "Configure Container Runtime" section of Docker Installation. Make sure that the habanalabs-container-runtime package is installed and that the habana container runtime is registered.

Build from Source

Environment Verification

To verify that the Intel Gaudi software was correctly installed, run the following:

$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed

Refer to System Verification and Final Tests for more details.
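
The package checks above can also be scripted. The following is a minimal sketch, assuming the package names listed in the comments of the commands above. The helper names `missing_packages` and `pip_listing` are hypothetical, and the check is a plain substring match on the listing text, so treat it as a convenience rather than a substitute for the official verification steps.

```python
import subprocess

# Package names taken from the verification comments above.
REQUIRED_PIP_PACKAGES = [
    "habana-torch-plugin",
    "habana-torch-dataloader",
    "habana-pyhlml",
    "habana-media-loader",
    "neural-compressor",
]


def missing_packages(listing, required):
    """Return the required package names that do not appear in `listing`.

    `listing` is the raw text printed by `pip list` or `apt list --installed`.
    Underscores and dashes are treated as equivalent, since either may be printed.
    """
    normalized = listing.lower().replace("_", "-")
    return [
        name for name in required
        if name.lower().replace("_", "-") not in normalized
    ]


def pip_listing():
    """Run `pip list` and return its output as text (hypothetical helper)."""
    result = subprocess.run(
        ["pip", "list"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `missing_packages(pip_listing(), REQUIRED_PIP_PACKAGES)` inside the target environment should return an empty list when all of the listed packages are present.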

Run Docker Image

It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.

Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:

$ docker pull vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
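
Since the versions in the image reference need to be updated per the Support Matrix, it can help to compose the commands from the two version numbers. The sketch below assumes the image path pattern visible in the commands above; `gaudi_image` and `docker_run_command` are hypothetical helper names, and the Support Matrix remains the source of truth for valid version pairs.

```python
import shlex


def gaudi_image(gaudi_version, pytorch_version, os_name="ubuntu22.04"):
    """Compose the image reference following the pattern used in the commands above."""
    return (
        f"vault.habana.ai/gaudi-docker/{gaudi_version}/{os_name}"
        f"/habanalabs/pytorch-installer-{pytorch_version}:latest"
    )


def docker_run_command(image):
    """Return the `docker run` command line from this section for the given image."""
    args = [
        "docker", "run", "-it", "--runtime=habana",
        "-e", "HABANA_VISIBLE_DEVICES=all",
        "-e", "OMPI_MCA_btl_vader_single_copy_mechanism=none",
        "--cap-add=sys_nice", "--net=host", "--ipc=host",
        image,
    ]
    # shlex.join quotes arguments only where the shell would require it.
    return shlex.join(args)
```

For example, `docker_run_command(gaudi_image("1.19.0", "2.5.1"))` reproduces the `docker run` line shown above.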

Build and Install vLLM

There are currently several ways to install vLLM with Intel® Gaudi®. Pick one of the following options:

1. Build and Install the stable version

vLLM releases are published periodically to align with Intel® Gaudi® software releases. The stable version is released with a tag and supports fully validated features and performance optimizations in Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:

$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.6.4.post2+Gaudi-1.19.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop

2. Build and Install the latest from vLLM-fork

Currently, the latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:

$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install -r requirements-hpu.txt
$ python setup.py develop

3. Build and Install from vLLM main source

If you prefer to build and install directly from the main vLLM source, to which new features are periodically upstreamed, run the following:

$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop

Supported Features

  • Offline batched inference: Offline inference using the LLM class from the vLLM Python API. References: Quickstart, Example.
  • Online inference via OpenAI-Compatible Server: Online inference using an HTTP server that implements the OpenAI Chat and Completions API. References: Documentation, Example.
  • HPU autodetection: HPU users do not need to specify the target platform; it is detected automatically upon vLLM startup. References: N/A.
  • Paged KV cache with algorithms enabled for Intel Gaudi accelerators: vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices. References: N/A.
  • Custom Intel Gaudi operator implementations: vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization and Rotary Positional Encoding. References: N/A.
  • Tensor parallel inference (single-node multi-HPU): vLLM HPU backend supports multi-HPU inference across a single node with tensor parallelism, using Ray and HCCL. References: Documentation, Example, HCCL reference.
  • Inference with HPU Graphs: vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs are recorded ahead of time and replayed during inference, significantly reducing host overheads. References: Documentation, vLLM HPU backend execution modes, Optimization guide.
  • Inference with torch.compile (experimental): vLLM HPU backend experimentally supports inference with torch.compile. References: vLLM HPU backend execution modes.
  • Attention with Linear Biases (ALiBi): vLLM HPU backend supports models utilizing Attention with Linear Biases (ALiBi), such as mpt-7b. References: vLLM supported models.
  • INC quantization: vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). References: Documentation.
  • LoRA/MultiLoRA support: vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. References: Documentation, Example, vLLM supported models.
  • Multi-step scheduling support: vLLM HPU backend includes multi-step scheduling support for host overhead reduction, configurable via the standard --num-scheduler-steps parameter. References: Feature RFC.
  • Automatic prefix caching (experimental): vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable via the standard --enable-prefix-caching parameter. References: Documentation, Details.
  • Speculative decoding (experimental): vLLM HPU backend includes experimental speculative decoding support for improving inter-token latency in some scenarios, configurable via the standard --speculative_model and --num_speculative_tokens parameters. References: Documentation, Example.
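
To illustrate the online inference feature listed above, here is a minimal sketch of a client request against the OpenAI-compatible Completions endpoint, using only the Python standard library. The base URL assumes a server listening locally on port 8000, the model name is a placeholder you must replace with the model the server was started with, and `build_completion_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request


def build_completion_request(prompt, model, base_url="http://localhost:8000", max_tokens=32):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `send(build_completion_request("Hello, my name is", "<your-model-name>"))` should return a JSON object whose `choices` list holds the generated text, as defined by the Completions API.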

Unsupported Features

  • Beam search
  • AWQ quantization
  • Prefill chunking (mixed-batch inferencing)

Supported Configurations

The following configurations have been validated to function on Gaudi2 devices. Configurations that are not listed may or may not work.


v0.5.3.post1+Gaudi-1.18.0

08 Oct 13:35
8492294

vLLM with Intel® Gaudi® AI Accelerators - Gaudi Software Suite 1.18.0

Requirements and Installation

Please follow the instructions provided in the Gaudi Installation Guide to set up the environment. To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform Guide.

Requirements

  • OS: Ubuntu 22.04 LTS
  • Python: 3.10
  • Intel Gaudi accelerator
  • Intel Gaudi software version 1.18.0

To verify that the Intel Gaudi software was correctly installed, run:

$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core and habanalabs-thunk are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed

Refer to Intel Gaudi Software Stack Verification for more details.

Run Docker Image

It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.

Use the following commands to run a Docker image:

$ docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest

Build and Install vLLM

There are currently several ways to install vLLM with Intel® Gaudi®. Pick one of the following options:

1. Build and Install the stable version

vLLM releases are published periodically to align with Intel® Gaudi® software releases. The stable version is released with a tag and supports fully validated features and performance optimizations in Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:

$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.5.3.post1+Gaudi-1.18.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop

2. Build and Install the latest from vLLM-fork

Currently, the latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:

$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install -r requirements-hpu.txt
$ python setup.py develop

3. Build and Install from vLLM main source

If you prefer to build and install directly from the main vLLM source, to which new features are periodically upstreamed, run the following:

$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop

Supported Features

  • Offline batched inference
  • Online inference via OpenAI-Compatible Server
  • HPU autodetection - no need to manually select device within vLLM
  • Paged KV cache with algorithms enabled for Intel Gaudi accelerators
  • Custom Intel Gaudi implementations of Paged Attention, KV cache ops, prefill attention, Root Mean Square Layer Normalization, Rotary Positional Encoding
  • Tensor parallelism support for multi-card inference
  • Inference with HPU Graphs for accelerating low-batch latency and throughput
  • Attention with Linear Biases (ALiBi)
  • LoRA adapters
  • Quantization with INC

Unsupported Features

  • Beam search
  • Prefill chunking (mixed-batch inferencing)

Supported Configurations

The following configurations have been validated to function on Gaudi2 devices. Configurations that are not listed may or may not work.

Performance Tuning

Execution modes

vLLM for HPU currently supports four execution modes, depending on the selected HPU PyTorch Bridge backend (set via the PT_HPU_LAZY_MODE environment variable) and the --enforce-eager flag.

PT_HPU_LAZY_MODE  enforce_eager  Execution mode
0                 0              torch.compile
0                 1              PyTorch eager mode
1                 0              HPU Graphs
1                 1              PyTorch lazy mode
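
The table above can be expressed as a small lookup. This is only an illustrative sketch: `execution_mode` and `execution_mode_from_env` are hypothetical helpers, not part of vLLM, and the sketch deliberately does not assume a default for PT_HPU_LAZY_MODE when the variable is unset.

```python
import os

# (PT_HPU_LAZY_MODE, enforce_eager) -> execution mode, as listed in the table above.
EXECUTION_MODES = {
    (0, False): "torch.compile",
    (0, True): "PyTorch eager mode",
    (1, False): "HPU Graphs",
    (1, True): "PyTorch lazy mode",
}


def execution_mode(lazy_mode, enforce_eager):
    """Look up the execution mode for a given backend and enforce_eager setting."""
    return EXECUTION_MODES[(int(lazy_mode), bool(enforce_eager))]


def execution_mode_from_env(enforce_eager, env=os.environ):
    """Same lookup, reading PT_HPU_LAZY_MODE from the environment."""
    value = env.get("PT_HPU_LAZY_MODE")
    if value is None:
        # No default is assumed here; set the variable explicitly.
        raise KeyError("PT_HPU_LAZY_MODE is not set")
    return execution_mode(int(value), enforce_eager)
```

For instance, `execution_mode(1, False)` yields "HPU Graphs", the combination recommended for best performance in this release.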

Warning

In 1.18.0, all modes utilizing PT_HPU_LAZY_MODE=0 are highly experimental and should only be used for validating functional correctness. Their performance will be improved in upcoming releases. For the best performance in 1.18.0, use HPU Graphs or PyTorch lazy mode.

Bucketing mechanism

Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. The Intel Gaudi Graph Compiler is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce noticeable overhead in end-to-end execution. In a dynamic inference serving scenario, it is therefore important to minimize the number of graph compilations and reduce the risk of graph compilation occurring during server runtime. This is currently achieved by "bucketing" the model's forward pass across two dimensions: batch_size and sequence_length.

Note

Bucketing significantly reduces the number of required graphs, but it does not handle graph compilation or device code generation; these are done in the warmup and HPUGraph capture phase.

Bucketing ranges are determined by three parameters: min, step and max. They can be set separately for the prompt and decode phases, and for the batch size and sequence length dimensions. These parameters can be observed in the logs during vLLM startup:

INFO 08-01 21:37:59 habana_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024]
INFO 08-01 21:37:59 habana_model_runner.py:499] Generated 24 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (...
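
One way to read the bucket list is that an incoming batch is padded up to the smallest bucket that fits it, so that only pre-generated shapes are executed. The sketch below illustrates that idea under stated assumptions: `round_up` and `find_bucket` are hypothetical helpers rather than vLLM functions, the sequence lengths follow the seq config in the log (128 to 1024 in steps of 128), and the batch sizes 1, 2 and 4 are an assumption that is consistent with the reported total of 24 buckets. Raising an error for shapes beyond the largest bucket is a choice made for this sketch only.

```python
from bisect import bisect_left

BATCH_SIZES = [1, 2, 4]                     # assumed batch-size buckets
SEQ_LENS = list(range(128, 1024 + 1, 128))  # 128, 256, ..., 1024


def round_up(value, sorted_buckets):
    """Return the smallest bucket boundary that is >= value."""
    idx = bisect_left(sorted_buckets, value)
    if idx == len(sorted_buckets):
        raise ValueError(f"{value} exceeds the largest bucket {sorted_buckets[-1]}")
    return sorted_buckets[idx]


def find_bucket(batch_size, seq_len):
    """Map an actual (batch_size, seq_len) to the padded bucket shape."""
    return round_up(batch_size, BATCH_SIZES), round_up(seq_len, SEQ_LENS)
```

Under these assumptions, a prompt batch of 3 sequences with a longest prompt of 200 tokens would be padded to the (4, 256) bucket.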

v0.6.2

30 Sep 08:37
7193774
Pre-release

Full Changelog: v0.5.4...v0.6.2

v0.5.3.post1-Gaudi-1.17.0

14 Aug 13:11
1e0e492

vLLM with Intel® Gaudi® AI Accelerators

This README provides instructions on running vLLM with Intel Gaudi devices.

Requirements and Installation

Please follow the instructions provided in the Gaudi Installation Guide to set up the environment. To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform Guide.

Requirements

  • OS: Ubuntu 22.04 LTS
  • Python: 3.10
  • Intel Gaudi accelerator
  • Intel Gaudi software version 1.17.0

To verify that the Intel Gaudi software was correctly installed, run:

$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core and habanalabs-thunk are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml, habana-media-loader and habana_quantization_toolkit are installed

Refer to Intel Gaudi Software Stack Verification for more details.

Run Docker Image

It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.

Use the following commands to run a Docker image:

$ docker pull vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest

Build and Install vLLM

There are currently several ways to install vLLM with Intel® Gaudi®. Pick one of the following options:

1. Build and Install the stable version

vLLM releases are published periodically to align with Intel® Gaudi® software releases. The stable version is released with a tag and supports fully validated features and performance optimizations in Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:

$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.5.3.post1-Gaudi-1.17.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop

2. Build and Install the latest from vLLM-fork

Currently, the latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:

$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install -r requirements-hpu.txt
$ python setup.py develop

Supported Features

  • Offline batched inference
  • Online inference via OpenAI-Compatible Server
  • HPU autodetection - no need to manually select device within vLLM
  • Paged KV cache with algorithms enabled for Intel Gaudi accelerators
  • Custom Intel Gaudi implementations of Paged Attention, KV cache ops, prefill attention, Root Mean Square Layer Normalization, Rotary Positional Encoding
  • Tensor parallelism support for multi-card inference
  • Inference with HPU Graphs for accelerating low-batch latency and throughput

Unsupported Features

  • Beam search
  • LoRA adapters
  • Attention with Linear Biases (ALiBi)
  • Quantization (AWQ, FP8 E5M2, FP8 E4M3)
  • Prefill chunking (mixed-batch inferencing)

Supported Configurations

The following configurations have been validated to function on Gaudi2 devices. Configurations that are not listed may or may not work.

Performance Tuning

Execution modes

vLLM for HPU currently supports four execution modes, depending on the selected HPU PyTorch Bridge backend (set via the PT_HPU_LAZY_MODE environment variable) and the --enforce-eager flag.

PT_HPU_LAZY_MODE  enforce_eager  Execution mode
0                 0              torch.compile
0                 1              PyTorch eager mode
1                 0              HPU Graphs
1                 1              PyTorch lazy mode

Warning

In 1.17.0, all modes utilizing PT_HPU_LAZY_MODE=0 are highly experimental and should only be used for validating functional correctness. Their performance will be improved in upcoming releases. For the best performance in 1.17.0, use HPU Graphs or PyTorch lazy mode.

Bucketing mechanism

Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. The Intel Gaudi Graph Compiler is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce noticeable overhead in end-to-end execution. In a dynamic inference serving scenario, it is therefore important to minimize the number of graph compilations and reduce the risk of graph compilation occurring during server runtime. This is currently achieved by "bucketing" the model's forward pass across two dimensions: batch_size and sequence_length.

Note

Bucketing significantly reduces the number of required graphs, but it does not handle graph compilation or device code generation; these are done in the warmup and HPUGraph capture phase.

Bucketing ranges are determined by three parameters: min, step and max. They can be set separately for the prompt and decode phases, and for the batch size and sequence length dimensions. These parameters can be observed in the logs during vLLM startup:

INFO 08-01 21:37:59 habana_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024]
INFO 08-01 21:37:59 habana_model_runner.py:499] Generated 24 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)]
INFO 08-01 21:37:59 habana_model_runner.py:504] Decode bucket config (min, step, max_warmup) bs:[1, 128, 4], seq:[128, 128, 2048]
INFO 08-01 21:37:59 habana_model_runner.py:509] Generated 48 decode buckets: [(1, 12...
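
The bucket counts in the log can be reproduced with a short sketch. It assumes that each range first ramps up from min by doubling while the value stays below step (and within max), and then continues in multiples of step up to max. That ramp-up rule is an assumption chosen because it matches the 24 prompt buckets and 48 decode buckets reported above; it is not confirmed by this text as the actual implementation. `bucket_range` and `generate_buckets` are hypothetical names.

```python
from itertools import product


def bucket_range(bmin, bstep, bmax):
    """Expand a (min, step, max) config into a list of bucket boundaries."""
    if bmin <= 0 or bstep <= 0:
        raise ValueError("min and step must be positive")
    values = []
    value = bmin
    # Assumed ramp-up phase: double from min while below step and within max.
    while value < bstep and value <= bmax:
        values.append(value)
        value *= 2
    # Stable phase: multiples of step, up to and including max.
    values.extend(range(bstep, bmax + 1, bstep))
    return values


def generate_buckets(bs_config, seq_config):
    """Return all (batch_size, seq_len) pairs for the two configs."""
    return list(product(bucket_range(*bs_config), bucket_range(*seq_config)))


prompt_buckets = generate_buckets((1, 32, 4), (128, 128, 1024))
decode_buckets = generate_buckets((1, 128, 4), (128, 128, 2048))
```

With the configs from the log, `prompt_buckets` starts at (1, 128) and ends at (4, 1024), in the same order as the logged list.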

v0.4.2-Gaudi-1.16.0

23 May 12:09

vLLM with Intel® Gaudi® 2 AI Accelerators

This README provides instructions on running vLLM with Intel Gaudi devices.

Requirements and Installation

Please follow the instructions provided in the Gaudi Installation Guide to set up the environment. To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform Guide.

Note

In this release (1.16.0), we are only targeting functionality and accuracy. Performance will be improved in upcoming releases.

Requirements

  • OS: Ubuntu 22.04 LTS
  • Python: 3.10
  • Intel Gaudi 2 accelerator
  • Intel Gaudi software version 1.16.0

To verify that the Intel Gaudi software was correctly installed, run:

$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core and habanalabs-thunk are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml, habana-media-loader and habana_quantization_toolkit are installed

Refer to Intel Gaudi Software Stack Verification for more details.

Run Docker Image

It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.

Use the following commands to run a Docker image:

$ docker pull vault.habana.ai/gaudi-docker/1.16.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.16.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest

Build and Install vLLM-fork

To build and install vLLM-fork from source, run:

$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.4.2-Gaudi-1.16.0
$ pip install -e .  # This may take 5-10 minutes.

Supported Features

  • Offline batched inference
  • Online inference via OpenAI-Compatible Server
  • HPU autodetection - no need to manually select device within vLLM
  • Paged KV cache with algorithms enabled for Intel Gaudi 2 accelerators
  • Custom Intel Gaudi implementations of Paged Attention, KV cache ops, prefill attention, Root Mean Square Layer Normalization, Rotary Positional Encoding
  • Tensor parallelism support for multi-card inference
  • Inference with HPU Graphs for accelerating low-batch latency and throughput

Unsupported Features

  • Beam search
  • LoRA adapters
  • Attention with Linear Biases (ALiBi)
  • Quantization (AWQ, FP8 E5M2, FP8 E4M3)
  • Prefill chunking (mixed-batch inferencing)

Supported Configurations

The following configurations have been validated to function on Gaudi devices. Configurations that are not listed may or may not work.

Performance Tips

  • We recommend running inference on Gaudi 2 with a block_size of 128 for the BF16 data type. Using the default values (16, 32) might lead to sub-optimal performance due to Matrix Multiplication Engine under-utilization (see Gaudi Architecture).
  • For maximum throughput on Llama 7B, we recommend running with a batch size of 128 or 256 and a max context length of 2048 with HPU Graphs enabled. If you encounter out-of-memory issues, see the troubleshooting section.
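
The tips above could translate into LLM constructor arguments roughly as follows. This is a sketch under assumptions: mapping "batch size" to `max_num_seqs` and "max context length" to `max_model_len` is an interpretation, not something stated in these notes, and `llm_kwargs_for_throughput` and `build_llm` are hypothetical helpers.

```python
def llm_kwargs_for_throughput(model, batch_size=128, max_context_len=2048, enforce_eager=False):
    """Collect keyword arguments that follow the performance tips above."""
    return {
        "model": model,
        "dtype": "bfloat16",               # BF16 data type
        "block_size": 128,                 # recommended instead of the default values
        "max_num_seqs": batch_size,        # assumed mapping for "batch size"
        "max_model_len": max_context_len,  # assumed mapping for "max context length"
        "enforce_eager": enforce_eager,    # False keeps HPU Graphs enabled
    }


def build_llm(**kwargs):
    """Create the LLM object; vLLM is imported lazily so the helper above works without it."""
    from vllm import LLM
    return LLM(**kwargs)
```

On a machine with vLLM installed, `build_llm(**llm_kwargs_for_throughput("<path-or-name-of-a-Llama-7B-checkpoint>"))` would construct the engine with these settings; the checkpoint name is a placeholder.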

Troubleshooting: Tweaking HPU Graphs

If you experience device out-of-memory issues or want to attempt inference at higher batch sizes, try tweaking HPU Graphs as follows:

  • Tweak the gpu_memory_utilization knob. Lowering it decreases the KV cache allocation, leaving some headroom for capturing graphs with a larger batch size. By default, gpu_memory_utilization is set to 0.9, which attempts to allocate ~90% of the HBM left after a short profiling run for the KV cache. Note that decreasing it reduces the number of KV cache blocks available, and therefore the effective maximum number of tokens you can handle at a given time.

  • If this method is not effective, you can disable HPU Graphs completely. With HPU Graphs disabled, you trade latency and throughput at lower batch sizes for potentially higher throughput at higher batch sizes. You can do that by adding the --enforce-eager flag to the server (for online inference), or by passing the enforce_eager=True argument to the LLM constructor (for offline inference).
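
To make the gpu_memory_utilization trade-off concrete, the following back-of-the-envelope sketch estimates how many KV cache blocks fit into a given budget. All numbers are hypothetical: the model shape (32 layers, 32 KV heads, head dimension 128) is a Llama-7B-like assumption, the 64 GiB of HBM left after profiling is made up for illustration, and the real allocator may account for memory differently.

```python
def bytes_per_block(num_layers, num_kv_heads, head_dim, block_size, dtype_bytes):
    """Size of one KV cache block: keys and values for block_size tokens in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * block_size * dtype_bytes


def kv_cache_blocks(free_hbm_bytes, gpu_memory_utilization, block_bytes):
    """Number of whole blocks that fit into the fraction of free HBM given to the KV cache."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    budget = int(free_hbm_bytes * gpu_memory_utilization)
    return budget // block_bytes


def max_cached_tokens(num_blocks, block_size=128):
    """Upper bound on the number of tokens the KV cache can hold at once."""
    return num_blocks * block_size


# Hypothetical example values (BF16 uses 2 bytes per element).
BLOCK_BYTES = bytes_per_block(num_layers=32, num_kv_heads=32, head_dim=128,
                              block_size=128, dtype_bytes=2)
FREE_HBM = 64 * 2**30  # assumed HBM left after the profiling run

blocks_default = kv_cache_blocks(FREE_HBM, 0.9, BLOCK_BYTES)
blocks_reduced = kv_cache_blocks(FREE_HBM, 0.8, BLOCK_BYTES)
```

With these made-up numbers, lowering the knob from 0.9 to 0.8 drops the block count from 921 to 819, which frees roughly 6.4 GiB for graph capture at the cost of about 13,000 cacheable tokens.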