This folder contains multiple demonstrations showcasing the integration of the vLLM engine with TorchServe, running inference with continuous batching. vLLM achieves high throughput using PagedAttention; more details can be found here.

The vLLM integration uses our new asynchronous worker communication mode, which decouples the communication between frontend and backend from running the actual inference. With this feature, TorchServe can feed incoming requests into the vLLM engine while the engine runs asynchronously in the backend. As long as at least one request is inside the engine, it will continue to run and asynchronously stream out results until the request is finished. New requests are added to the engine in a continuous fashion, similar to the continuous batching mode shown in other examples. For all examples, distributed inference can be enabled by following the instructions below.
- demo1: Meta-Llama3
- demo2: Mistral
- demo3: lora
- LLMEngine configuration: vLLM EngineArgs is defined in the `handler/vllm_engine_config` section of model-config.yaml (see the YAML sketch after this list).
- Sampling parameters for text generation: vLLM SamplingParams is defined in JSON format, for example in prompt.json (see the JSON sketch after this list).
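By way of illustration, a minimal model-config.yaml could look like the sketch below. The handler key names (`model_path`), the `asyncCommunication` flag, and the concrete EngineArgs values (`max_num_seqs`, `max_model_len`) are assumptions for this sketch; the authoritative versions live in the demo folders.

```yaml
# Sketch of a minimal model-config.yaml (values are illustrative assumptions)
# TorchServe frontend parameters
minWorkers: 1
maxWorkers: 1
asyncCommunication: true        # assumed flag enabling the asynchronous worker communication mode

handler:
    # local path to the downloaded model weights (placeholder)
    model_path: "model/Meta-Llama-3-8B-Instruct"
    vllm_engine_config:
        # vLLM EngineArgs passed through to the engine
        max_num_seqs: 16
        max_model_len: 250
```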
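Similarly, a request file carrying the SamplingParams might look like the following sketch; the exact field names accepted by the handler and the values shown here are assumptions, not copied from the shipped prompt.json.

```json
{
  "prompt": "Count from 1 to 10 in words:",
  "max_tokens": 100,
  "temperature": 0.8,
  "top_p": 0.95
}
```

Such a file can then be sent to TorchServe's predictions endpoint, for example with `curl -X POST -d @prompt.json http://localhost:8080/predictions/<model_name>`, where `<model_name>` is the name the model archive was registered under.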
All examples can easily be distributed over multiple GPUs by enabling tensor parallelism in vLLM. To enable distributed inference, the following additions need to be made to the model-config.yaml of the examples, where 4 is the desired number of GPUs to use for inference:
```yaml
# TorchServe frontend parameters
...
parallelType: "custom"
parallelLevel: 4

handler:
    ...
    vllm_engine_config:
        ...
        tensor_parallel_size: 4
```
While this example in theory works with multiple workers, it would distribute the incoming requests in a round-robin fashion, which might lead to suboptimal worker/hardware utilization. It is therefore advised to use only a single worker per engine and to utilize tensor parallelism to distribute the model over multiple GPUs, as described in the previous section. This results in better hardware utilization and inference performance.
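Expressed in model-config.yaml terms, that advice amounts to pinning the worker count to one while tensor parallelism spans the GPUs; the fragment below is an assumption-based illustration rather than a verbatim excerpt from the examples.

```yaml
# One worker per vLLM engine; scale across GPUs via tensor parallelism instead
minWorkers: 1
maxWorkers: 1
parallelType: "custom"
parallelLevel: 4

handler:
    vllm_engine_config:
        tensor_parallel_size: 4
```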