Document / support for using BFLOAT16 with (Xeon) TGI service #330
Comments
Wikipedia has a nifty table listing the platforms that currently support AVX512 with BF16: Intel Cooper Lake & Sapphire Rapids, AMD Zen 4 & 5. On platforms that do not support BF16 (e.g. Ice Lake), TGI still seems to work when the BF16 type is specified, just slightly slower (due to a conversion step?).
We can add info in the docs to remind users to disable BF16 on specific machines.
Hi @eero-t, we pushed a PR that provides a recipe to label nodes and set up TGI with BFLOAT16, see #795
Example manifests are generated from the Infra project Helm charts. Shouldn't there rather be Helm support for enabling it? See:
That's on our plan |
We added BF16 in the Docker README.
The model used for ChatQnA supports BFLOAT16, in addition to TGI's default 32-bit float type: https://huggingface.co/Intel/neural-chat-7b-v3-3
TGI memory usage halves from 30 GB to 15 GB (and its performance also increases somewhat) if one tells it to use BFLOAT16:
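For illustration, the relevant part of a TGI container spec could look roughly like the sketch below. The image tag, model ID, and other arguments are placeholders; the point is only the `--dtype bfloat16` argument, which is TGI's way of selecting the BFLOAT16 weight type.

```yaml
# Hypothetical excerpt of a TGI Deployment pod template; names and the
# image tag are illustrative, only the --dtype flag matters here.
containers:
  - name: tgi
    image: ghcr.io/huggingface/text-generation-inference:latest
    args:
      - "--model-id"
      - "Intel/neural-chat-7b-v3-3"
      - "--dtype"
      - "bfloat16"   # load model weights as BFLOAT16 instead of 32-bit floats
```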
However, only newer Xeons support BFLOAT16. Therefore, if the user's cluster has heterogeneous nodes, the TGI service needs a node selector that schedules it on a node with BFLOAT16 support.
This can be automated by using node-feature-discovery and its CPU feature labeling: https://kubernetes-sigs.github.io/node-feature-discovery/stable/usage/features.html#cpu

It would be good to add some documentation and examples (e.g. comment lines in YAML) for this.
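As a rough sketch of what such an example could look like (assuming node-feature-discovery is deployed in the cluster and that its cpu-cpuid feature labels include AVX512BF16):

```yaml
# Hypothetical nodeSelector for the TGI Deployment's pod template.
# Assumes node-feature-discovery is running and exposes the AVX512BF16
# CPUID flag as a node label via its cpu-cpuid feature labeling.
spec:
  template:
    spec:
      nodeSelector:
        feature.node.kubernetes.io/cpu-cpuid.AVX512BF16: "true"
```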