
Document / support for using BFLOAT16 with (Xeon) TGI service #330

Closed
eero-t opened this issue Jun 26, 2024 · 6 comments
Labels: aitce, documentation

eero-t (Contributor) commented Jun 26, 2024

The model used for ChatQnA supports BFLOAT16, in addition to TGI's default 32-bit float type: https://huggingface.co/Intel/neural-chat-7b-v3-3

TGI memory usage halves from 30 GB to 15 GB (and its performance also improves somewhat) if one tells it to use BFLOAT16:

--- a/ChatQnA/kubernetes/manifests/tgi_service.yaml
+++ b/ChatQnA/kubernetes/manifests/tgi_service.yaml
@@ -28,6 +29,8 @@ spec:
         args:
         - --model-id
         - $(LLM_MODEL_ID)
+        - --dtype
+        - bfloat16
         #- "/data/Llama-2-7b-hf"
         # - "/data/Mistral-7B-Instruct-v0.2"
         # - --quantize

However, only newer Xeons support BFLOAT16. Therefore, if the user's cluster has heterogeneous nodes, the TGI service needs a node selector that schedules it on a node with BFLOAT16 support.

This can be automated by using node-feature-discovery and its CPU feature labeling: https://kubernetes-sigs.github.io/node-feature-discovery/stable/usage/features.html#cpu

It would be good to add some documentation and examples (e.g. comment lines in YAML) for this.
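
For illustration, a minimal sketch of what such a node selector could look like in the TGI deployment's pod spec, assuming node-feature-discovery is deployed and publishing its default CPUID labels (the exact label key should be checked against the NFD documentation linked above):

spec:
  template:
    spec:
      # Only schedule TGI on nodes where NFD has detected AVX-512 BF16 support
      nodeSelector:
        feature.node.kubernetes.io/cpu-cpuid.AVX512BF16: "true"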

eero-t (Contributor, Author) commented Jun 28, 2024

Wikipedia has a nifty table listing the platforms currently supporting AVX512 with BF16 support:
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX-512

That is, Intel Cooper Lake & Sapphire Rapids, and AMD Zen 4 & 5.

On platforms that do not support BF16 (e.g. Ice Lake), TGI still seems to work when the BF16 type is specified, just slightly slower (due to a conversion step?).

kevinintel (Collaborator) commented

We can add info in the docs to remind users to disable BF16 on machines that do not support it.

lkk12014402 (Collaborator) commented


Hi @eero-t, the node-feature-discovery plugin can help select nodes (CPUs) by labeling nodes with their CPU features, but it needs to create a pod.

We pushed a PR that provides a recipe to label nodes and set up TGI with bfloat16, see #795
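
As a rough sketch (not necessarily what #795 does), the same scheduling constraint can also be expressed with node affinity; using a preferred rather than required rule would let TGI fall back to nodes without BF16 support, where it still runs, just slower:

spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          # Prefer (but do not require) nodes that NFD labeled with AVX-512 BF16 support
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: feature.node.kubernetes.io/cpu-cpuid.AVX512BF16
                operator: In
                values:
                - "true"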

eero-t (Contributor, Author) commented Sep 12, 2024

Example manifests are generated from the Infra project Helm charts. Shouldn't there rather be Helm support for enabling it? See:

lianhao (Collaborator) commented Sep 12, 2024


That's on our plan
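
As a purely illustrative sketch of what such Helm support might look like in the chart's values.yaml (the key names here are hypothetical, not the actual chart interface):

tgi:
  # Hypothetical values: pass the dtype flag and pin scheduling to BF16-capable nodes
  extraCmdArgs: ["--dtype", "bfloat16"]
  nodeSelector:
    feature.node.kubernetes.io/cpu-cpuid.AVX512BF16: "true"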

kevinintel (Collaborator) commented

We added BF16 instructions to the Docker README.
