Document / support for using BFLOAT16 with (Xeon) TGI service #330
Comments
Wikipedia has a nifty table listing the platforms that currently support AVX512 with BF16: Intel Cooper Lake & Sapphire Rapids, AMD Zen 4 & 5. On platforms that do not support BF16 (e.g. Ice Lake), TGI still seems to work when the BF16 type is specified, just slightly slower (due to a conversion step?).
We can add info in the docs to remind users to disable BF16 on specific machines.
Hi @eero-t, we pushed a PR that provides a recipe to label nodes and set up TGI with BFLOAT16, see #795
Example manifests are generated from the Infra project Helm charts. Shouldn't there rather be Helm support for enabling it? See:
That's on our plan |
We added BF16 in the Docker README.
The model used for ChatQnA supports BFLOAT16, in addition to TGI's default 32-bit float type: https://huggingface.co/Intel/neural-chat-7b-v3-3
TGI memory usage halves from 30 GB to 15 GB (and its performance also increases somewhat) if one tells it to use BFLOAT16:
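For illustration, the relevant part of a TGI container spec could look roughly like the sketch below. The image tag, model ID, and other arguments are placeholders; the point is only the `--dtype bfloat16` argument, which is TGI's way of selecting the BFLOAT16 weight type.

```yaml
# Hypothetical excerpt of a TGI Deployment pod template; names and the
# image tag are illustrative, only the --dtype flag matters here.
containers:
  - name: tgi
    image: ghcr.io/huggingface/text-generation-inference:latest
    args:
      - "--model-id"
      - "Intel/neural-chat-7b-v3-3"
      - "--dtype"
      - "bfloat16"   # load model weights as BFLOAT16 instead of 32-bit floats
```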
However, only newer Xeons support BFLOAT16. Therefore, if the user's cluster has heterogeneous nodes, the TGI service needs a node selector that schedules it on a node with BFLOAT16 support.
This can be automated by using node-feature-discovery and its CPU feature labeling: https://kubernetes-sigs.github.io/node-feature-discovery/stable/usage/features.html#cpu

It would be good to add some documentation and examples (e.g. comment lines in YAML) for this.
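As a rough sketch of what such an example could look like (assuming node-feature-discovery is deployed in the cluster and that its cpu-cpuid feature labels include AVX512BF16):

```yaml
# Hypothetical nodeSelector for the TGI Deployment's pod template.
# Assumes node-feature-discovery is running and exposes the AVX512BF16
# CPUID flag as a node label via its cpu-cpuid feature labeling.
spec:
  template:
    spec:
      nodeSelector:
        feature.node.kubernetes.io/cpu-cpuid.AVX512BF16: "true"
```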