-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HPA improvements #386
HPA improvements #386
Conversation
1989441
to
3b451ba
Compare
Force pushed pre-commit CI fixes (as there were no review comments yet). |
If |
Yes, that would be a good idea. |
518b519
to
d096521
Compare
As to the generated manifests: https://github.com/opea-project/GenAIInfra/tree/main/microservices-connector/config/HPA/ I think those should be thrown away rather than updated. Using HPA with GMC when inferencing is done on CPU does not make much sense. GMC allows e.g. changing model at run-time, but when scaling the CPU services, it means that deployment resource requests etc should be updated accordingly (based on used model, data type used with it, TGI/TEI versions and underlying node resources), and I don't think GMC has any support for that. (Using HPA with GMC for Gaudi accelerated services would be a bit different matter, but current HPA rules are just for CPU charts.) |
@eero-t To make HPA support in GMC is the target and @ichbinblau is working on this already. GMC allows to change the model dynamically but it will also support static pipiline. |
Not having used Helm before, I had not realized that This got me thinking... If settings needed for scaling (resource requests corresponding to specified model, probe timings etc) were in separate files, user could enable HPA without needing to modify Enabling HPA would then look something like this, for CPU:
And for Gaudi:
Thoughts? |
Note: above does not fix the issue for (CPU) scaling, of scaled pods' resource requests needing to match their resource usage with the LLM model specified for them => overriding the model with a global Helm value, or by GMC, is not really a good idea as then there's a mismatch between those. When Gaudi or GPU is used, that's less of a issue, as long as single device per Pod is OK. If larger models would require multiple devices, or smaller models would require (for better utilization / cost reasons) sharing GPUs between pods, then changing models would also be bad idea, if resource requests are not changed accordingly. |
In helm chart, Since in component helm charts(those charts under common directory, HPA is disabled by default, i.e. in their So |
If user does NOT use Btw. Theresa just mentioned that Helm refuses to install PrometheusAdapter custom metrics config (for ChatQnA) over configMap that has been installed by another Helm chart, the Prometheus one... (I hadn't noticed this because I use Helm to generate manifest, and then apply that after reviewing the results, I don't install things directly with Helm. Also my Prometheus install comes from => I'm not sure how to best handle that for 1.0 milestone. I could test doing all installs directly with Helm and find some way to workaround any issues, but that will take at least few days. Another option is to document what I've actually tested ie. use Helm just to produce manifests, and installing those...
Because suitable subcomponent resource requests & probe timings differ between Gaudi and CPU (besides depending on model), and are needed for scaling to work in general (not just for HPA), I think they should be in separate file from generic HPA setup. I.e. top-level chart would have
Model should also be set in top-level chart device values file, not |
FYI: I filed Prometheus-operator + Prometheus-adapter tickets on getting proper support for updating Prometheus-adapter, needed to properly fix custom metrics installation:
That would allow reasonable automation for custom metrics, so I'd appreciate thumb-ups in the tickets. :-) |
656aee0
to
cba8b18
Compare
I split HPA & CPU values to their own values files, and fixed merge conflict with PS. In some later PR, extra variables can be added for HPA rules and TGI command line options, so that they match underlying HW better. (Gaudi TGI can start much faster, so faster HPA scaling values could be used for Gaudi. There are Gaudi specific options that can provide significantly faster throughput for it, and I really hope there being also TGI options that can reduce its excruciatingly slow startup with CPU warmup. Newer CPUs could also use bfloat16 instead of float32, which would improve TGI perf a bit in addition to halving its memory usage.) |
Interestingly CI automatically started running tests for the new values files:
However, There was also unrelated (Gaudi guardrails) test failure due to: |
The top-leve values.yaml is meant for cpu cases without HPA. We want users to be able to install helm chart without applying addtional value files on any standard xeon servers if they want to have a try. How about we do the followings:
As for the PrometheusAdapter custom metrics config, maybe we should have it documented here before prometheus-operator/kube-prometheus#2504 is resolved |
Please contact @daisy-ycguo about the CI to skip some value files in CI. We might need to come up a unified way of how to skip some cases, now we have nv-values.yaml to be skipped, and I expect more to come. |
This looks sane to me. |
cba8b18
to
b43e058
Compare
As CPU values being in separate top-level chart (instead of subcharts) was fine, I dropped the resource & probe timing updates to |
I thought it was intended for Gaudi because it worked so badly on Xeon... Without suitable probe timings and resources, out-of-the-box Xeon experience is rather awful, if user happens to run multiple different services, or does However, according to comments in TGI Therefore I put CPU timings + resource values to a separate Different CPU resource requests are needed for different models and different data types. I.e. if things should work out of the box, there need to be file per set of models, and data types specified for TGI & TEI, which in future could look something like this:
Note that because it's not possible to share Gaudi, case for that is simpler. One needs different resource spec only for models that do not fit into single Gaudi, but need to be sharded over several of them. If model needs to be split, only then CPU side resource usage can be relevant (due to CPU side DeepSpeed overhead).
Extra file is needed for Gaudi and Nvidia, so it would be consistent to have one also for CPU.
Whether one uses Gaudis or CPUs is not a relevant distinction for HPA enabling and replica limits. If cluster has enough of Gaudis or CPUs, the same values can be used. |
Issues raised by Theresa when testing this PR, which must be fixed before merging:
Other feedback from internal reviews:
Note: support for ChatQnA instances being in separate namespaces will definitely come only in some future PR. Fiddling with Prometheus RBAC rules is clearly out of scope for this one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems fine to me but we need to find out a way to unifying the naming of those helm value yaml files to meet the following scenarios, and make CICD to recognize those
- value files for different workload feature variants: e.g. gaurdrails-values.yaml
- value files for different optional feature: e.g. hpa-values.yaml,
- value files for different vendor specific hw: e.g. gaudi-values.yaml
- the combinations of the above 3
But this is another story, not related to HPA
And document need for specifying resources. Signed-off-by: Eero Tamminen <[email protected]>
Signed-off-by: Eero Tamminen <[email protected]>
Signed-off-by: Eero Tamminen <[email protected]>
https://helm.sh/docs/chart_best_practices/templates/ Signed-off-by: Eero Tamminen <[email protected]>
052b80d
to
2c11fe8
Compare
I believe the 10 mins wait time for deployment ready( |
So that HPA scaling rules use custom metrics for correct set of TGI/TEI instances. Signed-off-by: Eero Tamminen <[email protected]>
* Use custom metrics configMap name that does NOT match one installed by Prometheus, because Helm does not allow overwriting object created by another Helm chart (like using manifests would). * Add Prometheus release name to serviceMonitors. Otherwise Helm installed Prometheus does not find serviceMonitors. Alternative would be using: prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false Reported by Theresa. Signed-off-by: Eero Tamminen <[email protected]>
Pushed fixes for these issues, meaning:
Discussed with Theresa, and I will be doing minimal manifest update (later today). I.e. fix non-HPA TGI/TEI deployments, and reduce HPA/ dir content to fixed HPA + custom metric rules and serviceMonitors. Deployment resource limits will come later for GMC. Handling manifest generation properly for deployments required changing replica and terminationTimeout settings to depend on something else than HPA. Termination timeout really depends on whether pod is accelerated, not whether HPA is used, so there's no new |
Timeout is question of Pod slowness i.e. is it accelerated, not HPA. Replicas need to be set only when count is set to something else than the default value (1). This will work also for HPA, while making GMC manifest generation easier. Signed-off-by: Eero Tamminen <[email protected]>
How to install Prometheus with Helm and fix doc issues. Having a manual step that replaces existing PrometheusAdapter config with the generated one should work in all cases (it's a workaround for Helm refusing to overwrite object created by another chart). Signed-off-by: Eero Tamminen <[email protected]>
Instead of replacing deployments with HPA/ ones, apply fixes directly to the normal deployment manifests. K8s default is 1 replica, so that can be dropped, which works also better with HPA. Because resource requests are model specific, and GMC is used to change model, HPA/ manifests won't help with that, GMC needs to take care of that (eventually). Signed-off-by: Eero Tamminen <[email protected]>
And current Helm charts contents. Signed-off-by: Eero Tamminen <[email protected]>
720538a
to
f46f733
Compare
Merged GMC manifest update was done manually, because current Helm charts do not support differentiating deployments enough based on whether it's an accelerated (= much faster) deployment. Somebody would need to change all deployment templates to apply proper timings based on whether CPU or accelerator (e.g. Gaudi) is used, similarly to how I did with termination timeouts. However, that's a generic problem in the charts, so it belongs to some other PR, not into this v1.0 milestone one. (While the problem is generic, its effect will be multiplied when number of pods is scaled, like happens with HPA.) @irisdingbj is the PR now OK from your side? |
Variable was left-over from "Fixes for Helm installed Prometheus version". Signed-off-by: Eero Tamminen <[email protected]>
938a0f7
to
7eb3690
Compare
Signed-off-by: Eero Tamminen <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eero-t Please see my embedded comments. @yongfengdu any comment on this?
@@ -110,12 +110,14 @@ spec: | |||
port: http | |||
initialDelaySeconds: 5 | |||
periodSeconds: 5 | |||
timeoutSeconds: 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The manifest files are under microservices-connector/config/manifests
directory is generated by helm-charts/update_manifests.sh
automatically. So if you want to change the default probe settings, please change it in corresponding component's helm chart, i.e. helm-chart/common/tei/values.yaml
. Otherwise, these values will be easily overwritten.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lianhao Yes. Using cpu-values.yaml
in ChatQnA papers over the (CPU) issues with current Helm charts, but a proper fix would be needed for generating the GMC manifests. However, that is not HPA specific problem, and requires more discussion[1][2] so IMHO it was clearly out of scope for this PR (although scaling makes these problems much more visible).
[1] What's IMHO needed to fix things for probe timings:
- Common components' value files include different probe timing sections for CPU and for accelerators
- Their deployment templates select one based on
.Values.accelDevice
value (empty for CPU) - All
<device>-values.yaml
files set appropriate<subchart>.accelDevice
value (not just ChatQnA) - GMC device variant manifests are generated for all relevant components, not just TGI
(I don't think probe timings would need to be fine-tuned based on which model is used.)
[2] What's IMHO needed to fix resource requests:
- Current sub-optimal component arguments are optimized, and resulting resource requirements are profiled, for all relevant models
- For example on SPR, CPU TGI data type can be set to bfloat16, which halves its memory usage
- Observability support + suitable Grafana dashboards will help with profiling
- Instead of subcomponent model & corresponding resources being specified in top-level chart,
helm install
command uses suitable model+resource+args file from given component, like this:-f common/tgi/gaudi/neural-chat-7b.yaml
-f common/teirerank/gaudi/bge-reranker-base.yaml
-f common/tei/cpu/bge-base.yaml
-f common/data-prep/gpu/values.yaml
- (These would provide values with subchart prefix/heading so they can be used from top-level charts)
- There could also be a global option for ignoring (CPU side) resource requests that can be used when things need to be re-profiled
- GMC applies resource specs generated from these, when user changes model
If there are combinations of above which are common between different top-level charts, I would suggest Makefile
merging relevant ones to common files (e.g. -f common/gaudi-model-defaults.yaml
), to avoid duplicating things that may need to be updated whenever args, model, or image versions get updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sync the mismatch between helm chart and GMC manifests introduced by PR opea-project#386 Signed-off-by: Lianhao Lu <[email protected]>
Sync the mismatch between helm chart and GMC manifests introduced by PR opea-project#386 Signed-off-by: Lianhao Lu <[email protected]>
Sync the mismatch between helm chart and GMC manifests introduced by PR #386 Signed-off-by: Lianhao Lu <[email protected]>
Continuation of / depends on: #327
Description
Helm charts need fixes for HPA to work (with less changes) out of the box:
Ready
stateAdditionally PR also adds:
(Best reviewed by checking individual commits.)
Issues
n/a
Type of change
Dependencies
n/a
Tests
Manual testing for the changes (before splitting them to separate commits & documenting them).