diff --git a/helm-charts/HPA.md b/helm-charts/HPA.md
new file mode 100644
index 00000000..d4e03517
--- /dev/null
+++ b/helm-charts/HPA.md
@@ -0,0 +1,196 @@
+# HorizontalPodAutoscaler (HPA) support
+
+## Table of Contents
+
+- [Introduction](#introduction)
+- [Pre-conditions](#pre-conditions)
+  - [Resource requests](#resource-requests)
+  - [Prometheus](#prometheus)
+- [Gotchas](#gotchas)
+- [Enable HPA](#enable-hpa)
+  - [Install](#install)
+  - [Post-install](#post-install)
+- [Verify](#verify)
+
+## Introduction
+
+The `horizontalPodAutoscaler` option enables HPA scaling for the TGI and TEI inferencing deployments:
+https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
+
+Autoscaling is based on custom application metrics provided through [Prometheus](https://prometheus.io/).
+
+## Pre-conditions
+
+Read the [post-install](#post-install) steps before installation!
+
+### Resource requests
+
+HPA controlled CPU pods SHOULD have appropriate resource requests or affinity rules (enabled in their
+subcharts and tested to work) so that the k8s scheduler does not schedule too many of them on the same
+node(s). Otherwise they never reach ready state.
+
+If you use different models than the default ones, update the TGI and TEI resource requests to match
+the model requirements.
+
+Too large requests are not a problem as long as pods still fit on the available nodes. However,
+unless rules have been added to pods preventing them from being scheduled on the same nodes, too
+small requests are an issue:
+
+- Multiple inferencing instances interfere with / slow down each other, especially if there are no
+  [NRI policies](https://github.com/opea-project/GenAIEval/tree/main/doc/platform-optimization)
+  that provide further isolation
+- Containers can become non-functional when their actual resource usage crosses the specified limits
+
+### Prometheus
+
+If the cluster does not run [Prometheus operator](https://github.com/prometheus-operator/kube-prometheus)
+yet, it SHOULD be installed before enabling HPA, e.g. by using a Helm chart for it:
+https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
+
+Prometheus-adapter is also needed, to provide k8s custom metrics based on the collected TGI / TEI metrics:
+https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-adapter
+
+To install (older versions of) them:
+
+```console
+$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+$ helm repo update
+$ prom_ns=monitoring # namespace for Prometheus/-adapter
+$ kubectl create ns $prom_ns
+$ helm install prometheus-stack prometheus-community/kube-prometheus-stack --version 55.5.2 -n $prom_ns
+$ kubectl get services -n $prom_ns
+$ helm install prometheus-adapter prometheus-community/prometheus-adapter --version 4.10.0 -n $prom_ns \
+  --set prometheus.url=http://prometheus-stack-kube-prom-prometheus.$prom_ns.svc \
+  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
+```
+
+NOTE: the service name given above in `prometheus.url` must match the listed Prometheus service name,
+otherwise the adapter cannot access it!
+
+(An alternative to setting the above `prometheusSpec` variable to `false` is making sure that the
+`prometheusRelease` value in the top-level chart matches the release name given to the Prometheus
+install, i.e. when it differs from the `prometheus-stack` used above. That value is used to annotate
+created serviceMonitors with a label that Prometheus requires when the above option is `true`.)
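+
+To verify that the Prometheus stack and adapter pods came up, and that the adapter has
+registered the custom metrics API, something like the following can be used (a minimal
+sanity check, assuming the namespace used above):
+
+```console
+$ kubectl -n $prom_ns get pods
+$ kubectl get apiservices | grep custom.metrics
+```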
+
+## Gotchas
+
+Why HPA is opt-in:
+
+- Installing custom metrics for HPA requires manual post-install steps, as
+  Prometheus-operator and -adapter are missing support needed to automate that
+- The top-level chart name needs to conform to Prometheus metric naming conventions,
+  as it is also used as a metric name prefix (with dashes converted to underscores)
+- By default Prometheus adds [k8s RBAC rules](https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/prometheus-roleBindingSpecificNamespaces.yaml)
+  for accessing metrics from `default`, `kube-system` and `monitoring` namespaces. If Helm is
+  asked to install OPEA services to some other namespace, those rules need to be updated accordingly
+- Unless pod resource requests, affinity rules, scheduling topology constraints and/or cluster NRI
+  policies are used to better isolate service inferencing pods from each other, instances
+  scaled up on the same node may never get to ready state
+- Current HPA rules are just examples; for efficient scaling they need to be fine-tuned to the
+  performance of the given setup (underlying HW, used models and data types, OPEA version etc.)
+- Debugging missing custom metric issues is hard as logs rarely include anything helpful
+
+## Enable HPA
+
+### Install
+
+ChatQnA includes pre-configured values files for scaling the services.
+
+To enable HPA, add the `-f chatqna/hpa-values.yaml` option to your `helm install` command line.
+
+If **CPU** versions of the TGI (and TEI) services are being scaled, resource requests and probe timings
+suitable for CPU usage need to be used. Add the `-f chatqna/cpu-values.yaml` option to your `helm install`
+line. If you need to change the model specified there, update the resource requests accordingly.
+
+### Post-install
+
+The above step created a custom metrics config for Prometheus-adapter, suitable for HPA use.
+
+Take a backup of the existing custom metrics config before replacing it:
+
+```console
+$ prom_ns=monitoring # Prometheus/-adapter namespace
+$ name=$(kubectl -n $prom_ns get cm --selector app.kubernetes.io/name=prometheus-adapter -o name | cut -d/ -f2)
+$ kubectl -n $prom_ns get cm/$name -o yaml > adapter-config.yaml.bak
+```
+
+Save the generated config, with names updated to match the current adapter config:
+
+```console
+$ chart=chatqna # OPEA chart release name
+$ kubectl get cm/$chart-custom-metrics -o yaml | sed \
+  -e "s/name:.*custom-metrics$/name: $name/" \
+  -e "s/namespace: default$/namespace: $prom_ns/" \
+  > adapter-config.yaml
+```
+
+NOTE: if there are existing custom metric rules you need to retain, add them from the saved
+`adapter-config.yaml.bak` to the `adapter-config.yaml` file now!
+
+Overwrite the current Prometheus-adapter configMap with the generated one:
+
+```console
+$ kubectl delete -n $prom_ns cm/$name
+$ kubectl apply -f adapter-config.yaml
+```
+
+And restart the adapter, so that it uses the new config:
+
+```console
+$ selector=app.kubernetes.io/name=prometheus-adapter
+$ kubectl -n $prom_ns delete $(kubectl -n $prom_ns get pod --selector $selector -o name)
+```
+
+## Verify
+
+To verify that the horizontalPodAutoscaler options work, check both that the metrics provided
+by the inferencing services work, and that the HPA rules using custom metrics generated from them do.
+
+(Object names depend on whether Prometheus was installed from manifests or Helm,
+and on the release name given for its Helm install.)
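+
+If unsure which release names are in use, listing the installed Helm releases shows them
+(assuming both Prometheus and the OPEA application were installed via Helm):
+
+```console
+$ helm list -A
+```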
+
+Check the installed Prometheus service names:
+
+```console
+$ prom_ns=monitoring # Prometheus/-adapter namespace
+$ kubectl -n $prom_ns get svc
+```
+
+Use the service name matching your Prometheus installation:
+
+```console
+$ prom_svc=prometheus-stack-kube-prom-prometheus # Metrics service
+```
+
+Verify that Prometheus found metric endpoints for the chart services, i.e. the last number in the `curl` output is non-zero:
+
+```console
+$ chart=chatqna # OPEA chart release name
+$ prom_url=http://$(kubectl -n $prom_ns get -o jsonpath="{.spec.clusterIP}:{.spec.ports[0].port}" svc/$prom_svc)
+$ curl --no-progress-meter $prom_url/metrics | grep scrape_pool_targets.*$chart
+```
+
+**NOTE**: TGI and TEI inferencing services provide a metrics endpoint only after they have processed
+their first request, and the reranking service will be used only after context data has been uploaded!
+
+Check that both Prometheus metrics required from TGI are available:
+
+```console
+$ for m in sum count; do
+  curl --no-progress-meter $prom_url/api/v1/query? \
+  --data-urlencode query=tgi_request_inference_duration_$m'{service="'$chart'-tgi"}' | jq;
+done | grep __name__
+```
+
+Check that PrometheusAdapter lists the corresponding TGI and/or TEI custom metrics (prefixed with the chart name):
+
+```console
+$ kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .resources[].name
+```
+
+And that the HPA rules have TARGET values for the HPA controlled service deployments (instead of `<unknown>`):
+
+```console
+$ ns=default # OPEA namespace
+$ kubectl -n $ns get hpa
+```
diff --git a/helm-charts/README.md b/helm-charts/README.md
index c4eef858..afddb19a 100644
--- a/helm-charts/README.md
+++ b/helm-charts/README.md
@@ -9,10 +9,6 @@ This directory contains helm charts for [GenAIComps](https://github.com/opea-pro
 - [Components](#components)
 - [How to deploy with helm charts](#deploy-with-helm-charts)
 - [Helm Charts Options](#helm-charts-options)
-- [HorizontalPodAutoscaler (HPA) support](#horizontalpodautoscaler-hpa-support)
-  - [Pre-conditions](#pre-conditions)
-  - [Gotchas](#gotchas)
-  - [Verify HPA metrics](#verify-hpa-metrics)
 - [Using Persistent Volume](#using-persistent-volume)
 - [Using Private Docker Hub](#using-private-docker-hub)
 - [Helm Charts repository](#helm-chart-repository)
@@ -66,71 +62,9 @@ There are global options(which should be shared across all components of a workl
 | global | http_proxy https_proxy no_proxy | Proxy settings. If you are running the workloads behind the proxy, you'll have to add your proxy settings here. |
 | global | modelUsePVC | The PersistentVolumeClaim you want to use as huggingface hub cache. Default "" means not using PVC. Only one of modelUsePVC/modelUseHostPath can be set. |
 | global | modelUseHostPath | If you don't have Persistent Volume in your k8s cluster and want to use local directory as huggingface hub cache, set modelUseHostPath to your local directory name. Note that this can't share across nodes. Default "". Only one of modelUsePVC/modelUseHostPath can be set. |
-| global | horizontalPodAutoscaler.enabled | Enable HPA autoscaling for TGI and TEI service deployments based on metrics they provide. See #pre-conditions and #gotchas before enabling! |
+| chatqna | horizontalPodAutoscaler.enabled | Enable HPA autoscaling for TGI and TEI service deployments based on metrics they provide. See [Pre-conditions](HPA.md#pre-conditions) and [Gotchas](HPA.md#gotchas) before enabling! |
 | tgi | LLM_MODEL_ID | The model id you want to use for tgi server. Default "Intel/neural-chat-7b-v3-3".
| -## HorizontalPodAutoscaler (HPA) support - -`horizontalPodAutoscaler` option enables HPA scaling for the TGI and TEI inferencing deployments: -https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/ - -Autoscaling is based on custom application metrics provided through [Prometheus](https://prometheus.io/). - -### Pre-conditions - -If cluster does not run [Prometheus operator](https://github.com/prometheus-operator/kube-prometheus) -yet, it SHOULD be be installed before enabling HPA, e.g. by using: -https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack - -Enabling HPA in top-level Helm chart (e.g. `chatqna`), overwrites cluster's current _PrometheusAdapter_ -configuration with relevant custom metric queries. If that has queries you wish to retain, _or_ HPA is -otherwise enabled only in TGI or TEI subchart(s), you need add relevat queries to _PrometheusAdapter_ -configuration _manually_ (e.g. from `chatqna` custom metrics Helm template). - -### Gotchas - -Why HPA is opt-in: - -- Enabling (top level) chart `horizontalPodAutoscaler` option will _overwrite_ cluster's current - `PrometheusAdapter` configuration with its own custom metrics configuration. - Take copy of the existing one before install, if that matters: - `kubectl -n monitoring get cm/adapter-config -o yaml > adapter-config.yaml` -- `PrometheusAdapter` needs to be restarted after install, for it to read the new configuration: - `ns=monitoring; kubectl -n $ns delete $(kubectl -n $ns get pod --selector app.kubernetes.io/name=prometheus-adapter -o name)` -- By default Prometheus adds [k8s RBAC rules](https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/prometheus-roleBindingSpecificNamespaces.yaml) - for accessing metrics from `default`, `kube-system` and `monitoring` namespaces. If Helm is - asked to install OPEA services to some other namespace, those rules need to be updated accordingly -- Current HPA rules are examples for Xeon, for efficient scaling they need to be fine-tuned for given setup - performance (underlying HW, used models and data types, OPEA version etc) - -### Verify HPA metrics - -To verify that metrics required by horizontalPodAutoscaler option work, check following... - -Prometheus has found the metric endpoints, i.e. last number on `curl` output is non-zero: - -```console -chart=chatqna; # OPEA services prefix -ns=monitoring; # Prometheus namespace -prom_url=http://$(kubectl -n $ns get -o jsonpath="{.spec.clusterIP}:{.spec.ports[0].port}" svc/prometheus-k8s); -curl --no-progress-meter $prom_url/metrics | grep scrape_pool_targets.*$chart -``` - -**NOTE**: TGI and TEI inferencing services provide metrics endpoint only after they've processed their first request! - -PrometheusAdapter lists TGI and/or TGI custom metrics (`te_*` / `tgi_*`): - -```console -kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .resources[].name -``` - -HPA rules list valid (not ``) TARGET values for service deployments: - -```console -ns=default; # OPEA namespace -kubectl -n $ns get hpa -``` - ## Using Persistent Volume It's common to use Persistent Volume(PV) for model caches(huggingface hub cache) in a production k8s cluster. We support to pass the PersistentVolumeClaim(PVC) to containers, but it's the user's responsibility to create the PVC depending on your k8s cluster's capability. 
diff --git a/helm-charts/chatqna/cpu-values.yaml b/helm-charts/chatqna/cpu-values.yaml
new file mode 100644
index 00000000..b4c5ee5d
--- /dev/null
+++ b/helm-charts/chatqna/cpu-values.yaml
@@ -0,0 +1,109 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+# Override CPU resource request and probe timing values in specific subcharts
+#
+# RESOURCES
+#
+# Resource requests matching actual resource usage (with enough slack)
+# are important when a service is scaled up, so that the right number
+# of pods gets scheduled to the right nodes.
+#
+# Because resource usage depends on the used devices, model, data type
+# and SW versions, and this top-level chart has overrides for them,
+# resource requests need to be specified here too.
+#
+# To test a service without resource requests, use "resources: {}".
+#
+# PROBES
+#
+# Inferencing pod startup / warmup takes *much* longer on CPUs than
+# on acceleration devices, and their responses are also slower,
+# especially when a node is running several instances of these services.
+#
+# Kubernetes restarting a pod before its startup finishes, or not
+# sending it queries because slow readiness responses keep it out of
+# ready state, really does NOT help in getting faster responses.
+#
+# => probe timings need to be increased when running on CPU.
+
+tgi:
+  # TODO: add Helm value also for TGI data type option:
+  # https://github.com/opea-project/GenAIExamples/issues/330
+  LLM_MODEL_ID: Intel/neural-chat-7b-v3-3
+
+  # Potentially suitable values for scaling CPU TGI 2.2 with Intel/neural-chat-7b-v3-3 @ 32-bit:
+  resources:
+    limits:
+      cpu: 8
+      memory: 70Gi
+    requests:
+      cpu: 6
+      memory: 65Gi
+
+  livenessProbe:
+    initialDelaySeconds: 8
+    periodSeconds: 8
+    failureThreshold: 24
+    timeoutSeconds: 4
+  readinessProbe:
+    initialDelaySeconds: 16
+    periodSeconds: 8
+    timeoutSeconds: 4
+  startupProbe:
+    initialDelaySeconds: 10
+    periodSeconds: 5
+    failureThreshold: 180
+    timeoutSeconds: 2
+
+teirerank:
+  RERANK_MODEL_ID: "BAAI/bge-reranker-base"
+
+  # Potentially suitable values for scaling CPU TEI v1.5 with BAAI/bge-reranker-base model:
+  resources:
+    limits:
+      cpu: 4
+      memory: 30Gi
+    requests:
+      cpu: 2
+      memory: 25Gi
+
+  livenessProbe:
+    initialDelaySeconds: 8
+    periodSeconds: 8
+    failureThreshold: 24
+    timeoutSeconds: 4
+  readinessProbe:
+    initialDelaySeconds: 8
+    periodSeconds: 8
+    timeoutSeconds: 4
+  startupProbe:
+    initialDelaySeconds: 5
+    periodSeconds: 5
+    failureThreshold: 120
+
+tei:
+  EMBEDDING_MODEL_ID: "BAAI/bge-base-en-v1.5"
+
+  # Potentially suitable values for scaling CPU TEI 1.5 with BAAI/bge-base-en-v1.5 model:
+  resources:
+    limits:
+      cpu: 4
+      memory: 4Gi
+    requests:
+      cpu: 2
+      memory: 3Gi
+
+  livenessProbe:
+    initialDelaySeconds: 5
+    periodSeconds: 5
+    failureThreshold: 24
+    timeoutSeconds: 2
+  readinessProbe:
+    initialDelaySeconds: 5
+    periodSeconds: 5
+    timeoutSeconds: 2
+  startupProbe:
+    initialDelaySeconds: 5
+    periodSeconds: 5
+    failureThreshold: 120
diff --git a/helm-charts/chatqna/gaudi-values.yaml b/helm-charts/chatqna/gaudi-values.yaml
index e5aa5505..21d30870 100644
--- a/helm-charts/chatqna/gaudi-values.yaml
+++ b/helm-charts/chatqna/gaudi-values.yaml
@@ -2,6 +2,7 @@
 # SPDX-License-Identifier: Apache-2.0
 
 tei:
+  accelDevice: "gaudi"
   image:
     repository: ghcr.io/huggingface/tei-gaudi
     tag: synapse_1.16
@@ -13,6 +14,7 @@ tei:
 
 # To override values in subchart tgi
 tgi:
+  accelDevice: "gaudi"
   image:
     repository: ghcr.io/huggingface/tgi-gaudi
     tag: "2.0.1"
diff --git a/helm-charts/chatqna/guardrails-gaudi-values.yaml
b/helm-charts/chatqna/guardrails-gaudi-values.yaml
index bb88c8bd..2533c725 100644
--- a/helm-charts/chatqna/guardrails-gaudi-values.yaml
+++ b/helm-charts/chatqna/guardrails-gaudi-values.yaml
@@ -13,6 +13,7 @@ guardrails-usvc:
 
 # gaudi related config
 tei:
+  accelDevice: "gaudi"
   image:
     repository: ghcr.io/huggingface/tei-gaudi
     tag: synapse_1.16
@@ -23,6 +24,7 @@ tei:
     readOnlyRootFilesystem: false
 
 tgi:
+  accelDevice: "gaudi"
   image:
     repository: ghcr.io/huggingface/tgi-gaudi
     tag: "2.0.1"
@@ -34,6 +36,7 @@ tgi:
   CUDA_GRAPHS: ""
 
 tgi-guardrails:
+  accelDevice: "gaudi"
   LLM_MODEL_ID: "meta-llama/Meta-Llama-Guard-2-8B"
   image:
     repository: ghcr.io/huggingface/tgi-gaudi
diff --git a/helm-charts/chatqna/hpa-values.yaml b/helm-charts/chatqna/hpa-values.yaml
new file mode 100644
index 00000000..d272eea0
--- /dev/null
+++ b/helm-charts/chatqna/hpa-values.yaml
@@ -0,0 +1,29 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+# Enable HorizontalPodAutoscaler (HPA)
+#
+# This generates a configMap with ChatQnA specific custom metric queries for the
+# embedding, reranking and tgi services (see HPA.md for the manual post-install step).
+#
+# Default upstream configMap is in:
+# - https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/deploy/manifests/config-map.yaml
+
+horizontalPodAutoscaler:
+  enabled: true
+
+# Override values in specific subcharts
+
+# Enabling "horizontalPodAutoscaler" for any of the subcharts requires also enabling it above!
+tgi:
+  horizontalPodAutoscaler:
+    maxReplicas: 4
+    enabled: true
+teirerank:
+  horizontalPodAutoscaler:
+    maxReplicas: 3
+    enabled: true
+tei:
+  horizontalPodAutoscaler:
+    maxReplicas: 2
+    enabled: true
diff --git a/helm-charts/chatqna/nv-values.yaml b/helm-charts/chatqna/nv-values.yaml
index 79ffec91..63e3eaf4 100644
--- a/helm-charts/chatqna/nv-values.yaml
+++ b/helm-charts/chatqna/nv-values.yaml
@@ -3,6 +3,7 @@
 
 # To override values in subchart tgi
 tgi:
+  accelDevice: "nvidia"
   image:
     repository: ghcr.io/huggingface/text-generation-inference
     tag: "2.2.0"
diff --git a/helm-charts/chatqna/templates/customMetrics.yaml b/helm-charts/chatqna/templates/custom-metrics-configmap.yaml
similarity index 73%
rename from helm-charts/chatqna/templates/customMetrics.yaml
rename to helm-charts/chatqna/templates/custom-metrics-configmap.yaml
index 64123df0..17b23903 100644
--- a/helm-charts/chatqna/templates/customMetrics.yaml
+++ b/helm-charts/chatqna/templates/custom-metrics-configmap.yaml
@@ -1,18 +1,29 @@
 # Copyright (C) 2024 Intel Corporation
 # SPDX-License-Identifier: Apache-2.0
 
-{{- if .Values.global.horizontalPodAutoscaler.enabled }}
+{{- if .Values.horizontalPodAutoscaler.enabled }}
 apiVersion: v1
+kind: ConfigMap
+metadata:
+  # easy to find for the required manual step
+  namespace: default
+  name: {{ include "chatqna.fullname" .
}}-custom-metrics + labels: + app.kubernetes.io/name: prometheus-adapter data: config.yaml: | rules: + {{- if .Values.tgi.horizontalPodAutoscaler.enabled }} + # check metric with: + # kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/service/*/{{ include "tgi.metricPrefix" .Subcharts.tgi }}_request_latency | jq + # - seriesQuery: '{__name__="tgi_request_inference_duration_sum",service="{{ include "tgi.fullname" .Subcharts.tgi }}"}' # Average request latency from TGI histograms, over 1 min # (0.001 divider add is to make sure there's always a valid value) metricsQuery: 'rate(tgi_request_inference_duration_sum{service="{{ include "tgi.fullname" .Subcharts.tgi }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(tgi_request_inference_duration_count{service="{{ include "tgi.fullname" .Subcharts.tgi }}",<<.LabelMatchers>>}[1m]))' name: matches: ^tgi_request_inference_duration_sum - as: "tgi_request_latency" + as: "{{ include "tgi.metricPrefix" .Subcharts.tgi }}_request_latency" resources: # HPA needs both namespace + suitable object resource for its query paths: # /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/service/*/tgi_request_latency @@ -22,32 +33,33 @@ data: resource: namespace service: resource: service + {{- end }} + {{- if .Values.teirerank.horizontalPodAutoscaler.enabled }} - seriesQuery: '{__name__="te_request_inference_duration_sum",service="{{ include "teirerank.fullname" .Subcharts.teirerank }}"}' # Average request latency from TEI histograms, over 1 min metricsQuery: 'rate(te_request_inference_duration_sum{service="{{ include "teirerank.fullname" .Subcharts.teirerank }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(te_request_inference_duration_count{service="{{ include "teirerank.fullname" .Subcharts.teirerank }}",<<.LabelMatchers>>}[1m]))' name: matches: ^te_request_inference_duration_sum - as: "reranking_request_latency" + as: "{{ include "teirerank.metricPrefix" .Subcharts.teirerank }}_request_latency" resources: overrides: namespace: resource: namespace service: resource: service + {{- end }} + {{- if .Values.tei.horizontalPodAutoscaler.enabled }} - seriesQuery: '{__name__="te_request_inference_duration_sum",service="{{ include "tei.fullname" .Subcharts.tei }}"}' # Average request latency from TEI histograms, over 1 min metricsQuery: 'rate(te_request_inference_duration_sum{service="{{ include "tei.fullname" .Subcharts.tei }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(te_request_inference_duration_count{service="{{ include "tei.fullname" .Subcharts.tei }}",<<.LabelMatchers>>}[1m]))' name: matches: ^te_request_inference_duration_sum - as: "embedding_request_latency" + as: "{{ include "tei.metricPrefix" .Subcharts.tei }}_request_latency" resources: overrides: namespace: resource: namespace service: resource: service -kind: ConfigMap -metadata: - name: adapter-config - namespace: monitoring + {{- end }} {{- end }} diff --git a/helm-charts/chatqna/values.yaml b/helm-charts/chatqna/values.yaml index bfacc09a..8fb54ffd 100644 --- a/helm-charts/chatqna/values.yaml +++ b/helm-charts/chatqna/values.yaml @@ -35,7 +35,12 @@ tolerations: [] affinity: {} -# To override values in subchart tgi +# This is just to avoid Helm errors when HPA is NOT used +# (use hpa-values.yaml files to actually enable HPA). 
+horizontalPodAutoscaler: + enabled: false + +# Override values in specific subcharts tgi: LLM_MODEL_ID: Intel/neural-chat-7b-v3-3 @@ -62,10 +67,5 @@ global: # modelUseHostPath: /mnt/opea-models # modelUsePVC: model-volume - # Enabling HorizontalPodAutoscaler (HPA) will: - # - Overwrite existing PrometheusAdapter "adapter-config" configMap with ChatQnA specific custom metric queries - # for embedding, reranking, tgi services - # Upstream default configMap: - # - https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/deploy/manifests/config-map.yaml - horizontalPodAutoscaler: - enabled: false + # Prometheus Helm installation info for subchart serviceMonitors + prometheusRelease: prometheus-stack diff --git a/helm-charts/common/tei/gaudi-values.yaml b/helm-charts/common/tei/gaudi-values.yaml index 9d1b2690..17358ea6 100644 --- a/helm-charts/common/tei/gaudi-values.yaml +++ b/helm-charts/common/tei/gaudi-values.yaml @@ -5,6 +5,8 @@ # This is a YAML-formatted file. # Declare variables to be passed into your templates. +accelDevice: "gaudi" + image: repository: ghcr.io/huggingface/tei-gaudi tag: synapse_1.16 diff --git a/helm-charts/common/tei/templates/_helpers.tpl b/helm-charts/common/tei/templates/_helpers.tpl index 4158a861..fc4a5743 100644 --- a/helm-charts/common/tei/templates/_helpers.tpl +++ b/helm-charts/common/tei/templates/_helpers.tpl @@ -30,6 +30,13 @@ Create chart name and version as used by the chart label. {{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }} {{- end }} +{{/* +Convert chart name to a string suitable as metric prefix +*/}} +{{- define "tei.metricPrefix" -}} +{{- include "tei.fullname" . | replace "-" "_" | regexFind "[a-zA-Z_:][a-zA-Z0-9_:]*" }} +{{- end }} + {{/* Common labels */}} diff --git a/helm-charts/common/tei/templates/deployment.yaml b/helm-charts/common/tei/templates/deployment.yaml index 798f979f..f9536f36 100644 --- a/helm-charts/common/tei/templates/deployment.yaml +++ b/helm-charts/common/tei/templates/deployment.yaml @@ -8,8 +8,8 @@ metadata: labels: {{- include "tei.labels" . | nindent 4 }} spec: - # use explicit replica counts only of HorizontalPodAutoscaler is disabled - {{- if not .Values.global.horizontalPodAutoscaler.enabled }} + {{- if ne (int .Values.replicaCount) 1 }} + # remove if replica count should not be reset on pod update with HPA replicas: {{ .Values.replicaCount }} {{- end }} selector: @@ -105,8 +105,8 @@ spec: tolerations: {{- toYaml . 
| nindent 8 }} {{- end }} - {{- if .Values.global.horizontalPodAutoscaler.enabled }} - # extra time to finish processing buffered requests before HPA forcibly terminates pod + {{- if not .Values.accelDevice }} + # extra time to finish processing buffered requests on CPU before pod is forcibly terminated terminationGracePeriodSeconds: 60 {{- end }} {{- if .Values.evenly_distributed }} diff --git a/helm-charts/common/tei/templates/horizontalPodAutoscaler.yaml b/helm-charts/common/tei/templates/horizontal-pod-autoscaler.yaml similarity index 90% rename from helm-charts/common/tei/templates/horizontalPodAutoscaler.yaml rename to helm-charts/common/tei/templates/horizontal-pod-autoscaler.yaml index a448b96c..0da41daf 100644 --- a/helm-charts/common/tei/templates/horizontalPodAutoscaler.yaml +++ b/helm-charts/common/tei/templates/horizontal-pod-autoscaler.yaml @@ -1,7 +1,7 @@ # Copyright (C) 2024 Intel Corporation # SPDX-License-Identifier: Apache-2.0 -{{- if .Values.global.horizontalPodAutoscaler.enabled }} +{{- if .Values.horizontalPodAutoscaler.enabled }} apiVersion: autoscaling/v2 kind: HorizontalPodAutoscaler metadata: @@ -17,8 +17,8 @@ spec: - type: Object object: metric: - # tei-embedding time metrics are in seconds - name: embedding_request_latency + # TEI time metrics are in seconds + name: {{ include "tei.metricPrefix" . }}_request_latency describedObject: apiVersion: v1 # get metric for named object of given type (in same namespace) diff --git a/helm-charts/common/tei/templates/servicemonitor.yaml b/helm-charts/common/tei/templates/servicemonitor.yaml index 05c25528..70398ff6 100644 --- a/helm-charts/common/tei/templates/servicemonitor.yaml +++ b/helm-charts/common/tei/templates/servicemonitor.yaml @@ -1,11 +1,13 @@ # Copyright (C) 2024 Intel Corporation # SPDX-License-Identifier: Apache-2.0 -{{- if .Values.global.horizontalPodAutoscaler.enabled }} +{{- if .Values.horizontalPodAutoscaler.enabled }} apiVersion: monitoring.coreos.com/v1 kind: ServiceMonitor metadata: name: {{ include "tei.fullname" . }} + labels: + release: {{ .Values.global.prometheusRelease }} spec: selector: matchLabels: diff --git a/helm-charts/common/tei/values.yaml b/helm-charts/common/tei/values.yaml index 387de250..486b9adb 100644 --- a/helm-charts/common/tei/values.yaml +++ b/helm-charts/common/tei/values.yaml @@ -7,8 +7,13 @@ replicaCount: 1 +# Enabling HPA will: +# - Ignore above replica count, as it will be controlled by HPA +# - Add example HPA scaling rules with thresholds suitable for Xeon deployments +# - Require custom metrics ConfigMap available in the main application chart horizontalPodAutoscaler: maxReplicas: 2 + enabled: false port: 2081 shmSize: 1Gi @@ -19,6 +24,9 @@ image: # Overrides the image tag whose default is the chart appVersion. tag: "cpu-1.5" +# empty for CPU +accelDevice: "" + imagePullSecrets: [] nameOverride: "" fullnameOverride: "" @@ -95,9 +103,6 @@ global: # By default, both var are set to empty, the model will be downloaded and saved to a tmp volume. 
modelUseHostPath: "" modelUsePVC: "" - # Enabling HPA will: - # - Ignore above replica count, as it will be controlled by HPA - # - Add example HPA scaling rules with thresholds suitable for Xeon deployments - # - Require custom metrics ConfigMap available in the main application chart - horizontalPodAutoscaler: - enabled: false + + # Prometheus Helm installation info for serviceMonitor + prometheusRelease: prometheus-stack diff --git a/helm-charts/common/teirerank/templates/_helpers.tpl b/helm-charts/common/teirerank/templates/_helpers.tpl index 43ef5c71..0c0b9238 100644 --- a/helm-charts/common/teirerank/templates/_helpers.tpl +++ b/helm-charts/common/teirerank/templates/_helpers.tpl @@ -30,6 +30,13 @@ Create chart name and version as used by the chart label. {{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }} {{- end }} +{{/* +Convert chart name to a string suitable as metric prefix +*/}} +{{- define "teirerank.metricPrefix" -}} +{{- include "teirerank.fullname" . | replace "-" "_" | regexFind "[a-zA-Z_:][a-zA-Z0-9_:]*" }} +{{- end }} + {{/* Common labels */}} diff --git a/helm-charts/common/teirerank/templates/deployment.yaml b/helm-charts/common/teirerank/templates/deployment.yaml index 28f2099d..8ea5da56 100644 --- a/helm-charts/common/teirerank/templates/deployment.yaml +++ b/helm-charts/common/teirerank/templates/deployment.yaml @@ -8,8 +8,8 @@ metadata: labels: {{- include "teirerank.labels" . | nindent 4 }} spec: - # use explicit replica counts only of HorizontalPodAutoscaler is disabled - {{- if not .Values.global.horizontalPodAutoscaler.enabled }} + {{- if ne (int .Values.replicaCount) 1 }} + # remove if replica count should not be reset on pod update with HPA replicas: {{ .Values.replicaCount }} {{- end }} selector: @@ -105,8 +105,8 @@ spec: tolerations: {{- toYaml . | nindent 8 }} {{- end }} - {{- if .Values.global.horizontalPodAutoscaler.enabled }} - # extra time to finish processing buffered requests before HPA forcibly terminates pod + {{- if not .Values.accelDevice }} + # extra time to finish processing buffered requests on CPU before pod is forcibly terminated terminationGracePeriodSeconds: 60 {{- end }} {{- if .Values.evenly_distributed }} diff --git a/helm-charts/common/teirerank/templates/horizontalPodAutoscaler.yaml b/helm-charts/common/teirerank/templates/horizontal-pod-autoscaler.yaml similarity index 89% rename from helm-charts/common/teirerank/templates/horizontalPodAutoscaler.yaml rename to helm-charts/common/teirerank/templates/horizontal-pod-autoscaler.yaml index bb249305..c5914fca 100644 --- a/helm-charts/common/teirerank/templates/horizontalPodAutoscaler.yaml +++ b/helm-charts/common/teirerank/templates/horizontal-pod-autoscaler.yaml @@ -1,7 +1,7 @@ # Copyright (C) 2024 Intel Corporation # SPDX-License-Identifier: Apache-2.0 -{{- if .Values.global.horizontalPodAutoscaler.enabled }} +{{- if .Values.horizontalPodAutoscaler.enabled }} apiVersion: autoscaling/v2 kind: HorizontalPodAutoscaler metadata: @@ -17,8 +17,8 @@ spec: - type: Object object: metric: - # tei-reranking time metrics are in seconds - name: reranking_request_latency + # TEI time metrics are in seconds + name: {{ include "teirerank.metricPrefix" . 
}}_request_latency describedObject: apiVersion: v1 # get metric for named object of given type (in same namespace) diff --git a/helm-charts/common/teirerank/templates/servicemonitor.yaml b/helm-charts/common/teirerank/templates/servicemonitor.yaml index 52d355a7..423cb9fc 100644 --- a/helm-charts/common/teirerank/templates/servicemonitor.yaml +++ b/helm-charts/common/teirerank/templates/servicemonitor.yaml @@ -1,11 +1,13 @@ # Copyright (C) 2024 Intel Corporation # SPDX-License-Identifier: Apache-2.0 -{{- if .Values.global.horizontalPodAutoscaler.enabled }} +{{- if .Values.horizontalPodAutoscaler.enabled }} apiVersion: monitoring.coreos.com/v1 kind: ServiceMonitor metadata: name: {{ include "teirerank.fullname" . }} + labels: + release: {{ .Values.global.prometheusRelease }} spec: selector: matchLabels: diff --git a/helm-charts/common/teirerank/values.yaml b/helm-charts/common/teirerank/values.yaml index 01537c70..526a3af4 100644 --- a/helm-charts/common/teirerank/values.yaml +++ b/helm-charts/common/teirerank/values.yaml @@ -7,9 +7,13 @@ replicaCount: 1 - +# Enabling HPA will: +# - Ignore above replica count, as it will be controlled by HPA +# - Add example HPA scaling rules with thresholds suitable for Xeon deployments +# - Require custom metrics ConfigMap available in the main application chart horizontalPodAutoscaler: maxReplicas: 3 + enabled: false port: 2082 shmSize: 1Gi @@ -20,6 +24,9 @@ image: # Overrides the image tag whose default is the chart appVersion. tag: "cpu-1.5" +# empty for CPU +accelDevice: "" + imagePullSecrets: [] nameOverride: "" fullnameOverride: "" @@ -96,9 +103,6 @@ global: # By default, both var are set to empty, the model will be downloaded and saved to a tmp volume. modelUseHostPath: "" modelUsePVC: "" - # Enabling HPA will: - # - Ignore above replica count, as it will be controlled by HPA - # - Add example HPA scaling rules with thresholds suitable for Xeon deployments - # - Require custom metrics ConfigMap available in the main application chart - horizontalPodAutoscaler: - enabled: false + + # Prometheus Helm installation info for serviceMonitor + prometheusRelease: prometheus-stack diff --git a/helm-charts/common/tgi/gaudi-values.yaml b/helm-charts/common/tgi/gaudi-values.yaml index b2b783d3..25546c45 100644 --- a/helm-charts/common/tgi/gaudi-values.yaml +++ b/helm-charts/common/tgi/gaudi-values.yaml @@ -5,6 +5,8 @@ # This is a YAML-formatted file. # Declare variables to be passed into your templates. +accelDevice: "gaudi" + image: repository: ghcr.io/huggingface/tgi-gaudi tag: "2.0.1" diff --git a/helm-charts/common/tgi/nv-values.yaml b/helm-charts/common/tgi/nv-values.yaml index 883ced14..798af895 100644 --- a/helm-charts/common/tgi/nv-values.yaml +++ b/helm-charts/common/tgi/nv-values.yaml @@ -5,6 +5,8 @@ # This is a YAML-formatted file. # Declare variables to be passed into your templates. +accelDevice: "nvidia" + image: repository: ghcr.io/huggingface/text-generation-inference tag: "2.2.0" diff --git a/helm-charts/common/tgi/templates/_helpers.tpl b/helm-charts/common/tgi/templates/_helpers.tpl index 6e98919c..b672e830 100644 --- a/helm-charts/common/tgi/templates/_helpers.tpl +++ b/helm-charts/common/tgi/templates/_helpers.tpl @@ -30,6 +30,13 @@ Create chart name and version as used by the chart label. {{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }} {{- end }} +{{/* +Convert chart name to a string suitable as metric prefix +*/}} +{{- define "tgi.metricPrefix" -}} +{{- include "tgi.fullname" . 
| replace "-" "_" | regexFind "[a-zA-Z_:][a-zA-Z0-9_:]*" }} +{{- end }} + {{/* Common labels */}} diff --git a/helm-charts/common/tgi/templates/deployment.yaml b/helm-charts/common/tgi/templates/deployment.yaml index bafd0ac3..511cead3 100644 --- a/helm-charts/common/tgi/templates/deployment.yaml +++ b/helm-charts/common/tgi/templates/deployment.yaml @@ -8,8 +8,8 @@ metadata: labels: {{- include "tgi.labels" . | nindent 4 }} spec: - # use explicit replica counts only of HorizontalPodAutoscaler is disabled - {{- if not .Values.global.horizontalPodAutoscaler.enabled }} + {{- if ne (int .Values.replicaCount) 1 }} + # remove if replica count should not be reset on pod update with HPA replicas: {{ .Values.replicaCount }} {{- end }} selector: @@ -109,8 +109,8 @@ spec: tolerations: {{- toYaml . | nindent 8 }} {{- end }} - {{- if .Values.global.horizontalPodAutoscaler.enabled }} - # extra time to finish processing buffered requests before HPA forcibly terminates pod + {{- if not .Values.accelDevice }} + # extra time to finish processing buffered requests on CPU before pod is forcibly terminated terminationGracePeriodSeconds: 120 {{- end }} {{- if .Values.evenly_distributed }} diff --git a/helm-charts/common/tgi/templates/horizontalPorAutoscaler.yaml b/helm-charts/common/tgi/templates/horizontal-pod-autoscaler.yaml similarity index 78% rename from helm-charts/common/tgi/templates/horizontalPorAutoscaler.yaml rename to helm-charts/common/tgi/templates/horizontal-pod-autoscaler.yaml index 1131bbdc..646ea9cc 100644 --- a/helm-charts/common/tgi/templates/horizontalPorAutoscaler.yaml +++ b/helm-charts/common/tgi/templates/horizontal-pod-autoscaler.yaml @@ -1,7 +1,7 @@ # Copyright (C) 2024 Intel Corporation # SPDX-License-Identifier: Apache-2.0 -{{- if .Values.global.horizontalPodAutoscaler.enabled }} +{{- if .Values.horizontalPodAutoscaler.enabled }} apiVersion: autoscaling/v2 kind: HorizontalPodAutoscaler metadata: @@ -18,7 +18,7 @@ spec: object: metric: # TGI time metrics are in seconds - name: tgi_request_latency + name: {{ include "tgi.metricPrefix" . }}_request_latency describedObject: apiVersion: v1 # get metric for named object of given type (in same namespace) @@ -37,15 +37,17 @@ spec: policies: - type: Percent value: 25 - periodSeconds: 15 + periodSeconds: 90 scaleUp: selectPolicy: Max stabilizationWindowSeconds: 0 policies: - - type: Percent - value: 50 - periodSeconds: 15 + # Slow linear rampup in case additional CPU pods go to same node + # (i.e. interfere with each other) - type: Pods - value: 2 - periodSeconds: 15 + value: 1 + periodSeconds: 90 + #- type: Percent + # value: 25 + # periodSeconds: 90 {{- end }} diff --git a/helm-charts/common/tgi/templates/servicemonitor.yaml b/helm-charts/common/tgi/templates/servicemonitor.yaml index 0d7d6ffb..fdb1159b 100644 --- a/helm-charts/common/tgi/templates/servicemonitor.yaml +++ b/helm-charts/common/tgi/templates/servicemonitor.yaml @@ -6,11 +6,13 @@ # Metric descriptions: # - https://github.com/huggingface/text-generation-inference/discussions/1127#discussioncomment-7240527 -{{- if .Values.global.horizontalPodAutoscaler.enabled }} +{{- if .Values.horizontalPodAutoscaler.enabled }} apiVersion: monitoring.coreos.com/v1 kind: ServiceMonitor metadata: name: {{ include "tgi.fullname" . 
}} + labels: + release: {{ .Values.global.prometheusRelease }} spec: selector: matchLabels: diff --git a/helm-charts/common/tgi/values.yaml b/helm-charts/common/tgi/values.yaml index d487851e..805df10b 100644 --- a/helm-charts/common/tgi/values.yaml +++ b/helm-charts/common/tgi/values.yaml @@ -7,8 +7,13 @@ replicaCount: 1 +# Enabling HPA will: +# - Ignore above replica count, as it will be controlled by HPA +# - Add example HPA scaling rules with thresholds suitable for Xeon deployments +# - Require custom metrics ConfigMap available in the main application chart horizontalPodAutoscaler: - maxReplicas: 6 + maxReplicas: 4 + enabled: false port: 2080 shmSize: 1Gi @@ -23,6 +28,9 @@ image: # Overrides the image tag whose default is the chart appVersion. tag: "2.2.0" +# empty for CPU +accelDevice: "" + imagePullSecrets: [] nameOverride: "" fullnameOverride: "" @@ -125,9 +133,6 @@ global: # By default, both var are set to empty, the model will be downloaded and saved to a tmp volume. modelUseHostPath: "" modelUsePVC: "" - # Enabling HPA will: - # - Ignore above replica count, as it will be controlled by HPA - # - Add example HPA scaling rules with thresholds suitable for Xeon deployments - # - Require custom metrics ConfigMap available in the main application chart - horizontalPodAutoscaler: - enabled: false + + # Prometheus Helm installation info for serviceMonitor + prometheusRelease: prometheus-stack diff --git a/microservices-connector/config/HPA/customMetrics.yaml b/microservices-connector/config/HPA/custom-metrics-configmap.yaml similarity index 73% rename from microservices-connector/config/HPA/customMetrics.yaml rename to microservices-connector/config/HPA/custom-metrics-configmap.yaml index 3709e578..2860c1d2 100644 --- a/microservices-connector/config/HPA/customMetrics.yaml +++ b/microservices-connector/config/HPA/custom-metrics-configmap.yaml @@ -4,10 +4,10 @@ apiVersion: v1 data: config.yaml: | rules: - - seriesQuery: '{__name__="tgi_request_inference_duration_sum",service="release-name-tgi"}' + - seriesQuery: '{__name__="tgi_request_inference_duration_sum",service="tgi"}' # Average request latency from TGI histograms, over 1 min # (0.001 divider add is to make sure there's always a valid value) - metricsQuery: 'rate(tgi_request_inference_duration_sum{service="release-name-tgi",<<.LabelMatchers>>}[1m]) / (0.001+rate(tgi_request_inference_duration_count{service="release-name-tgi",<<.LabelMatchers>>}[1m]))' + metricsQuery: 'rate(tgi_request_inference_duration_sum{service="tgi",<<.LabelMatchers>>}[1m]) / (0.001+rate(tgi_request_inference_duration_count{service="tgi",<<.LabelMatchers>>}[1m]))' name: matches: ^tgi_request_inference_duration_sum as: "tgi_request_latency" @@ -20,24 +20,24 @@ data: resource: namespace service: resource: service - - seriesQuery: '{__name__="te_request_inference_duration_sum",service="release-name-teirerank"}' + - seriesQuery: '{__name__="te_request_inference_duration_sum",service="teirerank"}' # Average request latency from TEI histograms, over 1 min - metricsQuery: 'rate(te_request_inference_duration_sum{service="release-name-teirerank",<<.LabelMatchers>>}[1m]) / (0.001+rate(te_request_inference_duration_count{service="release-name-teirerank",<<.LabelMatchers>>}[1m]))' + metricsQuery: 'rate(te_request_inference_duration_sum{service="teirerank",<<.LabelMatchers>>}[1m]) / (0.001+rate(te_request_inference_duration_count{service="teirerank",<<.LabelMatchers>>}[1m]))' name: matches: ^te_request_inference_duration_sum - as: "reranking_request_latency" + as: 
"teirerank_request_latency" resources: overrides: namespace: resource: namespace service: resource: service - - seriesQuery: '{__name__="te_request_inference_duration_sum",service="release-name-tei"}' + - seriesQuery: '{__name__="te_request_inference_duration_sum",service="tei"}' # Average request latency from TEI histograms, over 1 min - metricsQuery: 'rate(te_request_inference_duration_sum{service="release-name-tei",<<.LabelMatchers>>}[1m]) / (0.001+rate(te_request_inference_duration_count{service="release-name-tei",<<.LabelMatchers>>}[1m]))' + metricsQuery: 'rate(te_request_inference_duration_sum{service="tei",<<.LabelMatchers>>}[1m]) / (0.001+rate(te_request_inference_duration_count{service="tei",<<.LabelMatchers>>}[1m]))' name: matches: ^te_request_inference_duration_sum - as: "embedding_request_latency" + as: "tei_request_latency" resources: overrides: namespace: diff --git a/microservices-connector/config/HPA/tei.yaml b/microservices-connector/config/HPA/tei.yaml index f5fc5725..edbc698a 100644 --- a/microservices-connector/config/HPA/tei.yaml +++ b/microservices-connector/config/HPA/tei.yaml @@ -1,144 +1,5 @@ --- -# Source: tei/templates/configmap.yaml -# Copyright (C) 2024 Intel Corporation -# SPDX-License-Identifier: Apache-2.0 - -apiVersion: v1 -kind: ConfigMap -metadata: - name: tei-config - labels: - helm.sh/chart: tei-0.8.0 - app.kubernetes.io/name: tei - app.kubernetes.io/instance: tei - app.kubernetes.io/version: "cpu-1.5" - app.kubernetes.io/managed-by: Helm -data: - MODEL_ID: "BAAI/bge-base-en-v1.5" - PORT: "2081" - http_proxy: "" - https_proxy: "" - no_proxy: "" - NUMBA_CACHE_DIR: "/tmp" - TRANSFORMERS_CACHE: "/tmp/transformers_cache" - HF_HOME: "/tmp/.cache/huggingface" - MAX_WARMUP_SEQUENCE_LENGTH: "512" ---- -# Source: tei/templates/service.yaml -# Copyright (C) 2024 Intel Corporation -# SPDX-License-Identifier: Apache-2.0 - -apiVersion: v1 -kind: Service -metadata: - name: tei - labels: - helm.sh/chart: tei-0.8.0 - app.kubernetes.io/name: tei - app.kubernetes.io/instance: tei - app.kubernetes.io/version: "cpu-1.5" - app.kubernetes.io/managed-by: Helm -spec: - type: ClusterIP - ports: - - port: 80 - targetPort: 2081 - protocol: TCP - name: tei - selector: - app.kubernetes.io/name: tei - app.kubernetes.io/instance: tei ---- -# Source: tei/templates/deployment.yaml -# Copyright (C) 2024 Intel Corporation -# SPDX-License-Identifier: Apache-2.0 - -apiVersion: apps/v1 -kind: Deployment -metadata: - name: tei - labels: - helm.sh/chart: tei-0.8.0 - app.kubernetes.io/name: tei - app.kubernetes.io/instance: tei - app.kubernetes.io/version: "cpu-1.5" - app.kubernetes.io/managed-by: Helm -spec: - # use explicit replica counts only of HorizontalPodAutoscaler is disabled - selector: - matchLabels: - app.kubernetes.io/name: tei - app.kubernetes.io/instance: tei - template: - metadata: - labels: - app.kubernetes.io/name: tei - app.kubernetes.io/instance: tei - spec: - securityContext: - {} - containers: - - name: tei - envFrom: - - configMapRef: - name: tei-config - - configMapRef: - name: extra-env-config - optional: true - securityContext: - {} - image: "ghcr.io/huggingface/text-embeddings-inference:cpu-1.5" - imagePullPolicy: IfNotPresent - args: - - "--auto-truncate" - volumeMounts: - - mountPath: /data - name: model-volume - - mountPath: /dev/shm - name: shm - - mountPath: /tmp - name: tmp - ports: - - name: http - containerPort: 2081 - protocol: TCP - livenessProbe: - failureThreshold: 24 - httpGet: - path: /health - port: http - initialDelaySeconds: 5 - periodSeconds: 5 - 
readinessProbe: - httpGet: - path: /health - port: http - initialDelaySeconds: 5 - periodSeconds: 5 - startupProbe: - failureThreshold: 120 - httpGet: - path: /health - port: http - initialDelaySeconds: 5 - periodSeconds: 5 - resources: - {} - volumes: - - name: model-volume - hostPath: - path: /mnt/opea-models - type: Directory - - name: shm - emptyDir: - medium: Memory - sizeLimit: 1Gi - - name: tmp - emptyDir: {} - # extra time to finish processing buffered requests before HPA forcibly terminates pod - terminationGracePeriodSeconds: 60 ---- -# Source: tei/templates/horizontalPodAutoscaler.yaml +# Source: tei/templates/horizontal-pod-autoscaler.yaml # Copyright (C) 2024 Intel Corporation # SPDX-License-Identifier: Apache-2.0 apiVersion: autoscaling/v2 @@ -156,8 +17,8 @@ spec: - type: Object object: metric: - # tei-embedding time metrics are in seconds - name: embedding_request_latency + # TEI time metrics are in seconds + name: tei_request_latency describedObject: apiVersion: v1 # get metric for named object of given type (in same namespace) diff --git a/microservices-connector/config/HPA/teirerank.yaml b/microservices-connector/config/HPA/teirerank.yaml index 181e8b2c..6436a89a 100644 --- a/microservices-connector/config/HPA/teirerank.yaml +++ b/microservices-connector/config/HPA/teirerank.yaml @@ -1,143 +1,5 @@ --- -# Source: teirerank/templates/configmap.yaml -# Copyright (C) 2024 Intel Corporation -# SPDX-License-Identifier: Apache-2.0 - -apiVersion: v1 -kind: ConfigMap -metadata: - name: teirerank-config - labels: - helm.sh/chart: teirerank-0.8.0 - app.kubernetes.io/name: teirerank - app.kubernetes.io/instance: teirerank - app.kubernetes.io/version: "cpu-1.5" - app.kubernetes.io/managed-by: Helm -data: - MODEL_ID: "BAAI/bge-reranker-base" - PORT: "2082" - http_proxy: "" - https_proxy: "" - no_proxy: "" - NUMBA_CACHE_DIR: "/tmp" - TRANSFORMERS_CACHE: "/tmp/transformers_cache" - HF_HOME: "/tmp/.cache/huggingface" ---- -# Source: teirerank/templates/service.yaml -# Copyright (C) 2024 Intel Corporation -# SPDX-License-Identifier: Apache-2.0 - -apiVersion: v1 -kind: Service -metadata: - name: teirerank - labels: - helm.sh/chart: teirerank-0.8.0 - app.kubernetes.io/name: teirerank - app.kubernetes.io/instance: teirerank - app.kubernetes.io/version: "cpu-1.5" - app.kubernetes.io/managed-by: Helm -spec: - type: ClusterIP - ports: - - port: 80 - targetPort: 2082 - protocol: TCP - name: teirerank - selector: - app.kubernetes.io/name: teirerank - app.kubernetes.io/instance: teirerank ---- -# Source: teirerank/templates/deployment.yaml -# Copyright (C) 2024 Intel Corporation -# SPDX-License-Identifier: Apache-2.0 - -apiVersion: apps/v1 -kind: Deployment -metadata: - name: teirerank - labels: - helm.sh/chart: teirerank-0.8.0 - app.kubernetes.io/name: teirerank - app.kubernetes.io/instance: teirerank - app.kubernetes.io/version: "cpu-1.5" - app.kubernetes.io/managed-by: Helm -spec: - # use explicit replica counts only of HorizontalPodAutoscaler is disabled - selector: - matchLabels: - app.kubernetes.io/name: teirerank - app.kubernetes.io/instance: teirerank - template: - metadata: - labels: - app.kubernetes.io/name: teirerank - app.kubernetes.io/instance: teirerank - spec: - securityContext: - {} - containers: - - name: teirerank - envFrom: - - configMapRef: - name: teirerank-config - - configMapRef: - name: extra-env-config - optional: true - securityContext: - {} - image: "ghcr.io/huggingface/text-embeddings-inference:cpu-1.5" - imagePullPolicy: IfNotPresent - args: - - "--auto-truncate" - 
volumeMounts: - - mountPath: /data - name: model-volume - - mountPath: /dev/shm - name: shm - - mountPath: /tmp - name: tmp - ports: - - name: http - containerPort: 2082 - protocol: TCP - livenessProbe: - failureThreshold: 24 - httpGet: - path: /health - port: http - initialDelaySeconds: 5 - periodSeconds: 5 - readinessProbe: - httpGet: - path: /health - port: http - initialDelaySeconds: 5 - periodSeconds: 5 - startupProbe: - failureThreshold: 120 - httpGet: - path: /health - port: http - initialDelaySeconds: 5 - periodSeconds: 5 - resources: - {} - volumes: - - name: model-volume - hostPath: - path: /mnt/opea-models - type: Directory - - name: shm - emptyDir: - medium: Memory - sizeLimit: 1Gi - - name: tmp - emptyDir: {} - # extra time to finish processing buffered requests before HPA forcibly terminates pod - terminationGracePeriodSeconds: 60 ---- -# Source: teirerank/templates/horizontalPodAutoscaler.yaml +# Source: teirerank/templates/horizontal-pod-autoscaler.yaml # Copyright (C) 2024 Intel Corporation # SPDX-License-Identifier: Apache-2.0 apiVersion: autoscaling/v2 @@ -155,8 +17,8 @@ spec: - type: Object object: metric: - # tei-reranking time metrics are in seconds - name: reranking_request_latency + # TEI time metrics are in seconds + name: teirerank_request_latency describedObject: apiVersion: v1 # get metric for named object of given type (in same namespace) diff --git a/microservices-connector/config/HPA/tgi.yaml b/microservices-connector/config/HPA/tgi.yaml index aa047b37..cf28568c 100644 --- a/microservices-connector/config/HPA/tgi.yaml +++ b/microservices-connector/config/HPA/tgi.yaml @@ -1,135 +1,5 @@ --- -# Source: tgi/templates/configmap.yaml -# Copyright (C) 2024 Intel Corporation -# SPDX-License-Identifier: Apache-2.0 - -apiVersion: v1 -kind: ConfigMap -metadata: - name: tgi-config - labels: - helm.sh/chart: tgi-0.8.0 - app.kubernetes.io/name: tgi - app.kubernetes.io/instance: tgi - app.kubernetes.io/version: "2.1.0" - app.kubernetes.io/managed-by: Helm -data: - MODEL_ID: "Intel/neural-chat-7b-v3-3" - PORT: "2080" - HF_TOKEN: "insert-your-huggingface-token-here" - http_proxy: "" - https_proxy: "" - no_proxy: "" - HABANA_LOGS: "/tmp/habana_logs" - NUMBA_CACHE_DIR: "/tmp" - TRANSFORMERS_CACHE: "/tmp/transformers_cache" - HF_HOME: "/tmp/.cache/huggingface" - CUDA_GRAPHS: "0" ---- -# Source: tgi/templates/service.yaml -# Copyright (C) 2024 Intel Corporation -# SPDX-License-Identifier: Apache-2.0 - -apiVersion: v1 -kind: Service -metadata: - name: tgi - labels: - helm.sh/chart: tgi-0.8.0 - app.kubernetes.io/name: tgi - app.kubernetes.io/instance: tgi - app.kubernetes.io/version: "2.1.0" - app.kubernetes.io/managed-by: Helm -spec: - type: ClusterIP - ports: - - port: 80 - targetPort: 2080 - protocol: TCP - name: tgi - selector: - app.kubernetes.io/name: tgi - app.kubernetes.io/instance: tgi ---- -# Source: tgi/templates/deployment.yaml -# Copyright (C) 2024 Intel Corporation -# SPDX-License-Identifier: Apache-2.0 - -apiVersion: apps/v1 -kind: Deployment -metadata: - name: tgi - labels: - helm.sh/chart: tgi-0.8.0 - app.kubernetes.io/name: tgi - app.kubernetes.io/instance: tgi - app.kubernetes.io/version: "2.1.0" - app.kubernetes.io/managed-by: Helm -spec: - # use explicit replica counts only of HorizontalPodAutoscaler is disabled - selector: - matchLabels: - app.kubernetes.io/name: tgi - app.kubernetes.io/instance: tgi - template: - metadata: - labels: - app.kubernetes.io/name: tgi - app.kubernetes.io/instance: tgi - spec: - securityContext: - {} - containers: - - name: tgi - 
envFrom: - - configMapRef: - name: tgi-config - - configMapRef: - name: extra-env-config - optional: true - securityContext: - {} - image: "ghcr.io/huggingface/text-generation-inference:2.2.0" - imagePullPolicy: IfNotPresent - volumeMounts: - - mountPath: /data - name: model-volume - - mountPath: /tmp - name: tmp - ports: - - name: http - containerPort: 2080 - protocol: TCP - livenessProbe: - failureThreshold: 24 - initialDelaySeconds: 5 - periodSeconds: 5 - tcpSocket: - port: http - readinessProbe: - initialDelaySeconds: 5 - periodSeconds: 5 - tcpSocket: - port: http - startupProbe: - failureThreshold: 120 - initialDelaySeconds: 5 - periodSeconds: 5 - tcpSocket: - port: http - resources: - {} - volumes: - - name: model-volume - hostPath: - path: /mnt/opea-models - type: Directory - - name: tmp - emptyDir: {} - # extra time to finish processing buffered requests before HPA forcibly terminates pod - terminationGracePeriodSeconds: 120 ---- -# Source: tgi/templates/horizontalPorAutoscaler.yaml +# Source: tgi/templates/horizontal-pod-autoscaler.yaml # Copyright (C) 2024 Intel Corporation # SPDX-License-Identifier: Apache-2.0 apiVersion: autoscaling/v2 @@ -142,7 +12,7 @@ spec: kind: Deployment name: tgi minReplicas: 1 - maxReplicas: 6 + maxReplicas: 4 metrics: - type: Object object: @@ -167,17 +37,14 @@ spec: policies: - type: Percent value: 25 - periodSeconds: 15 + periodSeconds: 90 scaleUp: selectPolicy: Max stabilizationWindowSeconds: 0 policies: - - type: Percent - value: 50 - periodSeconds: 15 - type: Pods - value: 2 - periodSeconds: 15 + value: 1 + periodSeconds: 90 --- # Source: tgi/templates/servicemonitor.yaml # Copyright (C) 2024 Intel Corporation diff --git a/microservices-connector/config/manifests/tei.yaml b/microservices-connector/config/manifests/tei.yaml index 2889b4d3..2f67a57b 100644 --- a/microservices-connector/config/manifests/tei.yaml +++ b/microservices-connector/config/manifests/tei.yaml @@ -110,12 +110,14 @@ spec: port: http initialDelaySeconds: 5 periodSeconds: 5 + timeoutSeconds: 2 readinessProbe: httpGet: path: /health port: http initialDelaySeconds: 5 periodSeconds: 5 + timeoutSeconds: 2 startupProbe: failureThreshold: 120 httpGet: @@ -136,6 +138,8 @@ spec: sizeLimit: 1Gi - name: tmp emptyDir: {} + # extra time to finish processing buffered requests before pod is forcibly terminated + terminationGracePeriodSeconds: 60 --- # Source: tei/templates/horizontalPodAutoscaler.yaml # Copyright (C) 2024 Intel Corporation diff --git a/microservices-connector/config/manifests/teirerank.yaml b/microservices-connector/config/manifests/teirerank.yaml index e412ecdb..20510a18 100644 --- a/microservices-connector/config/manifests/teirerank.yaml +++ b/microservices-connector/config/manifests/teirerank.yaml @@ -107,14 +107,16 @@ spec: httpGet: path: /health port: http - initialDelaySeconds: 5 - periodSeconds: 5 + initialDelaySeconds: 8 + periodSeconds: 8 + timeoutSeconds: 4 readinessProbe: httpGet: path: /health port: http - initialDelaySeconds: 5 - periodSeconds: 5 + initialDelaySeconds: 8 + periodSeconds: 8 + timeoutSeconds: 4 startupProbe: failureThreshold: 120 httpGet: @@ -135,6 +137,8 @@ spec: sizeLimit: 1Gi - name: tmp emptyDir: {} + # extra time to finish processing buffered requests before pod is forcibly terminated + terminationGracePeriodSeconds: 60 --- # Source: teirerank/templates/horizontalPodAutoscaler.yaml # Copyright (C) 2024 Intel Corporation diff --git a/microservices-connector/config/manifests/tgi.yaml b/microservices-connector/config/manifests/tgi.yaml index 
f1d10d73..aa1f4cec 100644 --- a/microservices-connector/config/manifests/tgi.yaml +++ b/microservices-connector/config/manifests/tgi.yaml @@ -65,8 +65,6 @@ metadata: app.kubernetes.io/version: "2.1.0" app.kubernetes.io/managed-by: Helm spec: - # use explicit replica counts only of HorizontalPodAutoscaler is disabled - replicas: 1 selector: matchLabels: app.kubernetes.io/name: tgi @@ -104,19 +102,22 @@ spec: protocol: TCP livenessProbe: failureThreshold: 24 - initialDelaySeconds: 5 - periodSeconds: 5 + initialDelaySeconds: 8 + periodSeconds: 8 + timeoutSeconds: 4 tcpSocket: port: http readinessProbe: - initialDelaySeconds: 5 - periodSeconds: 5 + initialDelaySeconds: 16 + periodSeconds: 8 + timeoutSeconds: 4 tcpSocket: port: http startupProbe: - failureThreshold: 120 - initialDelaySeconds: 5 + failureThreshold: 180 + initialDelaySeconds: 10 periodSeconds: 5 + timeoutSeconds: 2 tcpSocket: port: http resources: @@ -132,6 +133,8 @@ spec: sizeLimit: 1Gi - name: tmp emptyDir: {} + # extra time to finish processing buffered requests before pod is forcibly terminated + terminationGracePeriodSeconds: 120 --- # Source: tgi/templates/horizontalPorAutoscaler.yaml # Copyright (C) 2024 Intel Corporation diff --git a/microservices-connector/config/manifests/tgi_gaudi.yaml b/microservices-connector/config/manifests/tgi_gaudi.yaml index 92ac7f87..83be888a 100644 --- a/microservices-connector/config/manifests/tgi_gaudi.yaml +++ b/microservices-connector/config/manifests/tgi_gaudi.yaml @@ -66,8 +66,6 @@ metadata: app.kubernetes.io/version: "2.1.0" app.kubernetes.io/managed-by: Helm spec: - # use explicit replica counts only of HorizontalPodAutoscaler is disabled - replicas: 1 selector: matchLabels: app.kubernetes.io/name: tgi diff --git a/microservices-connector/config/manifests/tgi_nv.yaml b/microservices-connector/config/manifests/tgi_nv.yaml index e4d03cd6..d99a2fb9 100644 --- a/microservices-connector/config/manifests/tgi_nv.yaml +++ b/microservices-connector/config/manifests/tgi_nv.yaml @@ -64,8 +64,6 @@ metadata: app.kubernetes.io/version: "2.1.0" app.kubernetes.io/managed-by: Helm spec: - # use explicit replica counts only of HorizontalPodAutoscaler is disabled - replicas: 1 selector: matchLabels: app.kubernetes.io/name: tgi