Failed to get pods metric value: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API #644
Labels
kind/bug
Categorizes issue or PR as related to a bug.
triage/accepted
Indicates an issue or PR is ready to be actively worked on.
Comments
k8s-ci-robot added the needs-triage label (indicates an issue or PR lacks a `triage/foo` label and requires one) on Feb 27, 2024
saad946 changed the title from "invalid metrics (1 invalid out of 1), first error is: failed to get pods metric value: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API" to "Failed to get pods metric value: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API" on Feb 27, 2024
/cc @CatherineF-dev
k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Mar 7, 2024
Just curious, How does the raw data for |
Something like this, @dvp34:
{
"status": "success",
"data": {
"resultType": "vector",
"result": [
{
"metric": {
"DCGM_FI_DRIVER_VERSION": "535.171.04",
"Hostname": "qxzg-l4server",
"UUID": "GPU-557dbd17-5aa5-ade0-c563-d44fee17f8bc",
"__name__": "DCGM_FI_DEV_GPU_UTIL",
"container": "nvidia-dcgm-exporter",
"device": "nvidia1",
"endpoint": "gpu-metrics",
"exported_container": "triton",
"exported_namespace": "llm",
"exported_pod": "qwen-1gpu-75455d6c96-7jcxq",
"gpu": "1",
"instance": "10.42.0.213:9400",
"job": "nvidia-dcgm-exporter",
"modelName": "NVIDIA L4",
"namespace": "gpu-operator",
"pod": "nvidia-dcgm-exporter-rlhcx",
"service": "nvidia-dcgm-exporter"
},
"value": [
1719909159.405,
"0"
]
},
{
"metric": {
"DCGM_FI_DRIVER_VERSION": "535.171.04",
"Hostname": "qxzg-l4server",
"UUID": "GPU-557dbd17-5aa5-ade0-c563-d44fee17f8bc",
"__name__": "DCGM_FI_DEV_GPU_UTIL",
"container": "triton",
"device": "nvidia1",
"gpu": "1",
"instance": "10.42.0.213:9400",
"job": "gpu-metrics",
"kubernetes_node": "qxzg-l4server",
"modelName": "NVIDIA L4",
"namespace": "llm",
"pod": "qwen-1gpu-75455d6c96-7jcxq"
},
"value": [
1719909159.405,
"0"
]
},
{
"metric": {
"DCGM_FI_DRIVER_VERSION": "535.171.04",
"Hostname": "qxzg-l4server",
"UUID": "GPU-ec1a0983-4e27-c5c1-16f7-534319ffb62c",
"__name__": "DCGM_FI_DEV_GPU_UTIL",
"container": "nvidia-dcgm-exporter",
"device": "nvidia0",
"endpoint": "gpu-metrics",
"exported_container": "triton",
"exported_namespace": "llm",
"exported_pod": "qwen2-d63eff62-2f6a-427d-b231-e7693a1c2915-747c599cb6-4xjlb",
"gpu": "0",
"instance": "10.42.0.213:9400",
"job": "nvidia-dcgm-exporter",
"modelName": "NVIDIA L4",
"namespace": "gpu-operator",
"pod": "nvidia-dcgm-exporter-rlhcx",
"service": "nvidia-dcgm-exporter"
},
"value": [
1719909159.405,
"0"
]
},
{
"metric": {
"DCGM_FI_DRIVER_VERSION": "535.171.04",
"Hostname": "qxzg-l4server",
"UUID": "GPU-ec1a0983-4e27-c5c1-16f7-534319ffb62c",
"__name__": "DCGM_FI_DEV_GPU_UTIL",
"container": "triton",
"device": "nvidia0",
"gpu": "0",
"instance": "10.42.0.213:9400",
"job": "gpu-metrics",
"kubernetes_node": "qxzg-l4server",
"modelName": "NVIDIA L4",
"namespace": "llm",
"pod": "qwen2-d63eff62-2f6a-427d-b231-e7693a1c2915-747c599cb6-4xjlb"
},
"value": [
1719909159.405,
"0"
]
}
]
}
}
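The response above contains two series per GPU: one scraped through the `nvidia-dcgm-exporter` job, where `pod` and `namespace` name the exporter and the workload appears under `exported_pod` and `exported_namespace`, and one from the `gpu-metrics` job, where `pod` and `namespace` name the workload directly. A minimal Python sketch of grouping such a result by workload pod follows. The function name `series_by_workload` is hypothetical, and the label interpretation is an assumption read off this sample, not something the adapter guarantees.

```python
from collections import defaultdict


def series_by_workload(response):
    """Group instant-vector samples by the (namespace, pod) of the workload."""
    grouped = defaultdict(list)
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        # Assumption: exported_* labels, when present, identify the workload pod;
        # otherwise the plain namespace/pod labels do.
        namespace = labels.get("exported_namespace", labels.get("namespace"))
        pod = labels.get("exported_pod", labels.get("pod"))
        _timestamp, value = sample["value"]  # Prometheus returns the value as a string
        grouped[(namespace, pod)].append((labels.get("job"), float(value)))
    return dict(grouped)


# Two series trimmed from the response above (same GPU, two scrape jobs).
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "job": "nvidia-dcgm-exporter",
                    "namespace": "gpu-operator",
                    "pod": "nvidia-dcgm-exporter-rlhcx",
                    "exported_namespace": "llm",
                    "exported_pod": "qwen-1gpu-75455d6c96-7jcxq",
                },
                "value": [1719909159.405, "0"],
            },
            {
                "metric": {
                    "job": "gpu-metrics",
                    "namespace": "llm",
                    "pod": "qwen-1gpu-75455d6c96-7jcxq",
                },
                "value": [1719909159.405, "0"],
            },
        ],
    },
}

for workload, samples in series_by_workload(sample_response).items():
    print(workload, samples)
```

If an adapter rule maps the plain `pod` label to the pod resource, series of the first kind may end up attributed to the exporter pod rather than the workload, so it is worth checking which labels the rule actually uses.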
What happened?:
I constantly hit this error on the HPA of one of my services, which is configured to scale on custom metrics. Sometimes the HPA shows able to scale as True and can fetch the custom metric, but most of the time it cannot. Because of that, the HPA is not able to scale down the pods.
This is the HPA description for one of the affected services.
Affected service HPA description:
Another service using the same HPA configuration does not show this error when its HPA is described.
This is the HPA description from that other service.
Running service HPA:
The behaviour is random: in both services the HPA is sometimes able to collect the custom metric and sometimes not.
What did you expect to happen?:
I expect prometheus-adapter and the HPA to behave the same way for both services, since both use the same configuration.
Please provide the prometheus-adapter config:
prometheus-adapter config
When checking whether the metric exists, I got this response:
Please provide the HPA resource used for autoscaling:
HPA yaml
HPA YAML for both services is here:
Not working:
Working:
Please provide the HPA status:
We observe these events in both services from time to time. Sometimes the HPA is able to collect the metric for ServiceB, but most of the time it cannot collect it for ServiceA.
This is the HPA status. It seems able to get the memory utilization, but when we describe the HPA we see the issues stated earlier: the HPA is unable to collect the custom metric or trigger any scaling activity.
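To see what the HPA controller would receive for the pods metric, the custom metrics API can be queried directly. A minimal Python sketch of building the request path follows, assuming the adapter is registered under `custom.metrics.k8s.io/v1beta1`. The helper name `pods_metric_path` and the namespace `my-namespace` are placeholders, not values from this report.

```python
from urllib.parse import quote


def pods_metric_path(namespace, metric, api_version="v1beta1"):
    """Build the custom metrics API path for a pods metric across all pods in a namespace."""
    return (
        f"/apis/custom.metrics.k8s.io/{api_version}"
        f"/namespaces/{quote(namespace, safe='')}"
        f"/pods/*/{quote(metric, safe='')}"
    )


# Placeholder namespace; substitute the namespace of the affected service.
path = pods_metric_path("my-namespace", "DCGM_FI_DEV_FB_USED_AVG")
print(f'kubectl get --raw "{path}"')
```

An empty `items` list in the response corresponds to the "no metrics returned from custom metrics API" error reported by the HPA, so running the printed command a few times for both services can show whether the metric intermittently disappears at the adapter level.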
Please provide the prometheus-adapter logs with -v=6 around the time the issue happened:
prometheus-adapter logs
Anything else we need to know?:
Environment:
prometheus-adapter version: prometheus-adapter-3.2.2 v0.9.1
prometheus version: kube-prometheus-stack-56.6.2 v0.71.2
Kubernetes version (use kubectl version):
Client Version: v1.28.3-eks-e71965b
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.26.12-eks-5e0fdde
Cloud provider or hardware configuration: AWS EKS
Other info: