What happened?
When the storage of the TSDB is full, no alerts related to it are fired and the Watchdog alert continues to fire. This gives the impression that Prometheus is fine even though it can no longer evaluate alert expressions and therefore no longer fires any alerts.
In our environment, we fixed this by changing the Watchdog alert expression from vector(1) to present_over_time(prometheus_tsdb_head_max_time[1m]) != 0.
I already submitted this as a suggestion in the PR 2467, but I recognize this may not be the cleanest way to fix the issue.
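For reference, a minimal sketch of what the modified rule looks like in our environment (only expr differs from the Watchdog rule in kubePrometheus-prometheusRule.yaml shown under Manifests below; annotations omitted here):

- alert: Watchdog
  expr: present_over_time(prometheus_tsdb_head_max_time[1m]) != 0
  labels:
    severity: none

present_over_time returns 1 as long as prometheus_tsdb_head_max_time has at least one sample in the last minute, so the alert keeps firing during normal operation but goes stale and resolves shortly after ingestion stops, which a dead-man's-switch integration can then pick up.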
The alerts which, as far as I know, should fire are PrometheusMissingRuleEvaluations, PrometheusRuleFailures and PrometheusNotIngestingSamples. But because the metrics these alerts rely on are no longer ingested, their expressions return no data and the alerts never fire.
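To illustrate the failure mode as I understand it: once samples can no longer be appended (see the "Scrape commit failed" errors under Prometheus Logs below), Prometheus' self-scraped series go stale, the range selectors in the rule expressions become empty, and the expressions return no result instead of a value above the threshold. For example,

increase(prometheus_rule_evaluation_failures_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0

yields an empty vector as soon as prometheus_rule_evaluation_failures_total has no samples within the last 5 minutes, so PrometheusRuleFailures never even enters the pending state.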
Did you expect to see something different?
Yes. I expected either the Watchdog alert to stop firing (because the alerting chain is disrupted), one of the alerts mentioned above to fire (as, judging by their descriptions, they make the most sense here), or any other critical alert to fire and signal the situation.
How to reproduce it (as minimally and precisely as possible):
Let Prometheus scrape so much data that its storage runs full, or fill up the TSDB storage manually:
kubectl exec -ti prometheus-prometheus-prometheus-0 -- sh
dd if=/dev/zero of=/prometheus/fillfile bs=1M
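To verify that the volume is actually full, a quick check (assuming df is available in the Prometheus image and the data is mounted at /prometheus as in the commands above):

kubectl exec -ti prometheus-prometheus-prometheus-0 -- df -h /prometheus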
Environment
Prometheus Operator version:
v0.76.1
Kubernetes version information:
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.13", GitCommit:"7ba444e261616cb572b2c9e3aa6ee8876140f46a", GitTreeState:"clean", BuildDate:"2024-01-17T13:45:13Z", GoVersion:"go1.20.13", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.15+vmware.1", GitCommit:"caaf37c79da07093b65edd62edb1d35b89f4e5c7", GitTreeState:"clean", BuildDate:"2024-03-27T05:25:15Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}
Kubernetes cluster kind:
It's a VMware Tanzu Kubernetes Grid Cluster.
Manifests:
kubePrometheus-prometheusRule.yaml:
...
- alert: Watchdog
  annotations:
    description: |
      This is an alert meant to ensure that the entire alerting pipeline is functional.
      This alert is always firing, therefore it should always be firing in Alertmanager
      and always fire against a receiver. There are integrations with various notification
      mechanisms that send a notification when this alert is not firing. For example the
      "DeadMansSnitch" integration in PagerDuty.
    runbook_url: https://runbooks.prometheus-operator.dev/runbooks/general/watchdog
    summary: An alert that should always be firing to certify that Alertmanager is working properly.
  expr: vector(1)
  labels:
    severity: none
...
prometheus-prometheusRule.yaml:
...
- alert: PrometheusMissingRuleEvaluations
  annotations:
    description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has missed {{ printf "%.0f" $value }} rule group evaluations in the last 5m.
    runbook_url: https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusmissingruleevaluations
    summary: Prometheus is missing rule evaluations due to slow rule group evaluation.
  expr: |
    increase(prometheus_rule_group_iterations_missed_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
  for: 15m
  labels:
    severity: warning
...
- alert: PrometheusRuleFailures
  annotations:
    description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to evaluate {{ printf "%.0f" $value }} rules in the last 5m.
    runbook_url: https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusrulefailures
    summary: Prometheus is failing rule evaluations.
  expr: |
    increase(prometheus_rule_evaluation_failures_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
  for: 15m
  labels:
    severity: critical
...
- alert: PrometheusNotIngestingSamples
  annotations:
    description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is not ingesting samples.
    runbook_url: https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusnotingestingsamples
    summary: Prometheus is not ingesting samples.
  expr: |
    (
      sum without(type) (rate(prometheus_tsdb_head_samples_appended_total{job="prometheus-k8s",namespace="monitoring"}[5m])) <= 0
    and
      (
        sum without(scrape_job) (prometheus_target_metadata_cache_entries{job="prometheus-k8s",namespace="monitoring"}) > 0
      or
        sum without(rule_group) (prometheus_rule_group_rules{job="prometheus-k8s",namespace="monitoring"}) > 0
      )
    )
  for: 10m
  labels:
    severity: warning
...
Prometheus Operator Logs:
None
Prometheus Logs:
The Prometheus pod logs the full storage as expected:
ts=2024-10-14T07:51:30.338Z caller=scrape.go:1225 level=error component="scrape manager" scrape_pool=podMonitor/istio-system/istio-sidecars/0 target=http://11.32.17.13:15090/stats/prometheus msg="Scrape commit failed" err="write to WAL: log samples: write /prometheus/wal/00004293: no space left on device"
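(For completeness: the line above was taken from the Prometheus container logs, retrievable with something like kubectl logs prometheus-prometheus-prometheus-0 -c prometheus, assuming the default container name prometheus.)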