Dysfunctional Prometheus stack with full TSDB storage goes unnoticed #2537

Open
a-Tell opened this issue Oct 14, 2024 · 0 comments
a-Tell commented Oct 14, 2024

What happened?

When the TSDB storage is full, no alert fires to point it out and the Watchdog alert keeps firing. This gives the impression that Prometheus is fine, even though it can no longer evaluate alert expressions and therefore no longer fires any alerts at all.

In our environment, we fixed this by changing the Watchdog alert expression from vector(1) to present_over_time(prometheus_tsdb_head_max_time[1m]) != 0.
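
For reference, this is a sketch of how that change slots into the Watchdog rule from kubePrometheus-prometheusRule.yaml (stock description annotation omitted here for brevity; the full original rule is quoted under Manifests below):

    - alert: Watchdog
      annotations:
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/general/watchdog
        summary: An alert that should always be firing to certify that Alertmanager is working properly.
      # vector(1) replaced: this expression only returns a result while the TSDB head
      # is still receiving samples, so the Watchdog stops firing once ingestion stops
      # (for example because the volume is full) and a dead man's switch receiver can notice.
      expr: present_over_time(prometheus_tsdb_head_max_time[1m]) != 0
      labels:
        severity: none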

I already submitted this as a suggestion in PR 2467, but I recognize it may not be the cleanest way to fix the issue.

The alerts that, AFAIK, should fire are PrometheusMissingRuleEvaluations, PrometheusRuleFailures and PrometheusNotIngestingSamples. But because Prometheus can no longer ingest the metrics these alerts rely on, their expressions evaluate to empty results and the alerts never fire.
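
To illustrate (my reading of the behaviour): once the newest sample of these self-monitoring series is older than the 5m lookback window, increase() returns an empty instant vector rather than 0, so the > 0 filter matches nothing and the alert never even enters the pending state:

    # Evaluates to an empty result (not 0) as soon as the series has no samples
    # within the last 5m, e.g. after the TSDB volume filled up and ingestion stopped.
    increase(prometheus_rule_evaluation_failures_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0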

Did you expect to see something different?
Yes. Either the Watchdog alert should stop firing (because the alerting chain is disrupted), one of the alerts mentioned above should fire (going by their descriptions, they make the most sense), or some other critical alert should fire to point out the situation.

How to reproduce it (as minimally and precisely as possible):
Let Prometheus scrape so much data that the storage fills up, or fill up the TSDB storage manually:

# open a shell inside the Prometheus pod
kubectl exec -ti prometheus-prometheus-prometheus-0 -- sh
# fill the TSDB volume until the disk runs full
dd if=/dev/zero of=/prometheus/fillfile bs=1M
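
Once dd aborts with "no space left on device", ingestion stops within a scrape interval or two. A quick way to confirm from inside the pod (assuming the container image ships the busybox df applet):

df -h /prometheus   # the TSDB volume should now report 100% use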

Environment

  • Prometheus Operator version:

v0.76.1

  • Kubernetes version information:
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.13", GitCommit:"7ba444e261616cb572b2c9e3aa6ee8876140f46a", GitTreeState:"clean", BuildDate:"2024-01-17T13:45:13Z", GoVersion:"go1.20.13", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.15+vmware.1", GitCommit:"caaf37c79da07093b65edd62edb1d35b89f4e5c7", GitTreeState:"clean", BuildDate:"2024-03-27T05:25:15Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster kind:
    It's a VMware Tanzu Kubernetes Grid Cluster.

  • Manifests:
    kubePrometheus-prometheusRule.yaml:

...

    - alert: Watchdog
      annotations:
        description: |
          This is an alert meant to ensure that the entire alerting pipeline is functional.
          This alert is always firing, therefore it should always be firing in Alertmanager
          and always fire against a receiver. There are integrations with various notification
          mechanisms that send a notification when this alert is not firing. For example the
          "DeadMansSnitch" integration in PagerDuty.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/general/watchdog
        summary: An alert that should always be firing to certify that Alertmanager is working properly.
      expr: vector(1)
      labels:
        severity: none

...

prometheus-prometheusRule.yaml:

...

    - alert: PrometheusMissingRuleEvaluations
      annotations:
        description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has missed {{ printf "%.0f" $value }} rule group evaluations in the last 5m.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusmissingruleevaluations
        summary: Prometheus is missing rule evaluations due to slow rule group evaluation.
      expr: |
        increase(prometheus_rule_group_iterations_missed_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
      for: 15m
      labels:
        severity: warning

...

    - alert: PrometheusRuleFailures
      annotations:
        description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to evaluate {{ printf "%.0f" $value }} rules in the last 5m.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusrulefailures
        summary: Prometheus is failing rule evaluations.
      expr: |
        increase(prometheus_rule_evaluation_failures_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
      for: 15m
      labels:
        severity: critical

...

    - alert: PrometheusNotIngestingSamples
      annotations:
        description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is not ingesting samples.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusnotingestingsamples
        summary: Prometheus is not ingesting samples.
      expr: |
        (
          sum without(type) (rate(prometheus_tsdb_head_samples_appended_total{job="prometheus-k8s",namespace="monitoring"}[5m])) <= 0
        and
          (
            sum without(scrape_job) (prometheus_target_metadata_cache_entries{job="prometheus-k8s",namespace="monitoring"}) > 0
          or
            sum without(rule_group) (prometheus_rule_group_rules{job="prometheus-k8s",namespace="monitoring"}) > 0
          )
        )
      for: 10m
      labels:
        severity: warning

...
  • Prometheus Operator Logs:
    None

  • Prometheus Logs:
The Prometheus pod logs that the storage is full, as expected:

ts=2024-10-14T07:51:30.338Z caller=scrape.go:1225 level=error component="scrape manager" scrape_pool=podMonitor/istio-system/istio-sidecars/0 target=http://11.32.17.13:15090/stats/prometheus msg="Scrape commit failed" err="write to WAL: log samples: write /prometheus/wal/00004293: no space left on device"