Dysfunctional Prometheus stack with full TSDB storage goes unnoticed #2537

Open
a-Tell opened this issue Oct 14, 2024 · 0 comments
a-Tell commented Oct 14, 2024

What happened?

When the TSDB storage is full, no alert fires to point it out and the Watchdog alert keeps firing. This gives the impression that Prometheus is fine, even though it can no longer evaluate alert expressions and therefore no longer fires any alerts at all.

In our environment, we fixed this by changing the Watchdog alert expression from vector(1) to present_over_time(prometheus_tsdb_head_max_time[1m]) != 0.
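
For reference, this is a sketch of how that change slots into the Watchdog rule from kubePrometheus-prometheusRule.yaml (stock description annotation omitted here for brevity; the full original rule is quoted under Manifests below):

    - alert: Watchdog
      annotations:
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/general/watchdog
        summary: An alert that should always be firing to certify that Alertmanager is working properly.
      # vector(1) replaced: this expression only returns a result while the TSDB head
      # is still receiving samples, so the Watchdog stops firing once ingestion stops
      # (for example because the volume is full) and a dead man's switch receiver can notice.
      expr: present_over_time(prometheus_tsdb_head_max_time[1m]) != 0
      labels:
        severity: none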

I already submitted this as a suggestion in PR 2467, but I recognize it may not be the cleanest way to fix the issue.

The alerts that, AFAIK, should fire are PrometheusMissingRuleEvaluations, PrometheusRuleFailures and PrometheusNotIngestingSamples. But because Prometheus can no longer ingest the metrics these alerts rely on, their expressions evaluate to empty results and the alerts never fire.
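
To illustrate (my reading of the behaviour): once the newest sample of these self-monitoring series is older than the 5m lookback window, increase() returns an empty instant vector rather than 0, so the > 0 filter matches nothing and the alert never even enters the pending state:

    # Evaluates to an empty result (not 0) as soon as the series has no samples
    # within the last 5m, e.g. after the TSDB volume filled up and ingestion stopped.
    increase(prometheus_rule_evaluation_failures_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0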

Did you expect to see something different?
Yes. Either the Watchdog alert should stop firing (because the alerting chain is disrupted), one of the alerts mentioned above should fire (going by their descriptions, they make the most sense), or some other critical alert should fire to point out the situation.

How to reproduce it (as minimally and precisely as possible):
Let Prometheus scrape so much data that the storage fills up, or fill up the TSDB storage manually:

# open a shell inside the Prometheus pod
kubectl exec -ti prometheus-prometheus-prometheus-0 -- sh
# fill the TSDB volume until the disk runs full
dd if=/dev/zero of=/prometheus/fillfile bs=1M
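
Once dd aborts with "no space left on device", ingestion stops within a scrape interval or two. A quick way to confirm from inside the pod (assuming the container image ships the busybox df applet):

df -h /prometheus   # the TSDB volume should now report 100% use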

Environment

  • Prometheus Operator version:

v0.76.1

  • Kubernetes version information:
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.13", GitCommit:"7ba444e261616cb572b2c9e3aa6ee8876140f46a", GitTreeState:"clean", BuildDate:"2024-01-17T13:45:13Z", GoVersion:"go1.20.13", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.15+vmware.1", GitCommit:"caaf37c79da07093b65edd62edb1d35b89f4e5c7", GitTreeState:"clean", BuildDate:"2024-03-27T05:25:15Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster kind:
    It's a VMware Tanzu Kubernetes Grid Cluster.

  • Manifests:
    kubePrometheus-prometheusRule.yaml:

...

    - alert: Watchdog
      annotations:
        description: |
          This is an alert meant to ensure that the entire alerting pipeline is functional.
          This alert is always firing, therefore it should always be firing in Alertmanager
          and always fire against a receiver. There are integrations with various notification
          mechanisms that send a notification when this alert is not firing. For example the
          "DeadMansSnitch" integration in PagerDuty.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/general/watchdog
        summary: An alert that should always be firing to certify that Alertmanager is working properly.
      expr: vector(1)
      labels:
        severity: none

...

prometheus-prometheusRule.yaml:

...

    - alert: PrometheusMissingRuleEvaluations
      annotations:
        description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has missed {{ printf "%.0f" $value }} rule group evaluations in the last 5m.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusmissingruleevaluations
        summary: Prometheus is missing rule evaluations due to slow rule group evaluation.
      expr: |
        increase(prometheus_rule_group_iterations_missed_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
      for: 15m
      labels:
        severity: warning

...

    - alert: PrometheusRuleFailures
      annotations:
        description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to evaluate {{ printf "%.0f" $value }} rules in the last 5m.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusrulefailures
        summary: Prometheus is failing rule evaluations.
      expr: |
        increase(prometheus_rule_evaluation_failures_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
      for: 15m
      labels:
        severity: critical

...

    - alert: PrometheusNotIngestingSamples
      annotations:
        description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is not ingesting samples.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusnotingestingsamples
        summary: Prometheus is not ingesting samples.
      expr: |
        (
          sum without(type) (rate(prometheus_tsdb_head_samples_appended_total{job="prometheus-k8s",namespace="monitoring"}[5m])) <= 0
        and
          (
            sum without(scrape_job) (prometheus_target_metadata_cache_entries{job="prometheus-k8s",namespace="monitoring"}) > 0
          or
            sum without(rule_group) (prometheus_rule_group_rules{job="prometheus-k8s",namespace="monitoring"}) > 0
          )
        )
      for: 10m
      labels:
        severity: warning

...
  • Prometheus Operator Logs:
    None

  • Prometheus Logs:
The Prometheus pod logs that the storage is full, as expected:

ts=2024-10-14T07:51:30.338Z caller=scrape.go:1225 level=error component="scrape manager" scrape_pool=podMonitor/istio-system/istio-sidecars/0 target=http://11.32.17.13:15090/stats/prometheus msg="Scrape commit failed" err="write to WAL: log samples: write /prometheus/wal/00004293: no space left on device"