
Loki retention stops working after a while #15479

Open
janxe opened this issue Dec 18, 2024 · 0 comments

Describe the bug
Hi, it seems that log retention for our Loki deployment stops working after we have redeployed Loki a few times, and all logs then appear to be stored forever.
I can't say for sure what triggers this, only that it has occurred at least 2-3 times now, each time after we redeployed Loki with a changed PVC size and other miscellaneous changes while reusing the same S3 buckets.

To Reproduce

  1. Deploy Loki via the Helm chart
  2. Delete the Loki deployment
  3. Redeploy Loki with different settings (different PVC sizes, replica counts, etc.) while reusing the same S3 buckets as the previous deployment

Expected behavior
Loki retention continues to be applied to logs stored by the later deployments, including the current one.

Environment:

  • Infrastructure: company Kubernetes cluster, local Hitachi S3 storage
  • Deployment tool: helm, argocd

Screenshots, Promtail config, or terminal output
Here is the configuration used, together with some Loki log errors we can see.
loki_index_20075 seems to be the newest index stored in the S3 bucket.
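(A side note for anyone triaging: with a 24h index `period`, the numeric suffix of a periodic index table should be the number of days elapsed since the Unix epoch, so it can be decoded to a date. A quick sketch under that assumption; the helper name is ours, not a Loki API:)

```python
from datetime import datetime, timedelta, timezone

def table_date(table_name: str, period_hours: int = 24) -> datetime:
    """Decode a periodic index table suffix (periods elapsed since the Unix epoch)."""
    n = int(table_name.rsplit("_", 1)[-1])
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(hours=n * period_hours)

print(table_date("loki_index_20075").date())  # 2024-12-18
```

That decodes to the day this issue was filed, which is consistent with it being the newest table.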

loki:
  podAnnotations:
    kyverno.io/inject-cacerts: enabled

  podSecurityContext:
    runAsNonRoot: true
    runAsGroup: 10001
    runAsUser: 10001
    fsGroup: 10001
    seccompProfile:
      type: RuntimeDefault

  auth_enabled: true

  analytics:
    reporting_enabled: false

  compactor:
    retention_enabled: true
    delete_request_store: s3
    retention_delete_delay: 10m
  limits_config:
    retention_period: 49h
    max_streams_per_user: 100000

  frontend:
    max_outstanding_per_tenant: 4096

  commonConfig:
    ring:
      kvstore:
        store: memberlist

  storage:
    filesystem: null
    s3:
      endpoint: "${LOKI_ENDPOINT}"
      accessKeyId: "${LOKI_ACCESS_KEY_ID}"
      secretAccessKey: "${LOKI_SECRET_ACCESS_KEY}"
      s3ForcePathStyle: false
    bucketNames:
      chunks: loki-chunk-dev
      ruler: loki-ruler-dev
      admin: loki-admin-dev

  # tsdb and v13 are needed for Loki v3.0.0
  # https://grafana.com/docs/loki/latest/operations/storage/tsdb/
  schemaConfig:
    configs:
      - from: "2022-01-11"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v12
        store: boltdb-shipper

      - from: "2024-09-03"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v13
        store: tsdb

  # the TSDB index dispatches many more, but each individually smaller, requests.
  # We increase the pending request queue sizes to compensate.
  query_scheduler:
    max_outstanding_requests_per_tenant: 32768

  # Each `querier` component process runs a number of parallel workers to process queries simultaneously.
  querier:
    max_concurrent: 16
    multi_tenant_queries_enabled: true

# Tests need to be disabled in order to disable Canary
test:
  enabled: false

lokiCanary:
  enabled: false

gateway:
  replicas: 3
  resources:
    requests:
      memory: 25Mi
      cpu: 5m
  podSecurityContext:
    fsGroup: 101
    runAsGroup: 101
    runAsNonRoot: true
    runAsUser: 101
    seccompProfile:
      type: RuntimeDefault

write:
  replicas: 2
  resources:
    limits:
      memory: 1500Mi
      cpu: 300m
    requests:
      memory: 1500Mi
      cpu: 50m
  persistence:
    size: 3Gi
  affinity:
    podAntiAffinity:
      # this needs to be empty, otherwise it will conflict with preferred
      requiredDuringSchedulingIgnoredDuringExecution:

      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app.kubernetes.io/component: write
                app.kubernetes.io/instance: loki
                app.kubernetes.io/name: loki
            topologyKey: kubernetes.io/hostname
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: LOKI_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-block-storage
          key: LOKI_ENDPOINT
    - name: LOKI_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-block-storage
          key: LOKI_ACCESS_KEY_ID
    - name: LOKI_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-block-storage
          key: LOKI_SECRET_ACCESS_KEY

read:
  replicas: 2
  resources:
    limits:
      memory: 700Mi
      cpu: 400m
    requests:
      memory: 256Mi
      cpu: 100m
  persistence:
    size: 5Gi
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: LOKI_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-block-storage
          key: LOKI_ENDPOINT
    - name: LOKI_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-block-storage
          key: LOKI_ACCESS_KEY_ID
    - name: LOKI_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-block-storage
          key: LOKI_SECRET_ACCESS_KEY

backend:
  replicas: 2
  resources:
    limits:
      memory: 1300Mi
      cpu: 600m
    requests:
      memory: 592Mi
      cpu: 50m
  persistence:
    size: 7Gi
    enableStatefulSetAutoDeletePVC: true
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: LOKI_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-block-storage
          key: LOKI_ENDPOINT
    - name: LOKI_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-block-storage
          key: LOKI_ACCESS_KEY_ID
    - name: LOKI_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-block-storage
          key: LOKI_SECRET_ACCESS_KEY
  extraVolumes:
    - name: override-rules
      configMap:
        name: loki-tenant-override-rules
  extraVolumeMounts:
    - name: override-rules
      mountPath: /etc/loki/config/override/

sidecar:
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
    readOnlyRootFilesystem: true

memcached:
  podSecurityContext:
    runAsNonRoot: true
    runAsGroup: 11211
    runAsUser: 11211
    seccompProfile:
      type: RuntimeDefault

resultsCache:
  enabled: true
  resources:
    limits:
      memory: 1331Mi
    requests:
      cpu: 20m
      memory: 1331Mi
  defaultValidity: 12h
  allocatedMemory: 1024

chunksCache:
  enabled: true
  resources:
    limits:
      memory: 2662Mi
    requests:
      cpu: 20m
      memory: 2662Mi
  defaultValidity: 0s
  allocatedMemory: 2048

rbac:
  namespaced: true

level=error ts=2024-12-18T15:00:46.160085038Z caller=compactor.go:571 msg="failed to apply retention" err="SerializationError: empty response payload\n\tstatus code: 204, request id: , host id: \ncaused by: EOF"
level=error ts=2024-12-18T15:00:46.159994556Z caller=compactor.go:662 msg="failed to compact files" table=loki_index_20075 err="SerializationError: empty response payload\n\tstatus code: 204, request id: , host id: \ncaused by: EOF"
level=info ts=2024-12-18T15:00:35.726949025Z caller=expiration.go:78 msg="overall smallest retention period 1734357635.726, default smallest retention period 1734357635.726"
level=error ts=2024-12-18T14:55:33.479896021Z caller=compactor.go:548 msg="failed to run compaction" err="SerializationError: empty response payload\n\tstatus code: 204, request id: , host id: \ncaused by: EOF"
level=error ts=2024-12-18T14:55:33.479838188Z caller=compactor.go:662 msg="failed to compact files" table=loki_index_20075 err="SerializationError: empty response payload\n\tstatus code: 204, request id: , host id: \ncaused by: EOF"
level=error ts=2024-12-18T14:45:35.726608822Z caller=compactor.go:561 msg="failed to apply retention" err="SerializationError: empty response payload\n\tstatus code: 204, request id: , host id: \ncaused by: EOF"
level=error ts=2024-12-18T14:45:35.72653095Z caller=compactor.go:662 msg="failed to compact files" table=loki_index_20075 err="SerializationError: empty response payload\n\tstatus code: 204, request id: , host id: \ncaused by: EOF"
level=info ts=2024-12-18T14:45:24.901131705Z caller=expiration.go:78 msg="overall smallest retention period 1734356724.901, default smallest retention period 1734356724.901"
level=info ts=2024-12-18T14:45:24.90106913Z caller=compactor.go:592 msg="compactor started"
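For anyone grepping these logs at scale: the lines are logfmt, so the failing table and error can be pulled out programmatically. A minimal sketch (deliberately not a complete logfmt parser) assuming the shape shown above:

```python
import re

# key=value pairs; values are either bare tokens or double-quoted strings
# with backslash escapes (as in the err="..." fields above).
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')

def parse_logfmt(line: str) -> dict:
    fields = {}
    for key, val in PAIR.findall(line):
        if val.startswith('"') and val.endswith('"'):
            # Strip quotes and expand \n, \t escapes.
            val = val[1:-1].encode().decode("unicode_escape")
        fields[key] = val
    return fields

rec = parse_logfmt(
    'level=error ts=2024-12-18T15:00:46.159994556Z caller=compactor.go:662 '
    'msg="failed to compact files" table=loki_index_20075 '
    'err="SerializationError: empty response payload\\n\\tstatus code: 204, '
    'request id: , host id: \\ncaused by: EOF"'
)
print(rec["table"], "->", rec["msg"])  # loki_index_20075 -> failed to compact files
```

Filtering on `rec["level"] == "error"` and grouping by `rec["table"]` makes it easy to see whether the failures are pinned to a single index table, as they appear to be here.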