node-installer job does not terminate properly #140

Open
voigt opened this issue May 20, 2024 · 2 comments
Labels
kind/bug Something isn't working

voigt commented May 20, 2024

As part of #68 I investigated an issue in the containerd restart routine. When the node-installer installs a runtime and restarts containerd, the corresponding pod terminates with status Unknown.

Overview:

kubectl get job
NAME                            COMPLETIONS   DURATION   AGE
kwasm-worker-spin-v2-install    1/1           28s        21m
kubectl get po
NAME                                  READY   STATUS      RESTARTS   AGE
kwasm-worker-spin-v2-install-n82d9    0/1     Unknown     0          7m25s
kwasm-worker-spin-v2-install-rq78d    0/1     Completed   0          7m3s

Logs of Pod with status Unknown

kubectl logs kwasm-worker-spin-v2-install-n82d9 -c downloader
2024-05-20T20:49:40     INFO    start downloading shim from  https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz...
2024-05-20T20:49:42     INFO    download successful:
total 40M
drwxrwxrwx    1 root     root          46 May 20 20:49 .
drwxr-xr-x    1 root     root          48 May 20 20:49 ..
-rwxr-xr-x    1 1001     127        39.6M May  8 17:13 containerd-shim-spin-v2
kubectl logs kwasm-worker-spin-v2-install-n82d9 -c provisioner
2024/05/20 20:49:46 INFO shim installed shim=spin-v2 path=/opt/kwasm/bin/containerd-shim-spin-v2 new-version=true
2024/05/20 20:49:46 INFO shim configured shim=spin-v2 path=/etc/containerd/config.toml
2024/05/20 20:49:46 INFO restarting containerd

Logs of Pod with status Completed

kubectl logs kwasm-worker-spin-v2-install-rq78d -c downloader
2024-05-20T20:49:57     INFO    start downloading shim from  https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz...
2024-05-20T20:49:59     INFO    download successful:
total 40M
drwxrwxrwx    1 root     root          46 May 20 20:49 .
drwxr-xr-x    1 root     root          48 May 20 20:49 ..
-rwxr-xr-x    1 1001     127        39.6M May  8 17:13 containerd-shim-spin-v2
kubectl logs kwasm-worker-spin-v2-install-rq78d -c provisioner
2024/05/20 20:50:00 INFO shim installed shim=spin-v2 path=/opt/kwasm/bin/containerd-shim-spin-v2 new-version=false
2024/05/20 20:50:00 INFO runtime config already exists, skipping runtime=spin-v2
2024/05/20 20:50:00 INFO shim configured shim=spin-v2 path=/etc/containerd/config.toml
2024/05/20 20:50:00 INFO nothing changed, nothing more to do

The Completed pod only gets scheduled in the first place because the first one did not terminate successfully, even though the actual work (rewriting the containerd config and removing the binary) is done. As a result, the second run of the job has nothing left to do.

Description of Pod with Status Unknown

    State:          Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Mon, 20 May 2024 22:49:46 +0200
      Finished:     Mon, 20 May 2024 22:49:48 +0200
kubectl describe po kwasm-worker-spin-v2-install-n82d9
Name:             kwasm-worker-spin-v2-install-n82d9
Namespace:        default
Priority:         0
Service Account:  default
Node:             kwasm-worker/192.168.228.5
Start Time:       Mon, 20 May 2024 22:49:35 +0200
Labels:           batch.kubernetes.io/controller-uid=7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
                  batch.kubernetes.io/job-name=kwasm-worker-spin-v2-install
                  controller-uid=7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
                  job-name=kwasm-worker-spin-v2-install
Annotations:      <none>
Status:           Failed
IP:               10.244.2.2
IPs:
  IP:           10.244.2.2
Controlled By:  Job/kwasm-worker-spin-v2-install
Init Containers:
  downloader:
    Container ID:   containerd://7f63983e513efa392e3cc684bf53d2553aeb898b4bfe08fb22229fbae83406cb
    Image:          ghcr.io/spinkube/shim-downloader:latest-feat-add_shim_downloader
    Image ID:       ghcr.io/spinkube/shim-downloader@sha256:719f54c518fc0fc65abbe8ac27978ea188d13faee23530544faf9d622aa2be92
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 20 May 2024 22:49:40 +0200
      Finished:     Mon, 20 May 2024 22:49:42 +0200
    Ready:          True
    Restart Count:  0
    Environment:
      SHIM_NAME:      spin-v2
      SHIM_LOCATION:  https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz
    Mounts:
      /assets from shim-download (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wnr2x (ro)
Containers:
  provisioner:
    Container ID:  containerd://92dd4c994b2fc95d269b5de630c00f55fff233d04d1d649a6b69ce512936278b
    Image:         ghcr.io/spinkube/node-installer:latest-feat-add_shim_downloader
    Image ID:      ghcr.io/spinkube/node-installer@sha256:fcbfa4d8197d3de3b9953219af6a8784f23abf7d798150b2c2a606daaeebe6df
    Port:          <none>
    Host Port:     <none>
    Args:
      install
      -H
      /mnt/node-root
      -r
      spin-v2
    State:          Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Mon, 20 May 2024 22:49:46 +0200
      Finished:     Mon, 20 May 2024 22:49:47 +0200
    Ready:          False
    Restart Count:  0
    Environment:
      HOST_ROOT:            /mnt/node-root
      SHIM_FETCH_STRATEGY:  /mnt/node-root
    Mounts:
      /assets from shim-download (rw)
      /mnt/node-root from root-mount (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wnr2x (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  shim-download:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  root-mount:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  kube-api-access-wnr2x:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason   Age   From     Message
  ----    ------   ----  ----     -------
  Normal  Pulling  25m   kubelet  Pulling image "ghcr.io/spinkube/shim-downloader:latest-feat-add_shim_downloader"
  Normal  Pulled   25m   kubelet  Successfully pulled image "ghcr.io/spinkube/shim-downloader:latest-feat-add_shim_downloader" in 4.108s (4.108s including waiting)
  Normal  Created  25m   kubelet  Created container downloader
  Normal  Started  25m   kubelet  Started container downloader
  Normal  Pulling  25m   kubelet  Pulling image "ghcr.io/spinkube/node-installer:latest-feat-add_shim_downloader"
  Normal  Pulled   25m   kubelet  Successfully pulled image "ghcr.io/spinkube/node-installer:latest-feat-add_shim_downloader" in 3.105s (3.105s including waiting)
  Normal  Created  25m   kubelet  Created container provisioner
  Normal  Started  25m   kubelet  Started container provisioner
Entire Job resource (e.g. for reproducing the bug)
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    kwasm.sh/nodeName: kwasm-worker
    kwasm.sh/operation: install
    kwasm.sh/shimName: spin-v2
  labels:
    kwasm-worker-spin-v2-install: "true"
    kwasm.sh/job: "true"
    kwasm.sh/operation: install
    kwasm.sh/shimName: spin-v2
  name: kwasm-worker-spin-v2-install
  namespace: default
spec:
  backoffLimit: 6
  completionMode: NonIndexed
  completions: 1
  manualSelector: false
  parallelism: 1
  podReplacementPolicy: TerminatingOrFailed
  selector:
    matchLabels:
      batch.kubernetes.io/controller-uid: 7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
  suspend: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        batch.kubernetes.io/controller-uid: 7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
        batch.kubernetes.io/job-name: kwasm-worker-spin-v2-install
        controller-uid: 7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
        job-name: kwasm-worker-spin-v2-install
    spec:
      containers:
      - args:
        - install
        - -H
        - /mnt/node-root
        - -r
        - spin-v2
        env:
        - name: HOST_ROOT
          value: /mnt/node-root
        - name: SHIM_FETCH_STRATEGY
          value: /mnt/node-root
        image: ghcr.io/spinkube/node-installer:latest-feat-add_shim_downloader
        imagePullPolicy: IfNotPresent
        name: provisioner
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /mnt/node-root
          name: root-mount
        - mountPath: /assets
          name: shim-download
      dnsPolicy: ClusterFirst
      hostPID: true
      initContainers:
      - env:
        - name: SHIM_NAME
          value: spin-v2
        - name: SHIM_LOCATION
          value: https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz
        image: ghcr.io/spinkube/shim-downloader:latest-feat-add_shim_downloader
        imagePullPolicy: IfNotPresent
        name: downloader
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /assets
          name: shim-download
      nodeName: kwasm-worker
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir: {}
        name: shim-download
      - hostPath:
          path: /
          type: ""
        name: root-mount
status:
  completionTime: "2024-05-20T20:50:03Z"
  conditions:
  - lastProbeTime: "2024-05-20T20:50:03Z"
    lastTransitionTime: "2024-05-20T20:50:03Z"
    status: "True"
    type: Complete
  failed: 1
  ready: 0
  startTime: "2024-05-20T20:49:35Z"
  succeeded: 1
  terminating: 0
  uncountedTerminatedPods: {}

While the goal of installing/uninstalling the shim is achieved, this is not the desired behavior and calls for a solution.

voigt commented May 20, 2024

The install pods of kwasm do not terminate with status Unknown, but with Completed. The main difference is that kwasm's install script uses the system scheduler's restart functionality.

https://github.com/KWasm/kwasm-node-installer/blob/0ee6ec416f56d35449fbe2f6af072a8643e61686/script/installer.sh#L65

In the case of systemd, this means that containerd receives a SIGTERM, and a SIGKILL only after 90 seconds (source).

node-installer directly sends a SIGHUP to the containerd process, which seems to me to be the issue.

err = syscall.Kill(pid, syscall.SIGHUP)

voigt added the kind/bug label May 20, 2024

vdice commented Nov 6, 2024

Seeing similar behavior in the uninstall jobs (when deleting a shim). The first pod deletes the shim and restarts containerd but ends with status Unknown. The second and subsequent pods then enter a failure loop, failing with e.g.:

$ k -n rcm logs kind-worker1-spin-v2-uninstall-6m57l
2024/11/06 22:39:35 INFO uninstall called shim=spin-v2
2024/11/06 22:39:35 ERROR failed to uninstall error="failed to delete shim '/opt/kwasm/bin/spin-v2': shim spin-v2 not installed"
