
Still connecting to unix:///var/lib/kubelet/csi-plugins/*.csi.alibabacloud.com/csi.sock #1127

Open
lliiang opened this issue Aug 6, 2024 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@lliiang

lliiang commented Aug 6, 2024

What happened:

On two nodes in the cluster, the csi-plugin pod (csi-plugin-h4qhz) keeps reporting errors and restarting. Screenshots of the logs are below:

[Screenshots: csi-plugin error logs from the affected nodes]

The container logs are attached below:
csi-plugin-h4qhz-nas-driver-registrar.log
csi-plugin-h4qhz-disk-driver-registrar.log

csi-plugin-h4qhz-csi-plugin.log
csi-plugin-h4qhz-oss-driver-registrar.log

What you expected to happen:

The cluster has more than a dozen nodes, and only two of them report this error. The DaemonSet YAML is below:
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: csi-plugin
  namespace: kube-system
  uid: 509d3cfc-0dbe-4ebd-8d79-3b8c52774d17
  resourceVersion: '601102482'
  generation: 5
  creationTimestamp: '2023-03-21T14:45:10Z'
  annotations:
    deprecated.daemonset.template.generation: '5'
spec:
  selector:
    matchLabels:
      app: csi-plugin
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: csi-plugin
      annotations:
        kubectl.kubernetes.io/restartedAt: '2024-06-19T22:22:37+08:00'
    spec:
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      serviceAccountName: csi-admin
      serviceAccount: csi-admin
      hostPID: true
      schedulerName: default-scheduler
      hostNetwork: true
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: type
                    operator: NotIn
                    values:
                      - virtual-kubelet
      terminationGracePeriodSeconds: 30
      securityContext: {}
      containers:
        - name: disk-driver-registrar
          image: registry-cn-hangzhou.ack.aliyuncs.com/acs/csi-node-driver-registrar:v2.3.1-038aeb6-aliyun
          args:
            - '--v=5'
            - '--csi-address=/var/lib/kubelet/csi-plugins/diskplugin.csi.alibabacloud.com/csi.sock'
            - '--kubelet-registration-path=/var/lib/kubelet/csi-plugins/diskplugin.csi.alibabacloud.com/csi.sock'
          resources:
            limits:
              cpu: 500m
              memory: 1Gi
            requests:
              cpu: 10m
              memory: 16Mi
          volumeMounts:
            - name: kubelet-dir
              mountPath: /var/lib/kubelet
            - name: registration-dir
              mountPath: /registration
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
        - name: nas-driver-registrar
          image: registry-cn-hangzhou.ack.aliyuncs.com/acs/csi-node-driver-registrar:v2.3.1-038aeb6-aliyun
          args:
            - '--v=5'
            - '--csi-address=/var/lib/kubelet/csi-plugins/nasplugin.csi.alibabacloud.com/csi.sock'
            - '--kubelet-registration-path=/var/lib/kubelet/csi-plugins/nasplugin.csi.alibabacloud.com/csi.sock'
          resources:
            limits:
              cpu: 500m
              memory: 1Gi
            requests:
              cpu: 10m
              memory: 16Mi
          volumeMounts:
            - name: kubelet-dir
              mountPath: /var/lib/kubelet/
            - name: registration-dir
              mountPath: /registration
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
        - name: oss-driver-registrar
          image: registry-cn-hangzhou.ack.aliyuncs.com/acs/csi-node-driver-registrar:v2.3.1-038aeb6-aliyun
          args:
            - '--v=5'
            - '--csi-address=/var/lib/kubelet/csi-plugins/ossplugin.csi.alibabacloud.com/csi.sock'
            - '--kubelet-registration-path=/var/lib/kubelet/csi-plugins/ossplugin.csi.alibabacloud.com/csi.sock'
          resources:
            limits:
              cpu: 500m
              memory: 1Gi
            requests:
              cpu: 10m
              memory: 16Mi
          volumeMounts:
            - name: kubelet-dir
              mountPath: /var/lib/kubelet/
            - name: registration-dir
              mountPath: /registration
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
        - name: csi-plugin
          image: registry-cn-hangzhou.ack.aliyuncs.com/acs/csi-plugin:v1.24.9-74f8490-aliyun
          args:
            - '--endpoint=$(CSI_ENDPOINT)'
            - '--v=2'
            - '--driver=oss,nas,disk'
          env:
            - name: KUBE_NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: CSI_ENDPOINT
              value: unix://var/lib/kubelet/csi-plugins/driverplugin.csi.alibabacloud.com-replace/csi.sock
            - name: MAX_VOLUMES_PERNODE
              value: '15'
            - name: SERVICE_TYPE
              value: plugin
            - name: ACCESS_KEY_ID
              value: LTAI5t6KKbiyequnsVeJHY55
            - name: ACCESS_KEY_SECRET
              value: S6UvK6rIVheVO4Y4fAiyVl2PZXNRMs
          resources:
            limits:
              cpu: 500m
              memory: 1Gi
            requests:
              cpu: 100m
              memory: 128Mi
          readinessProbe:
            httpGet:
              path: /healthz
              port: healthz
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 5
            periodSeconds: 30
            successThreshold: 1
            failureThreshold: 5
          livenessProbe:
            httpGet:
              path: /healthz
              port: healthz
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 5
            periodSeconds: 30
            successThreshold: 1
            failureThreshold: 5
          securityContext:
            privileged: true
            allowPrivilegeEscalation: true
          ports:
            - name: healthz
              hostPort: 11260
              containerPort: 11260
              protocol: TCP
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: kubelet-dir
              mountPath: /var/lib/kubelet/
              mountPropagation: Bidirectional
            - name: etc
              mountPath: /host/etc
            - name: host-log
              mountPath: /var/log/
            - name: ossconnectordir
              mountPath: /host/usr/
            - name: container-dir
              mountPath: /var/lib/container
              mountPropagation: Bidirectional
            - name: host-dev
              mountPath: /dev
              mountPropagation: HostToContainer
            - name: addon-token
              readOnly: true
              mountPath: /var/addon
            - name: fuse-metrics-dir
              mountPath: /host/var/run/
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      volumes:
        - name: fuse-metrics-dir
          hostPath:
            path: /var/run/
            type: DirectoryOrCreate
        - name: registration-dir
          hostPath:
            path: /var/lib/kubelet/plugins_registry
            type: DirectoryOrCreate
        - name: container-dir
          hostPath:
            path: /var/lib/container
            type: DirectoryOrCreate
        - name: kubelet-dir
          hostPath:
            path: /var/lib/kubelet
            type: Directory
        - name: host-dev
          hostPath:
            path: /dev
            type: ''
        - name: host-log
          hostPath:
            path: /var/log/
            type: ''
        - name: etc
          hostPath:
            path: /etc
            type: ''
        - name: ossconnectordir
          hostPath:
            path: /usr/
            type: ''
        - name: addon-token
          secret:
            secretName: addon.csi.token
            items:
              - key: addon.token.config
                path: token-config
            defaultMode: 420
            optional: true
      dnsPolicy: ClusterFirst
      tolerations:
        - operator: Exists
      priorityClassName: system-node-critical
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 20%
      maxSurge: 0
  revisionHistoryLimit: 10
status:
  currentNumberScheduled: 15
  numberMisscheduled: 0
  desiredNumberScheduled: 15
  numberReady: 13
  observedGeneration: 5
  updatedNumberScheduled: 15
  numberAvailable: 13
  numberUnavailable: 2

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • CSI driver version (image tag of csi-plugin container):

  • Deployment method (where you got the YAML files, what modifications you made, etc.):

  • Kubernetes version (use kubectl version):
    k8s 1.26

  • Cloud provider or hardware configuration (e.g. Alibaba Cloud ECS instance type):
The cluster nodes are Alibaba Cloud ECS instances.

  • OS (e.g: cat /etc/os-release):

  • Kernel (e.g. uname -a):

  • Network plugin and version (if this is a network-related bug):

  • Others:

@lliiang lliiang added the kind/bug Categorizes issue or PR as related to a bug. label Aug 6, 2024
@huww98
Contributor

huww98 commented Aug 7, 2024

Why is your filesystem read-only? Is it intentional? What OS are you using?

@lliiang
Author

lliiang commented Aug 7, 2024

Why is your filesystem read-only? Is it intentional? What OS are you using?

My cluster is OpenShift 4.13.

The node OS is CoreOS.

Comparing logs between normal pods and abnormal pods.
[Screenshot: log comparison between a normal pod and an abnormal pod]

@huww98
Contributor

huww98 commented Aug 7, 2024

OK, maybe we should never write files into /usr, which is expected to be managed by the OS package manager.

You can try setting the env var DISABLE_CSIPLUGIN_CONNECTOR=true. Or upgrade the CSI plugin; we have limited the number of retries to 5.
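As a rough sketch (not an official procedure), that variable could be set on the DaemonSet posted above, assuming the plugin container is still named csi-plugin as in that spec:

# Sketch only: add the suggested variable to the csi-plugin container;
# the DaemonSet then rolls out restarted pods with it set.
kubectl -n kube-system set env daemonset/csi-plugin \
  --containers=csi-plugin DISABLE_CSIPLUGIN_CONNECTOR=true

The same effect can be had by adding the name/value pair to the env list of the csi-plugin container in the YAML directly.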

Comparing logs between normal pods and abnormal pods.

I think these logs come from different CSI versions.

@lliiang
Author

lliiang commented Aug 8, 2024

Hello, does csi-plugin have a debug log setting? How do I enable debug logging? I want to collect debug logs into our logging platform.

@huww98
Contributor

huww98 commented Aug 8, 2024

No. The default log level already outputs almost all the logs.
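If the goal is just to ship the existing logs to a platform, a rough sketch with standard kubectl (pod and container names taken from this issue; adjust them to your cluster):

# Sketch: collect current and previous logs from the crashing pod so they
# can be forwarded to a logging platform.
kubectl -n kube-system logs csi-plugin-h4qhz -c csi-plugin > csi-plugin.log
kubectl -n kube-system logs csi-plugin-h4qhz -c csi-plugin --previous > csi-plugin-previous.log
kubectl -n kube-system logs csi-plugin-h4qhz -c disk-driver-registrar > disk-driver-registrar.log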

@huww98
Contributor

huww98 commented Aug 8, 2024

OK, maybe we should never write files into /usr, which is expected to be managed by the OS package manager.

We decided not to fix this one, because we plan to remove the connector altogether in the future.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 6, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 6, 2024