
scale.yml race condition causing calico networking to malfunction #10928

Closed
Rickkwa opened this issue Feb 15, 2024 · 5 comments · Fixed by #10929
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@Rickkwa
Contributor

Rickkwa commented Feb 15, 2024

What happened?

When running scale.yml, we are experiencing a race condition where sometimes /opt/cni/bin/calico ends up owned by the root user and sometimes by the kube user.

Because calico sets the suid bit on this binary, when it is owned by the kube user it lacks the permissions to do everything it needs to do, and pods become unable to schedule on the node.

-rwsr-xr-x 1 kube root 59136424 Jan 18 16:21 /opt/cni/bin/calico
#  ^ suid bit

Kubelet logs will then complain with errors such as:

Jan 19 14:31:54 myhostname kubelet[3077785]: E0119 14:31:54.400547 3077785 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"KillPodSandbox\" for \"16bce6ca-50d7-48cf-86a9-6783044c43b9\" with KillPodSandboxError: \"rpc error: code = Unknown desc = failed to destroy network for sandbox \\\"854e397957fc263ee551570388b32f33a00fe808935d11800bde6f5805715b90\\\": plugin type=\\\"calico\\\" failed (delete): error loading config file \\\"/etc/cni/net.d/calico-kubeconfig\\\": open /etc/cni/net.d/calico-kubeconfig: permission denied\"" pod="jaeger/jaeger-agent-daemonset-bmvtq" podUID=16bce6ca-50d7-48cf-86a9-6783044c43b9

See "Anything else we need to know" section below for even more details and investigation.

What did you expect to happen?

Pods to be scheduled on the new node.

/opt/cni/bin/calico to be owned by root.

How can we reproduce it (as minimally and precisely as possible)?

Not exactly sure, since this is a race condition. But if you want to reproduce the failure behavior, you can run the following on a worker node:

chown kube /opt/cni/bin/calico
chmod 4755 /opt/cni/bin/calico

Then check kubelet logs while you try to do some cluster scheduling operations.

I tried adding a sleep before the point where the owner gets changed, but it doesn't quite reproduce it. There is some other factor at play, I think related to the calico-node pod startup process. I have a theory in the "Anything else we need to know" section below.

OS

Kubernetes worker:

Linux 5.18.15-1.el8.elrepo.x86_64 x86_64
NAME="Rocky Linux"
VERSION="8.5 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.5"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.5 (Green Obsidian)"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:rocky:rocky:8.5:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
ROCKY_SUPPORT_PRODUCT="Rocky Linux"
ROCKY_SUPPORT_PRODUCT_VERSION="8"

Ansible node: Alpine 3.14.2 docker container

Version of Ansible

ansible [core 2.14.14]
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /root/.local/lib/python3.9/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /root/.local/bin//ansible
  python version = 3.9.6 (default, Aug 27 2021, 23:46:59) [GCC 10.3.1 20210424] (/usr/local/bin/python3)
  jinja version = 3.1.3
  libyaml = True

Version of Python

Python 3.9.6

Version of Kubespray (commit)

3f6567b (aka v2.23.3)

Network plugin used

calico

Full inventory with variables

Vars for a worker node; scrubbed some stuff:

"addon_resizer_limits_cpu": "200m",
"addon_resizer_limits_memory": "50Mi",
"addon_resizer_requests_cpu": "100m",
"addon_resizer_requests_memory": "25Mi",
"apiserver_loadbalancer_domain_name": "test-cluster-api.test.example.com",
"argocd_apps_chart_version": "1.4.1",
"argocd_chart_version": "5.53.1",
"argocd_values_filename": "test-cluster-values.yaml",
"audit_log_maxage": 30,
"audit_log_maxbackups": 1,
"audit_log_maxsize": 100,
"audit_policy_file": "{{ kube_config_dir }}/audit-policy/apiserver-audit-policy.yaml",
"bin_dir": "/usr/local/bin",
"calico_apiserver_enabled": true,
"calico_felix_prometheusmetricsenabled": true,
"calico_ipip_mode": "Always",
"calico_iptables_backend": "Auto",
"calico_loglevel": "warning",
"calico_network_backend": "bird",
"calico_node_cpu_limit": "600m",
"calico_node_cpu_requests": "600m",
"calico_node_extra_envs": {
    "FELIX_MTUIFACEPATTERN": "^((en|wl|ww|sl|ib)[opsx].*|(eth|wlan|wwan|em|bond|p1|p2).*)"
},
"calico_node_memory_limit": "650Mi",
"calico_node_memory_requests": "650Mi",
"calico_policy_controller_cpu_limit": "300m",
"calico_policy_controller_cpu_requests": "300m",
"calico_policy_controller_memory_limit": "3000Mi",
"calico_policy_controller_memory_requests": "3000Mi",
"calico_pool_blocksize": 24,
"calico_vxlan_mode": "Never",
"cluster_name": "cluster.local",
"container_manager": "containerd",
"containerd_debug_level": "warn",
"containerd_extra_args": "[plugins.\"io.containerd.grpc.v1.cri\".registry.configs.\"regustry.example.com:5000\".auth]\n  auth = \"{{ scrubbed }}\"\n",
"containerd_registries": {
    "docker.io": "https://registry-1.docker.io"
},
"coredns_k8s_external_zone": "k8s_external.local",
"credentials_dir": "{{ inventory_dir }}/credentials",
"dashboard_enabled": false,
"default_kubelet_config_dir": "{{ kube_config_dir }}/dynamic_kubelet_dir",
"deploy_netchecker": false,
"dns_domain": "{{ cluster_name }}",
"dns_memory_limit": "250Mi",
"dns_memory_requests": "250Mi",
"dns_mode": "coredns",
"docker_image_repo": "{{ registry_host }}",
"drain_fallback_enabled": true,
"drain_fallback_grace_period": 0,
"drain_grace_period": 600,
"dynamic_kubelet_configuration": false,
"dynamic_kubelet_configuration_dir": "{{ kubelet_config_dir | default(default_kubelet_config_dir) }}",
"enable_coredns_k8s_endpoint_pod_names": false,
"enable_coredns_k8s_external": false,
"enable_ipv4_forwarding": true,
"enable_nodelocaldns": true,
"etcd_backup_retention_count": 5,
"etcd_data_dir": "/var/lib/etcd",
"etcd_deployment_type": "host",
"etcd_kubeadm_enabled": false,
"etcd_metrics": "extensive",
"event_ttl_duration": "1h0m0s",
"flush_iptables": false,
"gcr_image_repo": "{{ registry_host }}",
"github_image_repo": "{{ registry_host }}",
"group_names": [
    "k8s_cluster",
    "kube_node",
    "kubernetes_clusters"
],
"helm_deployment_type": "host",
"helm_enabled": true,
"inventory_hostname": "test-cluster-w-9.win.example3.com",
"inventory_hostname_short": "test-cluster-w-9",
"k8s_image_pull_policy": "IfNotPresent",
"kata_containers_enabled": false,
"kernel_devel_package": "kernel-ml-devel",
"kernel_headers_package": "kernel-ml-headers",
"kernel_package": "kernel-ml-5.18.15-1.el8.elrepo.x86_64",
"kernel_release": "5.18.15",
"kube_api_anonymous_auth": true,
"kube_api_pwd": "{{ lookup('password', credentials_dir + '/kube_user.creds length=15 chars=ascii_letters,digits') }}",
"kube_apiserver_insecure_port": 0,
"kube_apiserver_ip": "{{ kube_service_addresses|ipaddr('net')|ipaddr(1)|ipaddr('address') }}",
"kube_apiserver_port": 6443,
"kube_cert_dir": "{{ kube_config_dir }}/ssl",
"kube_cert_group": "kube-cert",
"kube_config_dir": "/etc/kubernetes",
"kube_encrypt_secret_data": false,
"kube_image_repo": "{{ registry_host }}",
"kube_log_level": 2,
"kube_manifest_dir": "{{ kube_config_dir }}/manifests",
"kube_network_node_prefix": 24,
"kube_network_plugin": "calico",
"kube_network_plugin_multus": false,
"kube_oidc_auth": true,
"kube_oidc_client_id": "test-cluster",
"kube_oidc_groups_claim": "groups",
"kube_oidc_url": "https://test-cluster-dex.test.example.com",
"kube_oidc_username_claim": "preferred_username",
"kube_oidc_username_prefix": "-",
"kube_pods_subnet": "10.233.64.0/18",
"kube_proxy_metrics_bind_address": "0.0.0.0:10249",
"kube_proxy_mode": "iptables",
"kube_proxy_nodeport_addresses": "{%- if kube_proxy_nodeport_addresses_cidr is defined -%} [{{ kube_proxy_nodeport_addresses_cidr }}] {%- else -%} [] {%- endif -%}",
"kube_proxy_strict_arp": false,
"kube_script_dir": "{{ bin_dir }}/kubernetes-scripts",
"kube_service_addresses": "10.233.0.0/18",
"kube_token_dir": "{{ kube_config_dir }}/tokens",
"kube_users": {
    "kube": {
        "groups": [
            "system:masters"
        ],
        "pass": "{{kube_api_pwd}}",
        "role": "admin"
    }
},
"kube_users_dir": "{{ kube_config_dir }}/users",
"kube_version": "v1.27.9",
"kubeadm_certificate_key": "{{ lookup('password', credentials_dir + '/kubeadm_certificate_key.creds length=64 chars=hexdigits') | lower }}",
"kubeadm_control_plane": false,
"kubelet_deployment_type": "host",
"kubelet_secure_addresses": "{%- for host in groups['kube_control_plane'] -%}\n  {{ hostvars[host]['ip'] | default(fallback_ips[host]) }}{{ ' ' if not loop.last else '' }}\n{%- endfor -%}",
"kubernetes_audit": true,
"loadbalancer_apiserver": {
    "address": "X.X.X.X",
    "port": 443
},
"local_release_dir": "/tmp/releases",
"metrics_server_cpu": "500m",
"metrics_server_enabled": true,
"metrics_server_limits_cpu": 1,
"metrics_server_limits_memory": "500Mi",
"metrics_server_memory": "300Mi",
"metrics_server_replicas": 2,
"metrics_server_requests_cpu": "500m",
"metrics_server_requests_memory": "300Mi",
"ndots": 2,
"nerdctl_enabled": true,
"networking_restart": false,
"node_labels": {
    "node-role.kubernetes.io/candidate-control-plane": "",
    "storage-node": "false",
    "topology.kubernetes.io/region": "XXX",
    "topology.kubernetes.io/zone": "XXX-YYY"
},
"nodelocaldns_cpu_requests": "100m",
"nodelocaldns_health_port": 9254,
"nodelocaldns_ip": "169.254.25.10",
"nodelocaldns_memory_limit": "200Mi",
"nodelocaldns_memory_requests": "200Mi",
"persistent_volumes_enabled": false,
"podsecuritypolicy_enabled": false,
"quay_image_repo": "{{ registry_host }}",
"reboot_timeout": 600,
"registry_host": "registry.example.com:5000",
"retry_stagger": 5,
"sealed_secrets_crt": "***********",
"sealed_secrets_ingress_class": "external",
"sealed_secrets_ingress_host": "test-cluster-sealed-secrets.test.example.com",
"sealed_secrets_key": "**********",
"skydns_server": "{{ kube_service_addresses|ipaddr('net')|ipaddr(3)|ipaddr('address') }}",
"skydns_server_secondary": "{{ kube_service_addresses|ipaddr('net')|ipaddr(4)|ipaddr('address') }}",
"ssl_client_cert_path": "/etc/pki/tls/certs/client.cert.pem",
"ssl_host_cert_path": "/etc/pki/tls/certs/host.cert.pem",
"ssl_host_key_path": "/etc/pki/tls/private/host.key.pem",
"ssl_root_cert_path": "/etc/pki/tls/certs/ca.cert.pem",
"upstream_dns_servers": [
    "8.8.8.8",
    "1.1.1.1"
],
"volume_cross_zone_attachment": false

Command used to invoke ansible

ansible-playbook -i /path/to/inventory/hosts.txt scale.yml -b --vault-password-file /path/to/vault/password --limit=$WORKER_NODE

Output of ansible run

I don't think it is relevant given the info I provided below.

Anything else we need to know

When the suid bit is set and the owner is kube, my understanding is that the binary will always run as the kube user. When that happens, it cannot read /etc/cni/net.d/calico-kubeconfig because of its 600 permissions.
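For reference, restoring the expected state on an already-affected node is straightforward. The sketch below is an ad-hoc repair, not part of Kubespray, and assumes that root ownership with mode 4755 (suid bit preserved) is the state the calico-node installer intends:

- name: Repair ownership and suid bit on the calico CNI binary (sketch)
  ansible.builtin.file:
    path: /opt/cni/bin/calico
    owner: root
    group: root
    mode: "4755"  # assumption: rwsr-xr-x under root is the intended state, matching the ls output above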

I believe the issue stems from this play in scale.yml, specifically these two roles: kubernetes/kubeadm and network_plugin.

- name: Target only workers to get kubelet installed and checking in on any new nodes(network)
  #...
  roles:
    #...
    - { role: kubernetes/kubeadm, tags: kubeadm }
    #...
    - { role: network_plugin, tags: network }

The kubernetes/kubeadm role issues a kubeadm join command. Then, asynchronously, the calico-node pod starts running. This pod creates the /opt/cni/bin/calico file, which doesn't yet exist.

Then, in parallel, network_plugin/cni/tasks/main.yml does a recursive owner change against all of /opt/cni/bin/ to set the kube user as the owner.
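For context, that ownership change has roughly the following shape (a simplified sketch based on the behavior described above; the actual task in network_plugin/cni/tasks/main.yml may be named and parameterized differently):

# sketch only: recursive ownership change over the CNI bin directory
- name: CNI | Set ownership of /opt/cni/bin (sketch)
  ansible.builtin.file:
    path: /opt/cni/bin
    state: directory
    owner: "{{ kube_owner }}"
    recurse: true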

There is one more factor at play, I think, because the owner change would also remove the suid bit, yet in the failure scenario I'm seeing both the suid bit AND the kube owner.

Theory:

When the binary is being created, it is first written to a temporary file (/opt/cni/bin/calico.tmp) to stage it. I'm thinking it's possible the owner change happens at this point in time, affecting the temp file. Then the file gets renamed, followed by a chmod to set the suid bit (reference). The owner stays kube. This would explain how both the suid bit and the kube owner are present at the same time.

Proposal Fix:

Would it be reasonable to allow for the /opt/cni/bin owner to be overridden? Something like owner: "{{ cni_bin_owner | default(kube_owner) }}" (or define the default in defaults/main.yml)?
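For illustration, the proposed override could look like the sketch below (the cni_bin_owner variable name comes from the suggestion above; the file paths and task shape are simplified assumptions, not the exact Kubespray code):

# roles/network_plugin/cni/defaults/main.yml (proposed default)
cni_bin_owner: "{{ kube_owner }}"

# roles/network_plugin/cni/tasks/main.yml (adjusted ownership task, sketch)
- name: CNI | Set ownership of /opt/cni/bin (sketch)
  ansible.builtin.file:
    path: /opt/cni/bin
    state: directory
    owner: "{{ cni_bin_owner }}"
    recurse: true

With that in place, clusters hitting this race could set cni_bin_owner: root so the calico binary keeps working even if the ownership change wins the race.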

Rickkwa added the kind/bug label on Feb 15, 2024
@VannTen
Contributor

VannTen commented Feb 16, 2024

Would it be reasonable to allow for the /opt/cni/bin owner to be overridden? Something like owner: "{{ cni_bin_owner | default(kube_owner) }}" (or define the default in defaults/main.yml)?

I'd rather fix the underlying problem. If that's indeed the race condition you describe, it could come back to bite us in surprising and hard-to-diagnose ways.

@Rickkwa
Contributor Author

Rickkwa commented Feb 16, 2024

I agree. Would that be in the calico plugin?

I created an XS-sized PR that would allow this, to unblock me. There is also another use case for it in #10499. I'm hoping the PR can be merged while keeping this issue open.

@lanss315425

lanss315425 commented Oct 12, 2024

Hello, I am using v2.23.3 and also encountered this issue when adding a worker node with scale.yml. The cluster is currently in production and I do not want to change the current version. Can I fix this problem by applying the code changes from #10929? Could you please let me know how to proceed? Thank you. @Rickkwa

@Rickkwa
Contributor Author

Rickkwa commented Oct 15, 2024

@lanss315425 If your issue is indeed the same as mine, then you should be able to apply the patch from my PR and then use a group_var to set cni_bin_owner: root
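For example, a group_vars entry like the following would apply it (the file path shown is just one possible location in a Kubespray inventory):

# inventory/<cluster>/group_vars/k8s_cluster/k8s-cluster.yml (any group_vars file covering the affected nodes works)
cni_bin_owner: root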

@rptaylor
Contributor

rptaylor commented Nov 25, 2024

I'd rather fix the underlying problem.

Agreed.

2/3 nodes I recently deployed with scale.yml were broken due to this. I opened #11747 to follow up with a hopefully more thorough solution, but I don't know enough about what would be involved.

I also noticed many more nodes with -rwxr-xr-x. 1 kube root 59136424 Aug 8 14:54 /opt/cni/bin/calico that did not seem to be broken, so this seems to be a milder form of the race condition (rwx instead of rws).
