
scale.yml race condition causing calico networking to malfunction #10928

Closed
Rickkwa opened this issue Feb 15, 2024 · 5 comments · Fixed by #10929
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@Rickkwa
Contributor

Rickkwa commented Feb 15, 2024

What happened?

When running scale.yml, we are experiencing a race condition where sometimes /opt/cni/bin/calico ends up owned by the root user and sometimes by the kube user.

Because calico sets the suid bit on this binary, when it is owned by the kube user it lacks the permissions to do everything it needs to do, and pods become unable to schedule on the node.

-rwsr-xr-x 1 kube root 59136424 Jan 18 16:21 /opt/cni/bin/calico
#  ^ suid bit

Kubelet logs will then complain with errors such as:

Jan 19 14:31:54 myhostname kubelet[3077785]: E0119 14:31:54.400547 3077785 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"KillPodSandbox\" for \"16bce6ca-50d7-48cf-86a9-6783044c43b9\" with KillPodSandboxError: \"rpc error: code = Unknown desc = failed to destroy network for sandbox \\\"854e397957fc263ee551570388b32f33a00fe808935d11800bde6f5805715b90\\\": plugin type=\\\"calico\\\" failed (delete): error loading config file \\\"/etc/cni/net.d/calico-kubeconfig\\\": open /etc/cni/net.d/calico-kubeconfig: permission denied\"" pod="jaeger/jaeger-agent-daemonset-bmvtq" podUID=16bce6ca-50d7-48cf-86a9-6783044c43b9

See "Anything else we need to know" section below for even more details and investigation.

What did you expect to happen?

Pods to be scheduled on the new node.

/opt/cni/bin/calico to be owned by root.

How can we reproduce it (as minimally and precisely as possible)?

Not exactly sure, since this is a race condition. But if you want to reproduce the failure behavior, you can run the following on a worker node:

chown kube /opt/cni/bin/calico
chmod 4755 /opt/cni/bin/calico

Then check kubelet logs while you try to do some cluster scheduling operations.

I tried adding a sleep before the point where the owner gets changed, but it doesn't quite reproduce it. There is some other factor at play, I think related to the calico-node pod startup process. I have a theory in the "Anything else we need to know" section below.

OS

Kubernetes worker:

Linux 5.18.15-1.el8.elrepo.x86_64 x86_64
NAME="Rocky Linux"
VERSION="8.5 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.5"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.5 (Green Obsidian)"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:rocky:rocky:8.5:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
ROCKY_SUPPORT_PRODUCT="Rocky Linux"
ROCKY_SUPPORT_PRODUCT_VERSION="8"

Ansible node: Alpine 3.14.2 docker container

Version of Ansible

ansible [core 2.14.14]
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /root/.local/lib/python3.9/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /root/.local/bin//ansible
  python version = 3.9.6 (default, Aug 27 2021, 23:46:59) [GCC 10.3.1 20210424] (/usr/local/bin/python3)
  jinja version = 3.1.3
  libyaml = True

Version of Python

Python 3.9.6

Version of Kubespray (commit)

3f6567b (aka v2.23.3)

Network plugin used

calico

Full inventory with variables

Vars for a worker node; scrubbed some stuff:

"addon_resizer_limits_cpu": "200m",
"addon_resizer_limits_memory": "50Mi",
"addon_resizer_requests_cpu": "100m",
"addon_resizer_requests_memory": "25Mi",
"apiserver_loadbalancer_domain_name": "test-cluster-api.test.example.com",
"argocd_apps_chart_version": "1.4.1",
"argocd_chart_version": "5.53.1",
"argocd_values_filename": "test-cluster-values.yaml",
"audit_log_maxage": 30,
"audit_log_maxbackups": 1,
"audit_log_maxsize": 100,
"audit_policy_file": "{{ kube_config_dir }}/audit-policy/apiserver-audit-policy.yaml",
"bin_dir": "/usr/local/bin",
"calico_apiserver_enabled": true,
"calico_felix_prometheusmetricsenabled": true,
"calico_ipip_mode": "Always",
"calico_iptables_backend": "Auto",
"calico_loglevel": "warning",
"calico_network_backend": "bird",
"calico_node_cpu_limit": "600m",
"calico_node_cpu_requests": "600m",
"calico_node_extra_envs": {
    "FELIX_MTUIFACEPATTERN": "^((en|wl|ww|sl|ib)[opsx].*|(eth|wlan|wwan|em|bond|p1|p2).*)"
},
"calico_node_memory_limit": "650Mi",
"calico_node_memory_requests": "650Mi",
"calico_policy_controller_cpu_limit": "300m",
"calico_policy_controller_cpu_requests": "300m",
"calico_policy_controller_memory_limit": "3000Mi",
"calico_policy_controller_memory_requests": "3000Mi",
"calico_pool_blocksize": 24,
"calico_vxlan_mode": "Never",
"cluster_name": "cluster.local",
"container_manager": "containerd",
"containerd_debug_level": "warn",
"containerd_extra_args": "[plugins.\"io.containerd.grpc.v1.cri\".registry.configs.\"regustry.example.com:5000\".auth]\n  auth = \"{{ scrubbed }}\"\n",
"containerd_registries": {
    "docker.io": "https://registry-1.docker.io"
},
"coredns_k8s_external_zone": "k8s_external.local",
"credentials_dir": "{{ inventory_dir }}/credentials",
"dashboard_enabled": false,
"default_kubelet_config_dir": "{{ kube_config_dir }}/dynamic_kubelet_dir",
"deploy_netchecker": false,
"dns_domain": "{{ cluster_name }}",
"dns_memory_limit": "250Mi",
"dns_memory_requests": "250Mi",
"dns_mode": "coredns",
"docker_image_repo": "{{ registry_host }}",
"drain_fallback_enabled": true,
"drain_fallback_grace_period": 0,
"drain_grace_period": 600,
"dynamic_kubelet_configuration": false,
"dynamic_kubelet_configuration_dir": "{{ kubelet_config_dir | default(default_kubelet_config_dir) }}",
"enable_coredns_k8s_endpoint_pod_names": false,
"enable_coredns_k8s_external": false,
"enable_ipv4_forwarding": true,
"enable_nodelocaldns": true,
"etcd_backup_retention_count": 5,
"etcd_data_dir": "/var/lib/etcd",
"etcd_deployment_type": "host",
"etcd_kubeadm_enabled": false,
"etcd_metrics": "extensive",
"event_ttl_duration": "1h0m0s",
"flush_iptables": false,
"gcr_image_repo": "{{ registry_host }}",
"github_image_repo": "{{ registry_host }}",
"group_names": [
    "k8s_cluster",
    "kube_node",
    "kubernetes_clusters"
],
"helm_deployment_type": "host",
"helm_enabled": true,
"inventory_hostname": "test-cluster-w-9.win.example3.com",
"inventory_hostname_short": "test-cluster-w-9",
"k8s_image_pull_policy": "IfNotPresent",
"kata_containers_enabled": false,
"kernel_devel_package": "kernel-ml-devel",
"kernel_headers_package": "kernel-ml-headers",
"kernel_package": "kernel-ml-5.18.15-1.el8.elrepo.x86_64",
"kernel_release": "5.18.15",
"kube_api_anonymous_auth": true,
"kube_api_pwd": "{{ lookup('password', credentials_dir + '/kube_user.creds length=15 chars=ascii_letters,digits') }}",
"kube_apiserver_insecure_port": 0,
"kube_apiserver_ip": "{{ kube_service_addresses|ipaddr('net')|ipaddr(1)|ipaddr('address') }}",
"kube_apiserver_port": 6443,
"kube_cert_dir": "{{ kube_config_dir }}/ssl",
"kube_cert_group": "kube-cert",
"kube_config_dir": "/etc/kubernetes",
"kube_encrypt_secret_data": false,
"kube_image_repo": "{{ registry_host }}",
"kube_log_level": 2,
"kube_manifest_dir": "{{ kube_config_dir }}/manifests",
"kube_network_node_prefix": 24,
"kube_network_plugin": "calico",
"kube_network_plugin_multus": false,
"kube_oidc_auth": true,
"kube_oidc_client_id": "test-cluster",
"kube_oidc_groups_claim": "groups",
"kube_oidc_url": "https://test-cluster-dex.test.example.com",
"kube_oidc_username_claim": "preferred_username",
"kube_oidc_username_prefix": "-",
"kube_pods_subnet": "10.233.64.0/18",
"kube_proxy_metrics_bind_address": "0.0.0.0:10249",
"kube_proxy_mode": "iptables",
"kube_proxy_nodeport_addresses": "{%- if kube_proxy_nodeport_addresses_cidr is defined -%} [{{ kube_proxy_nodeport_addresses_cidr }}] {%- else -%} [] {%- endif -%}",
"kube_proxy_strict_arp": false,
"kube_script_dir": "{{ bin_dir }}/kubernetes-scripts",
"kube_service_addresses": "10.233.0.0/18",
"kube_token_dir": "{{ kube_config_dir }}/tokens",
"kube_users": {
    "kube": {
        "groups": [
            "system:masters"
        ],
        "pass": "{{kube_api_pwd}}",
        "role": "admin"
    }
},
"kube_users_dir": "{{ kube_config_dir }}/users",
"kube_version": "v1.27.9",
"kubeadm_certificate_key": "{{ lookup('password', credentials_dir + '/kubeadm_certificate_key.creds length=64 chars=hexdigits') | lower }}",
"kubeadm_control_plane": false,
"kubelet_deployment_type": "host",
"kubelet_secure_addresses": "{%- for host in groups['kube_control_plane'] -%}\n  {{ hostvars[host]['ip'] | default(fallback_ips[host]) }}{{ ' ' if not loop.last else '' }}\n{%- endfor -%}",
"kubernetes_audit": true,
"loadbalancer_apiserver": {
    "address": "X.X.X.X",
    "port": 443
},
"local_release_dir": "/tmp/releases",
"metrics_server_cpu": "500m",
"metrics_server_enabled": true,
"metrics_server_limits_cpu": 1,
"metrics_server_limits_memory": "500Mi",
"metrics_server_memory": "300Mi",
"metrics_server_replicas": 2,
"metrics_server_requests_cpu": "500m",
"metrics_server_requests_memory": "300Mi",
"ndots": 2,
"nerdctl_enabled": true,
"networking_restart": false,
"node_labels": {
    "node-role.kubernetes.io/candidate-control-plane": "",
    "storage-node": "false",
    "topology.kubernetes.io/region": "XXX",
    "topology.kubernetes.io/zone": "XXX-YYY"
},
"nodelocaldns_cpu_requests": "100m",
"nodelocaldns_health_port": 9254,
"nodelocaldns_ip": "169.254.25.10",
"nodelocaldns_memory_limit": "200Mi",
"nodelocaldns_memory_requests": "200Mi",
"persistent_volumes_enabled": false,
"podsecuritypolicy_enabled": false,
"quay_image_repo": "{{ registry_host }}",
"reboot_timeout": 600,
"registry_host": "registry.example.com:5000",
"retry_stagger": 5,
"sealed_secrets_crt": "***********",
"sealed_secrets_ingress_class": "external",
"sealed_secrets_ingress_host": "test-cluster-sealed-secrets.test.example.com",
"sealed_secrets_key": "**********",
"skydns_server": "{{ kube_service_addresses|ipaddr('net')|ipaddr(3)|ipaddr('address') }}",
"skydns_server_secondary": "{{ kube_service_addresses|ipaddr('net')|ipaddr(4)|ipaddr('address') }}",
"ssl_client_cert_path": "/etc/pki/tls/certs/client.cert.pem",
"ssl_host_cert_path": "/etc/pki/tls/certs/host.cert.pem",
"ssl_host_key_path": "/etc/pki/tls/private/host.key.pem",
"ssl_root_cert_path": "/etc/pki/tls/certs/ca.cert.pem",
"upstream_dns_servers": [
    "8.8.8.8",
    "1.1.1.1"
],
"volume_cross_zone_attachment": false

Command used to invoke ansible

ansible-playbook -i /path/to/inventory/hosts.txt scale.yml -b --vault-password-file /path/to/vault/password --limit=$WORKER_NODE

Output of ansible run

I don't think it is relevant given the info I provided below.

Anything else we need to know

When the suid bit is set and the owner is kube, my understanding is that the binary will always run as the kube user. When that happens, it cannot read /etc/cni/net.d/calico-kubeconfig because of its 600 permissions.
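For reference, restoring the expected state on an already-affected node is straightforward. The sketch below is an ad-hoc repair, not part of Kubespray, and assumes that root ownership with mode 4755 (suid bit preserved) is the state the calico-node installer intends:

- name: Repair ownership and suid bit on the calico CNI binary (sketch)
  ansible.builtin.file:
    path: /opt/cni/bin/calico
    owner: root
    group: root
    mode: "4755"  # assumption: rwsr-xr-x under root is the intended state, matching the ls output above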

I believe the issue stems from this play in scale.yml, specifically these two roles: kubernetes/kubeadm and network_plugin.

- name: Target only workers to get kubelet installed and checking in on any new nodes(network)
  #...
  roles:
    #...
    - { role: kubernetes/kubeadm, tags: kubeadm }
    #...
    - { role: network_plugin, tags: network }

The kubernetes/kubeadm role issues a kubeadm join command. Then, asynchronously, the calico-node pod starts running. This pod creates the /opt/cni/bin/calico file, which doesn't yet exist.

Then, in parallel, network_plugin/cni/tasks/main.yml does a recursive owner change against all of /opt/cni/bin/ to set the kube user as the owner.
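For context, that ownership change has roughly the following shape (a simplified sketch based on the behavior described above; the actual task in network_plugin/cni/tasks/main.yml may be named and parameterized differently):

# sketch only: recursive ownership change over the CNI bin directory
- name: CNI | Set ownership of /opt/cni/bin (sketch)
  ansible.builtin.file:
    path: /opt/cni/bin
    state: directory
    owner: "{{ kube_owner }}"
    recurse: true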

There is one more factor at play, I think, because the owner change would also remove the suid bit, yet in the failure scenario I'm seeing both the suid bit AND the kube owner.

Theory:

When the binary is being created, it is first written to a temporary file (/opt/cni/bin/calico.tmp) to stage it. I'm thinking it's possible the owner change happens at this point in time, affecting the temp file. Then the file gets renamed, followed by a chmod to set the suid bit (reference). The owner stays kube. This would explain how both the suid bit and the kube owner are present at the same time.

Proposal Fix:

Would it be reasonable to allow for the /opt/cni/bin owner to be overridden? Something like owner: "{{ cni_bin_owner | default(kube_owner) }}" (or define the default in defaults/main.yml)?
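For illustration, the proposed override could look like the sketch below (the cni_bin_owner variable name comes from the suggestion above; the file paths and task shape are simplified assumptions, not the exact Kubespray code):

# roles/network_plugin/cni/defaults/main.yml (proposed default)
cni_bin_owner: "{{ kube_owner }}"

# roles/network_plugin/cni/tasks/main.yml (adjusted ownership task, sketch)
- name: CNI | Set ownership of /opt/cni/bin (sketch)
  ansible.builtin.file:
    path: /opt/cni/bin
    state: directory
    owner: "{{ cni_bin_owner }}"
    recurse: true

With that in place, clusters hitting this race could set cni_bin_owner: root so the calico binary keeps working even if the ownership change wins the race.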

Rickkwa added the kind/bug label on Feb 15, 2024
@VannTen
Contributor

VannTen commented Feb 16, 2024

Would it be reasonable to allow for the /opt/cni/bin owner to be overridden? Something like owner: "{{ cni_bin_owner | default(kube_owner) }}" (or define the default in defaults/main.yml)?

I'd rather fix the underlying problem. If that's indeed the race condition you describe, it could come back to bite us in surprising and hard-to-diagnose ways.

@Rickkwa
Contributor Author

Rickkwa commented Feb 16, 2024

I agree. Would that be in the calico plugin?

I created an XS-sized PR that would allow this, to unblock me. There is also another use case for it in #10499. I'm hoping the PR can be merged while keeping this issue open.

@lanss315425

lanss315425 commented Oct 12, 2024

Hello, I am using v2.23.3 and also encountered this issue when adding a worker node with scale.yml. The cluster is currently in production and I do not want to change the current version. Can I fix this problem by applying the code changes from #10929? Could you please let me know how to proceed? Thank you. @Rickkwa

@Rickkwa
Contributor Author

Rickkwa commented Oct 15, 2024

@lanss315425 If your issue is indeed the same as mine, then you should be able to apply the patch from my PR and then use a group_var to set cni_bin_owner: root
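For example, a group_vars entry like the following would apply it (the file path shown is just one possible location in a Kubespray inventory):

# inventory/<cluster>/group_vars/k8s_cluster/k8s-cluster.yml (any group_vars file covering the affected nodes works)
cni_bin_owner: root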

@rptaylor
Contributor

rptaylor commented Nov 25, 2024

I'd rather fix the underlying problem.

Agreed.

2/3 nodes I recently deployed with scale.yml were broken due to this. I opened #11747 to follow up with a hopefully more thorough solution, but I don't know enough about what would be involved.

I also noticed many more nodes with -rwxr-xr-x. 1 kube root 59136424 Aug 8 14:54 /opt/cni/bin/calico that did not seem to be broken, so this seems to be a milder form of the race condition (rwx instead of rws).
