scale.yml race condition causing calico networking to malfunction #10928
Comments
I'd rather fix the underlying problem. If that's indeed the race condition you describe, it could come back to bite us in surprising and hard-to-diagnose ways.
I agree. Would that be in the calico plugin? I created an XS-sized PR that would allow it to unblock me. And there is also another use case for it in #10499. I'm hoping maybe the PR can be merged, while keeping this issue open.
Hello, I am using v2.23.3 and also encountered this issue when adding a worker node using scale.yml. It is currently in production and I do not want to change the current version. Can I fix this problem by merging the code differences? Thank you, could you please let me know how to proceed? #10929 @Rickkwa
@lanss315425 If your issue is indeed the same as mine, then you should be able to apply the patch from my PR and then use a group_var to set
Agreed. 2/3 nodes I recently deployed with scale.yml were broken due to this. I opened #11747 to follow up with a hopefully more thorough solution, but I don't know enough about what would be involved. I also noticed many more nodes with
What happened?
When running `scale.yml`, we are experiencing a race condition where sometimes `/opt/cni/bin/calico` is owned by the `root` user, and sometimes it is owned by the `kube` user. Due to the suid bit set by calico, when this binary is owned by the `kube` user it lacks the permissions to do everything it needs to do, and causes pods to be unable to schedule on this node. Kubelet logs will then complain with errors such as:
See "Anything else we need to know" section below for even more details and investigation.
What did you expect to happen?
- Pods to be scheduled on the new node.
- `/opt/cni/bin/calico` to be owned by `root`.
.How can we reproduce it (as minimally and precisely as possible)?
Not exactly sure, since this is a race condition. But if you want to experience the failure behavior, you can do the following on a worker node:
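One possible set of commands (a sketch, not the author's original command, and it assumes that `kube` ownership plus the suid bit is by itself enough to trigger the failure):

```sh
# Put the calico CNI binary into the state the race condition leaves behind:
# owned by kube, with the suid bit set. chown comes first, because chown
# clears the suid bit on Linux.
sudo chown kube:kube /opt/cni/bin/calico
sudo chmod u+s /opt/cni/bin/calico
```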
Then check kubelet logs while you try to do some cluster scheduling operations.
I tried to add a sleep before where the owner gets changed, but it doesn't quite reproduce it. There is some other factor at play, I think related to the `calico-node` pod start-up process. I have a theory in the "Anything else we need to know" section below.

OS
Kubernetes worker:
Ansible node: Alpine 3.14.2 docker container
Version of Ansible
Version of Python
Python 3.9.6
Version of Kubespray (commit)
3f6567b (aka v2.23.3)
Network plugin used
calico
Full inventory with variables
Vars for a worker node; scrubbed some stuff:
Command used to invoke ansible
Output of ansible run
I don't think it is relevant, given the info I provided below.
Anything else we need to know
When the suid bit is set and the owner is `kube`, my understanding is that the binary will always run as the `kube` user. When that happens, it cannot read from `/etc/cni/net.d/calico-kubeconfig` because of its `600` permissions.
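To illustrate the failure mode (a sketch; the exact modes and output on an affected node may differ):

```sh
# The CNI binary ends up kube-owned with the suid bit set,
# while the kubeconfig it needs stays root-owned with mode 600:
ls -l /opt/cni/bin/calico
# e.g. -rwsr-xr-x 1 kube kube ... /opt/cni/bin/calico
ls -l /etc/cni/net.d/calico-kubeconfig
# e.g. -rw------- 1 root root ... /etc/cni/net.d/calico-kubeconfig

# Because of the suid bit, the plugin effectively runs as kube,
# which cannot read that file:
sudo -u kube cat /etc/cni/net.d/calico-kubeconfig
# cat: /etc/cni/net.d/calico-kubeconfig: Permission denied
```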
I believe the issue stems from `scale.yml` in this play. Specifically, these two roles: `kubernetes/kubeadm` and `network_plugin`.
The role `kubernetes/kubeadm` will issue a `kubeadm join` command. Then, asynchronously, the `calico-node` pod will start to run. This pod will create the `/opt/cni/bin/calico` file, which doesn't yet exist.

Then, in parallel, `network_plugin/cni/tasks/main.yml` will do a recursive owner change against all of `/opt/cni/bin/` to set it as the `kube` user.
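For reference, a recursive ownership task of the kind described would look roughly like this (illustrative only, not a verbatim copy of the Kubespray task):

```yaml
# If calico-node is still staging /opt/cni/bin/calico.tmp when this task runs,
# the temp file gets chowned to kube along with everything else in the directory.
- name: Make /opt/cni/bin owned by kube
  ansible.builtin.file:
    path: /opt/cni/bin
    state: directory
    owner: "{{ kube_owner }}"
    recurse: true
```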
There is one more factor at play, I think: doing the owner change should also remove the suid bit, yet in the failure scenario I'm seeing both the suid bit AND the `kube` owner.

Theory:
When the binary is in the process of creation, it is first written to a temp file (`/opt/cni/bin/calico.tmp`) to stage it. I'm thinking it's possible the owner change happens at this point in time, affecting the temp file. Then the file gets renamed, followed by a `chmod` to set the suid bit (reference). The owner stays `kube`. This would explain how both the suid bit and the `kube` owner are present at the same time.
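A rough timeline of the suspected interleaving, written as shell commands purely for illustration (the real steps are performed by the calico-node install process and the Ansible task, not by hand):

```sh
# calico-node: stage the new binary
cp calico /opt/cni/bin/calico.tmp

# Ansible (cni role): recursive owner change lands here; the staged temp file
# is chowned to kube before the suid bit has been set
chown -R kube:kube /opt/cni/bin/

# calico-node: rename the staged file, then set the suid bit
mv /opt/cni/bin/calico.tmp /opt/cni/bin/calico
chmod u+s /opt/cni/bin/calico

# Result: /opt/cni/bin/calico is owned by kube AND carries the suid bit
```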
Proposal Fix:
Would it be reasonable to allow the `/opt/cni/bin` owner to be overridden? Something like `owner: "{{ cni_bin_owner | default(kube_owner) }}"` (or define the default in `defaults/main.yml`)?
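Concretely, the change could look something like the sketch below, where `cni_bin_owner` is the proposed (not yet existing) variable:

```yaml
# defaults/main.yml of the cni role (hypothetical new default)
cni_bin_owner: "{{ kube_owner }}"
```

The ownership task in `network_plugin/cni/tasks/main.yml` would then use `owner: "{{ cni_bin_owner | default(kube_owner) }}"`, so affected clusters could set `cni_bin_owner: root` in their group_vars, as mentioned in the comments above.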