Node sync errors when using a custom (domain-name empty) AWS DHCP Option Set for the VPC #384
Comments
We lack some documentation in this area, but if you enable Resource-based naming on your instances, custom DHCP options will work. If you use IP-based naming, CCM will expect the FQDN of IP-based hostnames.
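For anyone looking for the knob, Resource-based naming can be switched on per instance with the AWS CLI, roughly as sketched below. The instance ID is a placeholder, and the change may only take effect for stopped or newly launched instances; check the current CLI reference for details.

```sh
# Sketch: switch an existing instance to Resource-based naming (RBN).
# Placeholder instance ID; the hostname change may only apply to a stopped
# instance or to instances launched after the change.
aws ec2 modify-private-dns-name-options \
  --instance-id i-0123456789abcdef0 \
  --private-dns-hostname-type resource-name \
  --enable-resource-name-dns-a-record
```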
IP-based naming is the default and the most commonly used across AWS end users. I feel like we should try to fix this rather than provide a workaround that some may call a breaking change. Certainly in some managed software offerings based on Kubernetes I would expect that workaround to be considered an unacceptable change, as node naming must be consistent for a node during its own lifetime, but also within and among the other nodes in the cluster. Do we know if there has been any previous discussion about this bug that led to a won't-fix decision, or can we open the floor for ideas on how to fix this?
CCM has always required the FQDN for IP-based node names, with the domain being the regional default, so no change there. But for RBN, we have decided to relax this. Note that node names will remain consistent for the lifetime of the individual node. Having different conventions while transitioning is entirely graceful; kOps does this in periodically running e2es. What the default is depends on the installer. There are not that many installers using external CCM yet, but kOps does, and kOps also transitions to RBN as part of it.
/triage accepted
- What I did
I added an AWS-specific systemd unit (aws-kubelet-providerid.service) and script (/usr/local/bin/aws-kubelet-providerid) that generate the AWS instance provider ID (stored in the KUBELET_PROVIDERID env var) so it can be passed as the --provider-id argument to the kubelet service binary. We needed to add this flag, and make it non-empty only on AWS, so that node syncing (specifically backing-instance detection) works via provider-id lookup, covering cases where the node hostname doesn't match the expected private-dns-name (e.g. when a custom DHCP Option Set with an empty domain-name is used). Should fix: https://bugzilla.redhat.com/show_bug.cgi?id=2084450 Reference to an upstream issue with context: kubernetes/cloud-provider-aws#384
- How to verify it
Try the reproduction steps available at https://bugzilla.redhat.com/show_bug.cgi?id=2084450#c0 while launching a cluster with this MCO PR included, and verify that the issue is no longer reproducible.
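For readers who want the gist without opening the PR, a minimal sketch of such a provider-id generator is shown below. The environment-file path and the use of IMDSv2 are illustrative assumptions for this sketch, not the exact MCO change; the real unit and script live in the linked PR.

```sh
#!/usr/bin/env bash
# Illustrative sketch of a provider-id generator (not the exact MCO script).
# Builds the AWS provider ID, aws:///<availability-zone>/<instance-id>, from the
# EC2 instance metadata service and writes it to an env file for the kubelet
# unit to consume as --provider-id=${KUBELET_PROVIDERID}.
set -euo pipefail

IMDS=http://169.254.169.254/latest
TOKEN=$(curl -sf -X PUT "$IMDS/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
AZ=$(curl -sf -H "X-aws-ec2-metadata-token: $TOKEN" "$IMDS/meta-data/placement/availability-zone")
INSTANCE_ID=$(curl -sf -H "X-aws-ec2-metadata-token: $TOKEN" "$IMDS/meta-data/instance-id")

# Destination path is an assumption for this sketch.
echo "KUBELET_PROVIDERID=aws:///${AZ}/${INSTANCE_ID}" > /etc/kubernetes/kubelet-providerid
```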
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
This issue has not been updated in over 1 year, and should be re-triaged. You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/remove-lifecycle rotten
/lifecycle frozen
What happened:
When a custom AWS DHCP Option Set with an empty domain-name is assigned to the cluster VPC and a node joins the cluster shortly after, node syncing in the cloud provider's node-controller fails, and the node, after briefly appearing, is deleted shortly after.
What you expected to happen:
The node syncing should succeed, as the instance backing the node should be found by the node-controller.
How to reproduce it (as minimally and precisely as possible):
1. Create a custom DHCP Option Set that sets domain-name-servers but leaves domain-name empty:
aws ec2 create-dhcp-options --dhcp-configurations '[{"Key":"domain-name-servers","Values":["AmazonProvidedDNS"]}]'
2. Associate the option set with the cluster VPC (see the command sketch below) and have a new node join the cluster.
3. Watch the node briefly appear and then get deleted:
kubectl get nodes -w
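For step 2, associating the option set with the VPC can be done with the CLI; the IDs below are placeholders for the one returned by create-dhcp-options and your cluster's VPC:

```sh
# Associate the custom DHCP options with the cluster VPC (placeholder IDs).
aws ec2 associate-dhcp-options \
  --dhcp-options-id dopt-0123456789abcdef0 \
  --vpc-id vpc-0123456789abcdef0
```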
Anything else we need to know?:
After a bit of digging, it turns out this boils down to how the nodeName is computed in the kubelet vs. the assumptions we make in the cloud provider.
The kubelet computes the nodeName by invoking getNodeName(), which in turn behaves in different ways depending on whether an in-tree or an external provider is used. In more detail, when --cloud-provider=external is set on the kubelet, cloud will be nil and the hostname will be used as the value for nodeName.
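A quick way to confirm which value the kubelet ended up registering (a rough check, not the kubelet code path itself) is to compare the kernel hostname on the affected instance with the node object's name:

```sh
# On the affected instance: the kernel hostname the kubelet falls back to
# when --cloud-provider=external is set and no override is given.
hostname
# From a workstation: the registered node names; on such a node they match
# the short hostname rather than the EC2 private-dns-name.
kubectl get nodes -o custom-columns=NAME:.metadata.name
```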
The AWS cloud provider, when syncing the Node in the node-controller, tries to find the instance backing the node by describing all instances and filtering out the one whose private-dns-name matches the nodeName (which in this case is the hostname). This works when the hostname has the same value as the private-dns-name, but doesn't in cases where they differ. For example, when a node is created with the custom DHCP Option Set previously described, the hostname will be of the form ip-10-0-144-157, as opposed to its private-dns-name, which will be of the form ip-10-0-144-157.ec2.internal.
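The mismatch is easy to see with a rough CLI analogue of the lookup the controller relies on (not its exact code path; the names are from the example above):

```sh
# Looking an instance up by private-dns-name, as the node-controller effectively does:
# the short hostname matches nothing...
aws ec2 describe-instances \
  --filters "Name=private-dns-name,Values=ip-10-0-144-157" \
  --query 'Reservations[].Instances[].InstanceId'
# ...while the FQDN form finds the backing instance.
aws ec2 describe-instances \
  --filters "Name=private-dns-name,Values=ip-10-0-144-157.ec2.internal" \
  --query 'Reservations[].Instances[].InstanceId'
```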
Environment:
- Kubernetes version (use kubectl version): v1.23.3+69213f8
- v1.23.0
- OS: Red Hat Enterprise Linux CoreOS 411.85.202205101201-0 (Ootpa)
- Kernel (e.g. uname -a): 4.18.0-348.23.1.el8_5.x86_64
/kind bug