Upgrading Ztunnel to 1.24 with cni.ambient.dnsCapture=true
causes DNS failures for workloads programmed by CNI 1.23
#1360
Comments
Thanks for the report. Do you happen to know what the first logs looked like? It could be too many open files -> DNS issues, or DNS issues -> too many retries -> too many open files. Either way, it is a bit odd that a Ztunnel restart did not resolve the issue. Am I correct in understanding that this only occurred when upgrading from 1.23 to 1.24, and that after restarting the nodes cleanly on 1.24 there are no issues?
Might be worth running |
The issue is #1282 / istio/istio#52867. In that, I said:
This is not quite right, since the CNI will not reconcile the iptables. So it's really "upgrade CNI, restart all workloads, then upgrade ztunnel", which is not great.
Just to be very explicit: in the short term, the fix here is to restart your workloads, and the issue will resolve. You can prevent the issue from happening in the first place by restarting your workloads between upgrading CNI and Ztunnel. We are exploring some fixes that could drop this requirement in an upcoming patch release.
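For anyone scripting this, here is a rough sketch of the ordering described above (CNI, then workload restarts, then Ztunnel). It only builds and prints the commands; it does not run them. The release names (`istio-cni`, `ztunnel`), chart references (`istio/cni`, `istio/ztunnel`), the `istio-system` namespace, and the workload namespaces are assumptions about a typical Helm install, so adjust them to your setup. It also only restarts Deployments; DaemonSets and StatefulSets would need the same treatment.

```python
import shlex
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Step:
    description: str
    command: List[str]


def build_upgrade_plan(
    version: str,
    istio_namespace: str = "istio-system",          # assumption: default install namespace
    workload_namespaces: Sequence[str] = ("default",),  # hypothetical: your ambient namespaces
) -> List[Step]:
    """Return the upgrade steps in the order suggested in this thread."""
    steps = [
        Step(
            "Upgrade the CNI first so new pods get the new iptables rules",
            ["helm", "upgrade", "istio-cni", "istio/cni",
             "-n", istio_namespace, "--version", version, "--wait"],
        )
    ]
    # Restart workloads so the upgraded CNI re-programs their rules,
    # since the CNI does not reconcile iptables for existing pods.
    for ns in workload_namespaces:
        steps.append(
            Step(
                f"Restart Deployments in namespace {ns}",
                ["kubectl", "rollout", "restart", "deployment", "-n", ns],
            )
        )
    # Only upgrade Ztunnel once every workload has been restarted.
    steps.append(
        Step(
            "Upgrade Ztunnel last",
            ["helm", "upgrade", "ztunnel", "istio/ztunnel",
             "-n", istio_namespace, "--version", version, "--wait"],
        )
    )
    return steps


if __name__ == "__main__":
    # Dry run: print the plan instead of executing it.
    for i, step in enumerate(build_upgrade_plan("1.24.0"), start=1):
        print(f"{i}. {step.description}")
        print(f"   {shlex.join(step.command)}")
```

Deploying both charts together from an umbrella chart skips the middle step, which is the situation reported in this issue.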
Sounds reasonable. I deployed CNI and Ztunnel in parallel using an umbrella chart. I just did another run, doing one update at a time as described above, and everything went smoothly. BTW, the ambient upgrade guide does list the steps in the right order, but the description of the Ztunnel step only states that the control plane must be updated first. Do you still require any further information, like the order of the log messages? It seems to me the problem has been identified. Just for the record:
Nope, this one we understand fully, and we are working on some fixes so that it does not require careful upgrade sequencing. FWIW, it was also a one-time transition, so an upgrade from 1.24 to 1.25, for example, wouldn't have issues if no changes were made.
After upgrading to Istio 1.24, the ztunnel on some (or all) nodes had problems after startup and never became ready.
The logs reported tens of thousands of errors like these.
It's hard to say whether all nodes were affected, because all Karpenter node pools were relabeled, causing nodes to be re-created in order to restart all pods and force an injection of sidecars.
Restarting the ztunnel pod didn't solve the problem either. Only terminating the node solved the issue.
Problems with too many open files have not been an issue before.
EKS: 1.31
Nodes: Bottlerocket 1.26.1
istio: 1.23.3 -> 1.24.0