
Upgrading Ztunnel to 1.24 with cni.ambient.dnsCapture=true causes DNS failures for workloads programmed by CNI 1.23 #1360

Open
joke opened this issue Nov 8, 2024 · 6 comments · May be fixed by istio/istio#53906
@joke

joke commented Nov 8, 2024

After upgrading to Istio 1.24, the ztunnel on some (or all) nodes had problems after startup and never became ready.
The logs reported tens of thousands of errors like the ones below.

It's hard to say whether all nodes were affected, because all Karpenter node pools had been relabeled, causing the nodes to be re-created in order to restart all pods and force re-injection.

{"level":"warn","time":"2024-11-08T08:32:16.370762Z","scope":"hickory_proto::udp::udp_client_stream","message":"expected message id: 16243 got: 15069, dropped"}
{"level":"warn","time":"2024-11-08T08:32:16.371049Z","scope":"hickory_proto::udp::udp_client_stream","message":"expected message id: 55666 got: 43184, dropped"}
{"level":"warn","time":"2024-11-08T08:32:16.371061Z","scope":"hickory_proto::udp::udp_client_stream","message":"expected message id: 47010 got: 27434, dropped"}
{"level":"warn","time":"2024-11-08T08:32:16.371069Z","scope":"hickory_proto::udp::udp_client_stream","message":"expected message id: 59852 got: 17994, dropped"}
{"level":"warn","time":"2024-11-08T08:32:16.371086Z","scope":"hickory_proto::udp::udp_client_stream","message":"expected message id: 38329 got: 27832, dropped"}
{"level":"warn","time":"2024-11-08T08:32:16.371402Z","scope":"hickory_proto::udp::udp_client_stream","message":"expected message id: 63660 got: 7533, dropped"}
{"level":"warn","time":"2024-11-08T08:32:16.372715Z","scope":"hickory_proto::udp::udp_client_stream","message":"expected message id: 41538 got: 33375, dropped"}
{"level":"error","time":"2024-11-08T08:32:16.372955Z","scope":"ztunnel::proxy::outbound","message":"Failed TCP handshake Too many open files (os error 24)"}
{"level":"error","time":"2024-11-08T08:32:16.372966Z","scope":"ztunnel::proxy::outbound","message":"Failed TCP handshake Too many open files (os error 24)"}
{"level":"error","time":"2024-11-08T08:32:16.372968Z","scope":"ztunnel::proxy::outbound","message":"Failed TCP handshake Too many open files (os error 24)"}
{"level":"error","time":"2024-11-08T08:32:16.372970Z","scope":"ztunnel::proxy::outbound","message":"Failed TCP handshake Too many open files (os error 24)"}
{"level":"error","time":"2024-11-08T08:32:16.372971Z","scope":"ztunnel::proxy::outbound","message":"Failed TCP handshake Too many open files (os error 24)"}
{"level":"error","time":"2024-11-08T08:32:16.372973Z","scope":"ztunnel::proxy::outbound","message":"Failed TCP handshake Too many open files (os error 24)"}

Restarting the ztunnel pod didn't solve the problem either; only terminating the node did.
Problems with too many open files had not been an issue before.

EKS: 1.31
Nodes: Bottlerocket 1.26.1
istio: 1.23.3 -> 1.24.0

@howardjohn
Member

Thanks for the report. Do you happen to know whether the first logs were like expected message id: 16243 got: 15069, dropped or Too many open files? They are clearly related, but I'm curious whether there is any indication of which one is the original cause.

It could be too many open files -> DNS issues, or DNS issues -> too many retries -> too many open files.

Either way, it is a bit odd that a Ztunnel restart did not resolve the issue.


Am I correct in understanding that this only occurred when upgrading 1.23 to 1.24, and that after restarting the nodes cleanly on 1.24 there are no issues?

@bleggett
Contributor

bleggett commented Nov 8, 2024

Might be worth running sysctl fs.file-nr on the affected nodes.

@howardjohn
Member

The issue is #1282 / istio/istio#52867.

In that, I said:

The supported upgrade path is CNI first, then Ztunnel.

CNI w/ this patch, Ztunnel 1.23: TCP will start redirecting, which is
already supported by Ztunnel. UDP change does nothing

CNI + Ztunnel patched (with #1282):
DNS requests come from the application pod. UDP packets are marked, so
they do not loop due to the CNI change here.

This is not quite right, since the CNI will not reconcile the iptables rules. So it's really "CNI, restart all workloads, then upgrade Ztunnel", which is not great.

@howardjohn howardjohn changed the title ztunnel 1.24 connection problems during startup Upgrading Ztunnel to 1.24 with cni.ambient.dnsCapture=true causes DNS failures for workloads programmed by CNI 1.23 Nov 8, 2024
@howardjohn
Member

Just to be very explicit - in the short term, the fix here is to restart your workloads and the issue will resolve. You can prevent the issue from happening in the first place by restarting your workloads between upgrading CNI and Ztunnel.

We are exploring some fixes that could drop this requirement in an upcoming patch release.
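The recommended sequencing above can be sketched as a rough runbook. This is only a sketch: the Helm release names, chart references, versions, and the example namespace below are assumptions, and the rollout-restart targets will vary by cluster.

```shell
# 1. Upgrade the CNI (after the control plane) first, keeping Ztunnel on the old version.
#    (Release/chart names here are illustrative, not canonical.)
helm upgrade istio-cni istio/cni -n istio-system --version 1.24.0

# 2. Restart workloads so the upgraded CNI reprograms their redirection rules.
#    Repeat for each namespace containing ambient-enrolled workloads.
kubectl rollout restart deployment -n my-app-namespace

# 3. Only then upgrade Ztunnel.
helm upgrade ztunnel istio/ztunnel -n istio-system --version 1.24.0
```

Doing the CNI and Ztunnel upgrades in parallel (e.g. via an umbrella chart) skips step 2 and can trigger the DNS failures described in this issue.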

@joke
Author

joke commented Nov 11, 2024

Sounds reasonable. I did the CNI and Ztunnel deployments in parallel using an umbrella chart.

I just did another run, doing one update at a time as described above, and everything went smoothly.

BTW, the ambient upgrade guide does list the steps in the right order, but the description of the Ztunnel step only states that the control plane must be updated first.

Do you still require any further information like the order of the log messages? Seems to me the problem has been identified.

Just for the record:

bash-5.1# sysctl fs.file-nr
fs.file-nr = 17568      0       92233720368547758
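For reference, fs.file-nr reports three fields: allocated file handles, allocated-but-unused handles, and the system-wide maximum (fs.file-max). A small sketch parsing output like the above (the function name is mine, not part of any tool):

```python
def parse_file_nr(line: str) -> dict:
    """Parse the output line of `sysctl fs.file-nr`.

    Fields: allocated file handles, allocated-but-unused handles,
    and the system-wide maximum (fs.file-max).
    """
    _, _, value = line.partition("=")
    allocated, unused, maximum = (int(f) for f in value.split())
    return {"allocated": allocated, "unused": unused, "max": maximum}

stats = parse_file_nr("fs.file-nr = 17568      0       92233720368547758")
# The allocated count here (17568) is system-wide, so a per-process
# RLIMIT_NOFILE on ztunnel can still be exhausted well before fs.file-max.
```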

@howardjohn
Member

Nope, this one we understand fully, and we are working on some fixes so it won't require careful upgrade sequencing. FWIW it was also a one-time transition, so an upgrade from 1.24 to 1.25, for example, wouldn't have issues if no changes were made.
