Upgrading Ztunnel to 1.24 with `cni.ambient.dnsCapture=true` causes DNS failures for workloads programmed by CNI 1.23 #1360

joke · 2024-11-08T09:01:24Z

After upgrading to Istio 1.24, the ztunnel on some (or all) nodes had problems after startup and never became ready.

It's hard to say whether all nodes were affected, because all Karpenter nodepools had been relabeled, causing the nodes to be re-created in order to restart all pods and force sidecar injection.

The logs reported tens of thousands of errors like these:

{"level":"warn","time":"2024-11-08T08:32:16.370762Z","scope":"hickory_proto::udp::udp_client_stream","message":"expected message id: 16243 got: 15069, dropped"}
{"level":"warn","time":"2024-11-08T08:32:16.371049Z","scope":"hickory_proto::udp::udp_client_stream","message":"expected message id: 55666 got: 43184, dropped"}
{"level":"warn","time":"2024-11-08T08:32:16.371061Z","scope":"hickory_proto::udp::udp_client_stream","message":"expected message id: 47010 got: 27434, dropped"}
{"level":"warn","time":"2024-11-08T08:32:16.371069Z","scope":"hickory_proto::udp::udp_client_stream","message":"expected message id: 59852 got: 17994, dropped"}
{"level":"warn","time":"2024-11-08T08:32:16.371086Z","scope":"hickory_proto::udp::udp_client_stream","message":"expected message id: 38329 got: 27832, dropped"}
{"level":"warn","time":"2024-11-08T08:32:16.371402Z","scope":"hickory_proto::udp::udp_client_stream","message":"expected message id: 63660 got: 7533, dropped"}
{"level":"warn","time":"2024-11-08T08:32:16.372715Z","scope":"hickory_proto::udp::udp_client_stream","message":"expected message id: 41538 got: 33375, dropped"}
{"level":"error","time":"2024-11-08T08:32:16.372955Z","scope":"ztunnel::proxy::outbound","message":"Failed TCP handshake Too many open files (os error 24)"}
{"level":"error","time":"2024-11-08T08:32:16.372966Z","scope":"ztunnel::proxy::outbound","message":"Failed TCP handshake Too many open files (os error 24)"}
{"level":"error","time":"2024-11-08T08:32:16.372968Z","scope":"ztunnel::proxy::outbound","message":"Failed TCP handshake Too many open files (os error 24)"}
{"level":"error","time":"2024-11-08T08:32:16.372970Z","scope":"ztunnel::proxy::outbound","message":"Failed TCP handshake Too many open files (os error 24)"}
{"level":"error","time":"2024-11-08T08:32:16.372971Z","scope":"ztunnel::proxy::outbound","message":"Failed TCP handshake Too many open files (os error 24)"}
{"level":"error","time":"2024-11-08T08:32:16.372973Z","scope":"ztunnel::proxy::outbound","message":"Failed TCP handshake Too many open files (os error 24)"}

Restarting the ztunnel pod didn't solve the problem either. Only terminating the node solved the issue.
Too many open files has not been an issue before.

EKS: 1.31
Nodes: Bottlerocket 1.26.1
istio: 1.23.3 -> 1.24.0


howardjohn · 2024-11-08T16:43:27Z

Thanks for the report. Do you happen to know if the first logs were like `expected message id: 16243 got: 15069, dropped` or `Too many open files`? Clearly they are related, but I'm curious whether there is any indication of which one is the original cause.

It could be too many open files -> DNS issues, or DNS issues --> too many retries --> too many files.

Either way it is a bit odd Ztunnel restart did not resolve the issue.

Am I correct in understanding that this only occurred when upgrading from 1.23 to 1.24, and that after restarting the nodes cleanly on 1.24 there are no issues?
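One way to answer the ordering question is to scan the ztunnel JSON logs for the earliest occurrence of each symptom. The following is a minimal sketch, not part of ztunnel or Istio tooling: the function name `first_seen` and the symptom labels are hypothetical, and it assumes every log line uses the same UTC timestamp format as the excerpt in the report, so that timestamps compare correctly as strings.

```python
import json

# Substrings copied from the log excerpt in the report.
DNS_MARKER = "expected message id"
FD_MARKER = "Too many open files"

# Hypothetical labels for the two symptoms.
SYMPTOMS = (("dns_id_mismatch", DNS_MARKER), ("fd_exhaustion", FD_MARKER))


def first_seen(lines):
    """Return the earliest timestamp at which each symptom appears."""
    first = {}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON log records
        if not isinstance(entry, dict):
            continue
        message = entry.get("message", "")
        ts = entry.get("time", "")
        for name, marker in SYMPTOMS:
            # Assumes fixed-width UTC timestamps, so string order is time order.
            if marker in message and (name not in first or ts < first[name]):
                first[name] = ts
    return first
```

Feeding it the saved output of `kubectl logs` for an affected ztunnel pod (for example via `first_seen(open("ztunnel.log"))`, where the file name is a placeholder) gives one timestamp per symptom, and the smaller of the two points at the symptom that appeared first.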

bleggett · 2024-11-08T21:24:46Z

Might be worth running `sysctl fs.file-nr` on the affected nodes.

howardjohn · 2024-11-08T21:58:09Z

The issue is #1282 / istio/istio#52867.

In that, I said:

> The supported upgrade path is CNI first, then Ztunnel.
>
> CNI w/ this patch, Ztunnel 1.23: TCP will start redirecting, which is already supported by Ztunnel. UDP change does nothing
>
> CNI + Ztunnel patched (with #1282): DNS requests come from the application pod. UDP packets are marked, so they do not loop due to the CNI change here.

This is not quite right, since the CNI will not reconcile the iptables rules of pods that are already running. So it's really "CNI, restart all workloads, then upgrade Ztunnel", which is not great.

howardjohn · 2024-11-08T22:37:05Z

Just to be very explicit - in the short term, the fix here is to restart your workloads and the issue will resolve. You can prevent the issue from happening in the first place by restarting your workloads between upgrading CNI and Ztunnel.

We are exploring some fixes that could drop this requirement in an upcoming patch release.
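The sequencing described above (CNI, then restart workloads, then Ztunnel) can be written down as an ordered plan. This is a rough sketch that only builds the command lists and runs nothing. The release names `istio-cni` and `ztunnel`, the `istio-system` namespace, and the function name `upgrade_plan` are assumptions about the installation; chart values such as `cni.ambient.dnsCapture` are deliberately left out and would need to be supplied the same way as in the existing deployment.

```python
def upgrade_plan(workload_namespaces, version="1.24.0", system_ns="istio-system"):
    """Build the ordered commands: CNI, restart workloads, then Ztunnel."""
    plan = []

    # Step 1: upgrade the CNI node agent first (release name is an assumption).
    plan.append(["helm", "upgrade", "istio-cni", "istio/cni",
                 "-n", system_ns, "--version", version, "--wait"])

    # Step 2: restart workloads so the new CNI reprograms their iptables rules.
    # Only Deployments are covered here; DaemonSets and StatefulSets in the
    # mesh would need the same treatment.
    for ns in workload_namespaces:
        plan.append(["kubectl", "rollout", "restart", "deployment", "-n", ns])

    # Step 3: upgrade Ztunnel only after the workloads have been restarted.
    plan.append(["helm", "upgrade", "ztunnel", "istio/ztunnel",
                 "-n", system_ns, "--version", version, "--wait"])
    return plan


if __name__ == "__main__":
    for command in upgrade_plan(["default"]):
        print(" ".join(command))
```

Keeping the two Helm upgrades as separate steps is the point of the sketch: a single rollout that upgrades both charts at once leaves no window in which to restart the workloads.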

joke · 2024-11-11T08:58:17Z

Sounds reasonable. I did the CNI and Ztunnel deployments in parallel using an umbrella chart.

I just did another run, doing one update at a time as described above, and everything went smoothly.

BTW, the ambient upgrade guide does list the steps in the right order, but the description for Ztunnel only states that the control plane must be updated first.

Do you still require any further information like the order of the log messages? Seems to me the problem has been identified.

Just for the record:

bash-5.1# sysctl fs.file-nr
fs.file-nr = 17568      0       92233720368547758
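For readers unfamiliar with the output above: `fs.file-nr` reports three system-wide numbers, namely allocated file handles, allocated but unused handles, and the maximum. A small sketch for reading it follows; the names `FileNr` and `parse_file_nr` are made up for illustration. Note that `os error 24` is `EMFILE` on Linux, the per-process descriptor limit, which this system-wide counter does not report, so a low value here does not rule out ztunnel having hit its own limit.

```python
from typing import NamedTuple


class FileNr(NamedTuple):
    allocated: int  # file handles currently allocated system-wide
    unused: int     # allocated but unused handles
    maximum: int    # system-wide maximum (fs.file-max)


def parse_file_nr(text):
    """Parse `sysctl fs.file-nr` output or the raw /proc/sys/fs/file-nr content."""
    # Drop the "fs.file-nr =" prefix if present, then split on whitespace.
    value = text.split("=", 1)[-1]
    allocated, unused, maximum = (int(field) for field in value.split())
    return FileNr(allocated, unused, maximum)


if __name__ == "__main__":
    nr = parse_file_nr("fs.file-nr = 17568      0       92233720368547758")
    print(nr, f"usage={nr.allocated / nr.maximum:.2e}")
```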

howardjohn · 2024-11-11T13:41:07Z

Nope, this one we understand fully, and we are working on some fixes so that it no longer requires careful upgrade sequencing. FWIW, it was also a one-time transition, so an upgrade from 1.24 to 1.25, for example, wouldn't have issues if no changes were made.

howardjohn changed the title ~~ztunnel 1.24 connection problems during startup~~ Upgrading Ztunnel to 1.24 with cni.ambient.dnsCapture=true causes DNS failures for workloads programmed by CNI 1.23 Nov 8, 2024

howardjohn assigned bleggett Nov 11, 2024

howardjohn mentioned this issue Nov 11, 2024: 1.24.1 tracking issue istio/istio#53855 (Closed, 24 tasks)

This was referenced Nov 14, 2024:
Idempotency and reconciliation for cni iptables istio/istio#53153 (Merged)
Ambient cni node agent: Reconcile pod iptables rules on startup istio/istio#53906 (Open)

keithmattix mentioned this issue Nov 18, 2024: 1.24.2 Tracking Issue istio/istio#53934 (Open, 10 tasks)
