Running linkerd-proxy as a native sidecar fails for some argo workflow pods #13349

bwmetcalf · 2024-11-19T17:28:31Z

What is the issue?

We are injecting linkerd-proxy in our argo workflows as a native sidecar using the annotation

config.alpha.linkerd.io/proxy-enable-native-sidecar: true
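For context, here is a minimal sketch of where this annotation might sit on an Argo Workflow. It assumes the annotations are applied through `spec.podMetadata` and that injection is enabled per pod rather than at the namespace level. The workflow name, template, and image are hypothetical placeholders, not taken from the report.

```yaml
# Hypothetical Workflow for illustration; only the two linkerd annotations are real keys.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: example-          # hypothetical name
spec:
  entrypoint: main
  podMetadata:
    annotations:
      linkerd.io/inject: enabled  # assumption: injection is requested per pod
      # Kubernetes annotation values must be strings, so the value is quoted.
      config.alpha.linkerd.io/proxy-enable-native-sidecar: "true"
  templates:
    - name: main
      container:
        image: busybox            # placeholder image
        command: ["sh", "-c", "echo hello"]
```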

One of our workflows spins up multiple pods, each containing multiple steps. The first two or three pods work fine, but in one of the pods linkerd-proxy exits with code 137, and the last events for that pod are:

 Normal   Killing    37m   kubelet            Stopping container linkerd-proxy
 Warning  Unhealthy  37m   kubelet            Readiness probe failed: Get "http://10.3.175.1:4191/ready": dial tcp 10.3.175.1:4191: connect: connection refused

This causes the Argo server to mark the step as failed, which fails the entire workflow. All preceding pods in the workflow have only

 Normal   Killing    37m   kubelet            Stopping container linkerd-proxy

as their last event. It seems that, for whatever reason, this particular pod hits a race condition in which the health probes are still running while the proxy container is shutting down.
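To make the suspected race easier to picture, the fragment below sketches the general shape of a native sidecar in the pod spec. It is an abbreviated illustration, not the exact output of the Linkerd injector: probe timings, ports, environment, and the other init containers are omitted. The probe path and port are taken from the event above. An exit code of 137 typically corresponds to SIGKILL (128 + 9).

```yaml
# Abbreviated sketch of a pod with linkerd-proxy as a native sidecar.
spec:
  initContainers:
    - name: linkerd-proxy
      image: cr.l5d.io/linkerd/proxy:edge-24.11.3
      restartPolicy: Always   # this field is what makes an init container a native sidecar
      readinessProbe:         # the probe that failed with "connection refused"
        httpGet:
          path: /ready
          port: 4191
  containers:
    - name: main              # Argo's step container; the kubelet stops sidecars after regular containers exit
      image: busybox          # placeholder image
```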

Are there corresponding parameters that should be tweaked when injecting linkerd-proxy as a native sidecar?

How can it be reproduced?

This isn't clear. I don't yet have a test case as these are fairly complex workflows.

Logs, error output, etc

See above.

output of `linkerd check -o short`

% linkerd check -o short
linkerd-version
---------------
‼ cli is up-to-date
    unsupported version channel: stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 24.11.3 but the latest edge version is 24.11.4
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running edge-24.11.3 but cli running stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-5ddc58f9bc-5x9nh (edge-24.11.3)
	* linkerd-destination-5ddc58f9bc-7gkdk (edge-24.11.3)
	* linkerd-destination-5ddc58f9bc-9c99t (edge-24.11.3)
	* linkerd-destination-5ddc58f9bc-brbh5 (edge-24.11.3)
	* linkerd-destination-5ddc58f9bc-ffmdx (edge-24.11.3)
	* linkerd-identity-85fb8c4b5f-c6l7m (edge-24.11.3)
	* linkerd-identity-85fb8c4b5f-ctr4h (edge-24.11.3)
	* linkerd-identity-85fb8c4b5f-jhp8q (edge-24.11.3)
	* linkerd-identity-85fb8c4b5f-nzx8w (edge-24.11.3)
	* linkerd-identity-85fb8c4b5f-vfmkc (edge-24.11.3)
	* linkerd-proxy-injector-5497b8cb97-fw85c (edge-24.11.3)
	* linkerd-proxy-injector-5497b8cb97-g22xn (edge-24.11.3)
	* linkerd-proxy-injector-5497b8cb97-g2m2v (edge-24.11.3)
	* linkerd-proxy-injector-5497b8cb97-gjfwv (edge-24.11.3)
	* linkerd-proxy-injector-5497b8cb97-jwrnl (edge-24.11.3)
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-5ddc58f9bc-5x9nh running edge-24.11.3 but cli running stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-cli-version for hints

linkerd-ha-checks
-----------------
‼ pod injection disabled on kube-system
    kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if injector webhook failure policy is Fail
    see https://linkerd.io/2.14/checks/#l5d-injection-disabled for hints

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* metrics-api-5789bcc5d-2zdck (edge-24.11.3)
	* prometheus-9c78c7f55-7q88p (edge-24.11.3)
	* tap-6688cddf94-st2jc (edge-24.11.3)
	* tap-injector-85b47576fc-9k222 (edge-24.11.3)
	* web-8c5b96b6-s7ggv (edge-24.11.3)
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    metrics-api-5789bcc5d-2zdck running edge-24.11.3 but cli running stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √

Environment

Server Version: v1.29.8-eks-a737599

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

fullykubed · 2024-11-21T22:18:08Z

@bwmetcalf I believe you are suffering from this issue: Panfactum/stack#164

bwmetcalf · 2024-11-23T19:01:09Z

> @bwmetcalf I believe you are suffering from this issue: Panfactum/stack#164

Thanks. I'll give it a try and report back. For now, we are not injecting the proxy as a native sidecar and are instead using the following annotation:

      workflows.argoproj.io/kill-cmd-linkerd-proxy: '["/usr/lib/linkerd/linkerd-await","sleep","1","--shutdown"]'

This is working, but it seems like injecting mesh proxies as native sidecars is preferable.
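As a rough sketch of the workaround in context: Argo's `workflows.argoproj.io/kill-cmd-<container name>` annotation tells the executor which command to run inside a sidecar to stop it once the main container finishes. The fragment assumes the annotations are set through `spec.podMetadata` and that the proxy is injected as a regular (non-native) sidecar. The placement is an assumption, since the report shows only the annotation itself.

```yaml
# Workflow fragment; placement under podMetadata is an assumption.
spec:
  podMetadata:
    annotations:
      linkerd.io/inject: enabled  # assumption: injection is requested per pod
      # Command Argo runs in the linkerd-proxy container to shut it down after the step completes.
      workflows.argoproj.io/kill-cmd-linkerd-proxy: '["/usr/lib/linkerd/linkerd-await","sleep","1","--shutdown"]'
```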

bwmetcalf added the bug label Nov 19, 2024
