EBS CSI Driver issue causing kubetest2 failures - IMDS metadata and Kubernetes metadata are both unavailable #1061

Open
mmerkes opened this issue Nov 25, 2024 · 6 comments
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@mmerkes
Contributor

mmerkes commented Nov 25, 2024

Which jobs are failing:

pull-cloud-provider-aws-e2e-kubetest2-quick
pull-cloud-provider-aws-e2e-kubetest2

Which test(s) are failing:
BeforeSuite is failing because CPI nodes aren't stabilizing.

Since when has it been failing:
This one passed on 10/31.

This one failed on 11/6. So the failures started sometime between these two runs.

Testgrid link:

  1. First seen failure
  2. Failed 11/25

Reason for failure:

EBS CSI pod is not stabilizing:

2024-11-25T18:30:42.52251214Z stderr F I1125 18:30:42.522404       1 main.go:157] "Initializing metadata"
2024-11-25T18:30:47.523520821Z stderr F E1125 18:30:47.523424       1 metadata.go:51] "Retrieving IMDS metadata failed, falling back to Kubernetes metadata" err="could not get EC2 instance identity metadata: operation error ec2imds: GetInstanceIdentityDocument, canceled, context deadline exceeded"
2024-11-25T18:30:47.530862069Z stderr F E1125 18:30:47.530760       1 metadata.go:58] "Retrieving Kubernetes metadata failed" err="could not retrieve instance type from topology label"
2024-11-25T18:30:47.530928736Z stderr F E1125 18:30:47.530882       1 main.go:162] "Failed to initialize metadata when it is required" err="IMDS metadata and Kubernetes metadata are both unavailable"

Anything else we need to know:

/kind failing-test

@k8s-ci-robot k8s-ci-robot added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 25, 2024
@mmerkes
Contributor Author

mmerkes commented Nov 25, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 25, 2024
@dims
Member

dims commented Nov 25, 2024

cc @ConnorJC3 @torredil

@mmerkes
Contributor Author

mmerkes commented Nov 25, 2024

Not sure if they're related to each other, but also see this error in kubelet:

Nov 25 18:34:03 ip-172-31-24-156 kubelet[6298]: E1125 18:34:03.425509 6298 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"aws-cloud-controller-manager\" with ImagePullBackOff: \"Back-off pulling image \\\"209411653980.dkr.ecr.us-east-1.amazonaws.com/provider-aws/cloud-controller-manager:v1.30.0-beta.0-110-gac63fea\\\": ErrImagePull: rpc error: code = NotFound desc = failed to pull and unpack image \\\"209411653980.dkr.ecr.us-east-1.amazonaws.com/provider-aws/cloud-controller-manager:v1.30.0-beta.0-110-gac63fea\\\": failed to resolve reference \\\"209411653980.dkr.ecr.us-east-1.amazonaws.com/provider-aws/cloud-controller-manager:v1.30.0-beta.0-110-gac63fea\\\": 209411653980.dkr.ecr.us-east-1.amazonaws.com/provider-aws/cloud-controller-manager:v1.30.0-beta.0-110-gac63fea: not found\"" pod="kube-system/aws-cloud-controller-manager-cq6m2" podUID="b6d43d27-1967-414e-86f8-72b3e9375664"

@ConnorJC3

> Not sure if they're related to each other, but also see this error in kubelet:

Very likely related - as I believe it is the AWS CCM that adds the labels we rely on for metadata to the node.
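
One way to check this on an affected node is to compare its labels against the ones the fallback would need. The sketch below assumes the relevant keys are the well-known Kubernetes labels `node.kubernetes.io/instance-type`, `topology.kubernetes.io/region`, and `topology.kubernetes.io/zone`; exactly which labels the driver reads is an assumption here, and `requiredLabels` and `missingLabels` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// requiredLabels lists well-known node labels that a cloud controller
// manager typically sets. Treating these as the driver's exact inputs is
// an assumption made for this sketch.
var requiredLabels = []string{
	"node.kubernetes.io/instance-type",
	"topology.kubernetes.io/region",
	"topology.kubernetes.io/zone",
}

// missingLabels returns the required label keys that are absent or empty
// on the given node, in sorted order.
func missingLabels(nodeLabels map[string]string) []string {
	var missing []string
	for _, key := range requiredLabels {
		if nodeLabels[key] == "" {
			missing = append(missing, key)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// A node that has only the label kubelet sets on its own, as might be
	// the case when the cloud controller manager never started.
	uninitialized := map[string]string{
		"kubernetes.io/hostname": "ip-172-31-24-156",
	}
	fmt.Println(missingLabels(uninitialized))
}
```

The label map itself could be copied from the output of `kubectl get node <name> --show-labels`.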

@mmerkes
Contributor Author

mmerkes commented Nov 25, 2024

> Very likely related - as I believe it is the AWS CCM that adds the labels we rely on for metadata to the node.

Sounds right. Looks like that's a red herring.

@lavalex

lavalex commented Dec 18, 2024

I'm getting this error on OpenShift. Any ideas how to solve it? Thanks.
