WIP: 🐛 Attempt to clean up CF IAM users #5242

Open: nrb wants to merge 1 commit into main from clean-up-cf-user

nrb
Contributor

@nrb nrb commented Dec 6, 2024

What type of PR is this?
/kind failing-test

What this PR does / why we need it:

Periodic tests seemed to get into a failure loop because an IAM user
with the same name already existed, which is not allowed. This then
failed the entire CloudFormation stack. Despite the stack claiming to
have been rolled back, the next iteration would run into the same
problem.

This change includes IAM users in the list of resources we need to
specifically delete in the case of a CloudFormation failure, just in
case they've leaked.

Special notes for your reviewer:

The periodic tests at https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-aws#periodic-e2e-release-2-7 were failing roughly every other day from Nov 23, 2024 to Dec 6, 2024.
We'd seen failures prior to that, but testgrid's history doesn't appear to go that far back.

Nearly all the failures within the capa-e2e.[SynchronizedBeforeSuite] function contained this log entry:

STEP: Event details for AWSIAMUserBootstrapper : Resource: AWS::IAM::User, Status: CREATE_FAILED, Reason: Resource handler returned message: "Resource of type 'AWS::IAM::User' with identifier 'bootstrapper.cluster-api-provider-aws.sigs.k8s.io' already exists." (RequestToken: 9149fdc5-32aa-007f-086d-d60101e23ee9, HandlerErrorCode: AlreadyExists) @ 12/05/24 15:56:04.338

Checklist:

  • includes emojis
  • adds or updates e2e tests

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 6, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from nrb. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@nrb nrb changed the title 🐛 Attempt to clean up CF IAM users WIP: 🐛 Attempt to clean up CF IAM users Dec 6, 2024
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 6, 2024
@nrb
Contributor Author

nrb commented Dec 6, 2024

/test ?

@k8s-ci-robot
Contributor

@nrb: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-aws-build
  • /test pull-cluster-api-provider-aws-build-docker
  • /test pull-cluster-api-provider-aws-test
  • /test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-aws-apidiff-main
  • /test pull-cluster-api-provider-aws-e2e
  • /test pull-cluster-api-provider-aws-e2e-blocking
  • /test pull-cluster-api-provider-aws-e2e-clusterclass
  • /test pull-cluster-api-provider-aws-e2e-conformance
  • /test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-aws-e2e-eks
  • /test pull-cluster-api-provider-aws-e2e-eks-gc
  • /test pull-cluster-api-provider-aws-e2e-eks-testing

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-aws-apidiff-main
  • pull-cluster-api-provider-aws-build
  • pull-cluster-api-provider-aws-build-docker
  • pull-cluster-api-provider-aws-test
  • pull-cluster-api-provider-aws-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@nrb
Contributor Author

nrb commented Dec 6, 2024

/test pull-cluster-api-provider-aws-e2e

@nrb nrb force-pushed the clean-up-cf-user branch from a3bedc8 to d77ca9d Compare December 6, 2024 22:37
@damdo
Member

damdo commented Dec 7, 2024

/test pull-cluster-api-provider-aws-e2e

@nrb
Contributor Author

nrb commented Dec 8, 2024

Probably needs to be rebased onto #5240

Periodic tests seemed to get into a failure loop because an IAM user
with the same name already existed, which is not allowed. This then
failed the entire CloudFormation stack. Despite the stack claiming to
have been rolled back, the next iteration would run into the same
problem.

This change includes IAM users in the list of resources we need to
specifically delete in the case of a CloudFormation failure, just in
case they've leaked.

Signed-off-by: Nolan Brubaker <[email protected]>
@nrb nrb force-pushed the clean-up-cf-user branch from d77ca9d to 5a34a13 Compare December 9, 2024 13:57
@nrb
Contributor Author

nrb commented Dec 9, 2024

/test pull-cluster-api-provider-aws-e2e

@nrb
Contributor Author

nrb commented Dec 9, 2024

/test pull-cluster-api-provider-aws-test

VPC limit was reached for this test.

@k8s-ci-robot
Contributor

@nrb: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cluster-api-provider-aws-e2e | 5a34a13 | link | false | /test pull-cluster-api-provider-aws-e2e |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


@nrb nrb mentioned this pull request Dec 11, 2024
1 task
@richardcase
Member

This looks good to me. It also potentially points to a failure in aws-janitor, as that should clean up an AWS account that has a failed test.

@nrb
Contributor Author

nrb commented Dec 12, 2024

@richardcase Yeah, I asked about the janitor on Slack. The IAM code only looks at roles and instance profiles (https://github.com/kubernetes-sigs/boskos/tree/master/aws-janitor/resources).

I suspect that multiple periodics are using CF at the same time and stepping on each other. With your account logging PR, we can double-check that in the future.

@AndiDog
Contributor

AndiDog commented Dec 31, 2024

I'm running into the same issue regularly with PR E2E tests, thanks for fixing this!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 31, 2024
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

5 participants