🐛 ec2/byoip: fix EIP leak when creating machine #5039

Merged

Conversation

mtulio
Contributor

@mtulio mtulio commented Jun 27, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

By default, the instance creation flow assigns an Amazon-provided EIP to the instance even when the BYO IP flow is set. The BYO IPv4 flow allocates and associates its EIP after the instance is created, so this fix skips the default assignment when a BYO IPv4 pool is defined for the machine, which avoids paying for an additional (Amazon-provided) EIP.

Furthermore, the fix adds checks to prevent duplicated EIPs in the BYO IP reconciliation loop. The checks cover two cases: running the EIP association repeatedly while the EIP is already associated, and failures in the log from running the EIP association prematurely, when the instance isn't ready (e.g. the EC2 instance is in the pending state).
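
For readers skimming the thread, the behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `machineSpec`, `requestAmazonProvidedIP`, `nextEIPAction`, and the pool ID are hypothetical names invented for the example, and treating "running" as the only ready state is an assumption.

```go
package main

import "fmt"

// machineSpec is a hypothetical, trimmed-down stand-in for the machine
// settings that matter here; it is not the provider's real API type.
type machineSpec struct {
	PublicIP       bool   // the machine should get a public IPv4 address
	PublicIPv4Pool string // BYO IPv4 pool ID; empty when no pool is configured
}

// requestAmazonProvidedIP reports whether the launch request should ask AWS
// for an Amazon-provided public address. With a BYO pool configured, the
// address is associated after launch, so requesting one here would leave an
// extra, billable address behind.
func requestAmazonProvidedIP(s machineSpec) bool {
	return s.PublicIP && s.PublicIPv4Pool == ""
}

// eipAction is the next step the BYO IP reconciliation should take.
type eipAction string

const (
	actionNone      eipAction = "none"      // nothing to do
	actionWait      eipAction = "wait"      // instance not ready yet; try again later
	actionAssociate eipAction = "associate" // allocate from the pool and associate
)

// nextEIPAction decides the next step from the instance state and from
// whether a pool address is already associated with the instance.
// Assumption: "running" is the only state in which association is attempted.
func nextEIPAction(s machineSpec, instanceState string, alreadyAssociated bool) eipAction {
	switch {
	case !s.PublicIP || s.PublicIPv4Pool == "":
		return actionNone // BYO IP flow is not in use
	case alreadyAssociated:
		return actionNone // avoid allocating a duplicate address
	case instanceState != "running":
		return actionWait // e.g. the instance is still pending
	default:
		return actionAssociate
	}
}

func main() {
	byo := machineSpec{PublicIP: true, PublicIPv4Pool: "ipv4pool-ec2-example"}
	fmt.Println(requestAmazonProvidedIP(byo))         // false
	fmt.Println(nextEIPAction(byo, "pending", false)) // wait
	fmt.Println(nextEIPAction(byo, "running", false)) // associate
	fmt.Println(nextEIPAction(byo, "running", true))  // none
}
```

Keeping the decision in a pure function like this makes the three cases (skip, wait, associate) easy to unit test without calling AWS.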

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5038

Special notes for your reviewer:

Checklist:

  • squashed commits
  • includes documentation (N/A)
  • includes emojis
  • adds unit tests
  • adds or updates e2e tests

Release note:

fix duplicated/leaked EIP when using BYO IPv4 on Machines.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 27, 2024
@k8s-ci-robot
Contributor

Hi @mtulio. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 27, 2024
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jun 27, 2024
@nrb
Contributor

nrb commented Jun 27, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 27, 2024
pkg/cloud/services/ec2/eip.go Outdated
pkg/cloud/services/ec2/eip.go Outdated
pkg/cloud/services/ec2/instances.go
pkg/cloud/services/ec2/eip.go Outdated
pkg/cloud/services/ec2/eip.go Outdated
@mtulio mtulio force-pushed the OCPBUGS-36293-fix-byoip-eip branch from b58e3e0 to ace1bee Compare June 28, 2024 03:30
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 28, 2024
@mtulio mtulio force-pushed the OCPBUGS-36293-fix-byoip-eip branch from ace1bee to d5882fa Compare June 28, 2024 04:26
@mtulio
Contributor Author

mtulio commented Jun 28, 2024

/test ?

@k8s-ci-robot
Contributor

@mtulio: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-aws-build
  • /test pull-cluster-api-provider-aws-build-docker
  • /test pull-cluster-api-provider-aws-test
  • /test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-aws-apidiff-main
  • /test pull-cluster-api-provider-aws-e2e
  • /test pull-cluster-api-provider-aws-e2e-blocking
  • /test pull-cluster-api-provider-aws-e2e-clusterclass
  • /test pull-cluster-api-provider-aws-e2e-conformance
  • /test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-aws-e2e-eks
  • /test pull-cluster-api-provider-aws-e2e-eks-gc
  • /test pull-cluster-api-provider-aws-e2e-eks-testing

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-aws-apidiff-main
  • pull-cluster-api-provider-aws-build
  • pull-cluster-api-provider-aws-build-docker
  • pull-cluster-api-provider-aws-test
  • pull-cluster-api-provider-aws-verify

In response to this:

/test ?


@mtulio
Contributor Author

mtulio commented Jun 28, 2024

/test pull-cluster-api-provider-aws-e2e

@mtulio
Contributor Author

mtulio commented Jun 28, 2024

Premature failure.

/test pull-cluster-api-provider-aws-e2e

@mtulio
Contributor Author

mtulio commented Jul 3, 2024

/test pull-cluster-api-provider-aws-e2e

@mtulio
Contributor Author

mtulio commented Jul 4, 2024

/test pull-cluster-api-provider-aws-e2e

Okay, the previous test failures were flakes. The latest run passes. The OpenShift e2e BYOIP test is also passing install:

time="2024-06-28T19:16:32Z" level=debug msg="E0628 19:16:32.239704     331 
awsmachine_controller.go:544] \"failed to reconcile BYO Public IPv4\" 
err=\"unable to reconcile Elastic IP Pool to instance \\\"i-05af21b09d3552d0f\\\" with state: pending\""

[...]

time="2024-06-28T19:16:50Z" level=debug msg="I0628 19:16:50.432073     331 eip.go:44] 
\"machine is already associated with an Elastic IP with custom Public IPv4 pool\" 
controller=\"awsmachine\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" 
controllerKind=\"AWSMachine\" AWSMachine=\"openshift-cluster-api-guests/ci-op-f04mmlsn-88881-49rgt-bootstrap\" namespace=\"openshift-cluster-api-guests\" 
name=\"ci-op-f04mmlsn-88881-49rgt-bootstrap\" 
reconcileID=\"825b335b-569b-4a16-891d-c6b7fb5a9db6\" 
machine=\"openshift-cluster-api-guests/ci-op-f04mmlsn-88881-49rgt-bootstrap\" 
cluster=\"openshift-cluster-api-guests/ci-op-f04mmlsn-88881-49rgt\" 
eip=\"eipalloc-0678f0da2c1ecd771\" eip-address=\"157.254.217.22\" 
eip-associationID=\"eipassoc-0ea51b9c68cdfb171\" eip-instance=\"i-05af21b09d3552d0f\""

This PR is ready for review. PTAL?
/assign @Ankitasw @dlipovetsky
cc @r4f4 @nrb

/test pull-cluster-api-provider-aws-e2e-eks

@mtulio
Contributor Author

mtulio commented Jul 5, 2024

e2e EKS test is passing.

I am holding this PR to address @r4f4's feedback: reduce the warnings in the code by silently re-queuing on expected states.

/hold
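
The "silent re-queuing" idea mentioned in this comment could look roughly like the sketch below. It is only an illustration under stated assumptions: `checkInstanceReady`, `errUnexpectedState`, and the 15-second `requeueDelay` are hypothetical and not taken from the PR, and a real controller would return its framework's result type rather than a bare duration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUnexpectedState marks instance states in which an Elastic IP can no
// longer be associated (hypothetical sentinel error for this sketch).
var errUnexpectedState = errors.New("instance cannot receive an Elastic IP in its current state")

// requeueDelay is an illustrative value, not taken from the PR.
const requeueDelay = 15 * time.Second

// checkInstanceReady separates expected transient states from real failures.
// A pending instance yields a requeue delay and a nil error, so the caller
// can retry later without logging an error; any other non-running state is
// reported as an error.
func checkInstanceReady(state string) (ready bool, requeueAfter time.Duration, err error) {
	switch state {
	case "running":
		return true, 0, nil
	case "pending":
		// Expected while the instance boots: retry quietly.
		return false, requeueDelay, nil
	default:
		return false, 0, fmt.Errorf("%w: %q", errUnexpectedState, state)
	}
}

func main() {
	for _, state := range []string{"pending", "running", "terminated"} {
		ready, after, err := checkInstanceReady(state)
		fmt.Printf("state=%s ready=%t requeueAfter=%s err=%v\n", state, ready, after, err)
	}
}
```

The point of the split is that only unexpected states surface in the logs, while the pending case becomes an ordinary retry.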

@mtulio
Contributor Author

mtulio commented Aug 31, 2024

Hi @nrb, would you mind reviewing this bug fix, please?

As you can see in the last comments, we are struggling to run the job pull-cluster-api-provider-aws-e2e, but e2e-eks and e2e-conformance are passing (I'm not sure whether they all exercise this change under the hood, but since it changes the Machine creation flow, I think they should).

Furthermore, downstream in OpenShift we are running several presubmit jobs (the Public IPv4 pool is the default for AWS jobs) on the PR openshift/installer#8676 (which vendors this PR). I also introduced a new presubmit (openshift/release#56114) that disables the pool in CAPA provisioning to test the non-pool flow, and it is all passing: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/8676/pull-ci-openshift-installer-master-e2e-aws-ovn-public-ipv4-pool-disabled/1829621227333881856

I will trigger the job again, but I'm also open to hearing your thoughts on another job to validate this, or whether the ones presented are enough.

Looking forward to hearing from you, thanks!

cc @r4f4 @patrickdillon

/test pull-cluster-api-provider-aws-e2e

@mtulio
Contributor Author

mtulio commented Sep 2, 2024

/assign @nrb

@mtulio
Contributor Author

mtulio commented Sep 2, 2024

/test pull-cluster-api-provider-aws-e2e

@mtulio
Contributor Author

mtulio commented Sep 3, 2024

tl;dr: it looks like pull-cluster-api-provider-aws-e2e is failing for reasons unrelated to the change in this PR: it is trying to create a cluster with an AMI that does not exist in the test account.

After some investigation with @nrb, we see the e2e job always failing the test [It] [unmanaged] [Cluster API Framework] Clusterctl Upgrade Spec [from latest v1beta1 release to v1beta2] Should create a management cluster and then upgrade all the providers.

The Control Plane spec shows that the AMI for Kubernetes 1.25 isn't available:

  - lastTransitionTime: "2024-09-02T15:00:37Z"
    message: 'failed to create AWSMachine instance: failed to find ami: found no AMIs
      with the name: "capa-ami-ubuntu-18.04-?1.25.0-*"'

Looks like the CAPI e2e[1] is setting the following variable:

# INIT_WITH_KUBERNETES_VERSION are only used by the clusterctl upgrade test to initialize
# the management cluster to be upgraded.
INIT_WITH_KUBERNETES_VERSION: "v1.25.0"

in the test spec:

ginkgo.Describe("Clusterctl Upgrade Spec [from latest v1beta1 release to v1beta2]", func() {
	ginkgo.BeforeEach(func() {
		// As the resources cannot be defined by the It() clause in CAPI tests, using the largest values required for all It() tests in this CAPI test.
		requiredResources = &shared.TestResource{EC2Normal: 5 * e2eCtx.Settings.InstanceVCPU, IGW: 2, NGW: 2, VPC: 2, ClassicLB: 2, EIP: 2, EventBridgeRules: 50}
		requiredResources.WriteRequestedResources(e2eCtx, "capi-clusterctl-upgrade-test-v1beta1-to-v1beta2")
		Expect(shared.AcquireResources(requiredResources, ginkgo.GinkgoParallelProcess(), flock.New(shared.ResourceQuotaFilePath))).To(Succeed())
	})
	capi_e2e.ClusterctlUpgradeSpec(ctx, func() capi_e2e.ClusterctlUpgradeSpecInput {
		return capi_e2e.ClusterctlUpgradeSpecInput{
			E2EConfig:                 e2eCtx.E2EConfig,
			ClusterctlConfigPath:      e2eCtx.Environment.ClusterctlConfigPath,
			BootstrapClusterProxy:     e2eCtx.Environment.BootstrapClusterProxy,
			ArtifactFolder:            e2eCtx.Settings.ArtifactFolder,
			SkipCleanup:               e2eCtx.Settings.SkipCleanup,
			MgmtFlavor:                "remote-management-cluster",
			InitWithBinary:            e2eCtx.E2EConfig.GetVariable("INIT_WITH_BINARY_V1BETA1"),
			InitWithKubernetesVersion: e2eCtx.E2EConfig.GetVariable("INIT_WITH_KUBERNETES_VERSION"),

This causes the failures when looking up an AMI that does not exist in the test account (maybe it was pruned, or 1.25 is no longer supported and not needed anymore?).

[1] https://github.com/kubernetes-sigs/cluster-api/blob/main/test/e2e/clusterctl_upgrade.go#L237C33-L237C58

@mtulio
Contributor Author

mtulio commented Sep 4, 2024

For reviewers: this PR is generally ready for review. As explained in my last comment, the failure is unrelated to this PR.

pkg/cloud/services/ec2/eip.go Outdated
@nrb
Contributor

nrb commented Sep 4, 2024

/override pull-cluster-api-provider-aws-e2e

This test is failing due to something unrelated right now.

@k8s-ci-robot
Contributor

@nrb: nrb unauthorized: /override is restricted to Repo administrators.

In response to this:

/override pull-cluster-api-provider-aws-e2e

This test is failing due to something unrelated right now.


@mtulio mtulio force-pushed the OCPBUGS-36293-fix-byoip-eip branch from c804c11 to 16eeaa7 Compare September 5, 2024 19:01
@mtulio mtulio force-pushed the OCPBUGS-36293-fix-byoip-eip branch from 16eeaa7 to 4626a6a Compare September 9, 2024 15:35
@mtulio
Contributor Author

mtulio commented Sep 9, 2024

PR #5118 merged, PR rebased to re-test the failed upgrade test.
/test pull-cluster-api-provider-aws-e2e

@mtulio mtulio changed the title 🐛: ec2/byoip: fix EIP leak when creating machine 🐛 ec2/byoip: fix EIP leak when creating machine Sep 9, 2024
@nrb
Contributor

nrb commented Sep 9, 2024

I don't think the failure is related to this PR, nor the previous image issues we were seeing; the AWSMachine resources get to a Ready state and don't have errors themselves.

/retest

@mtulio
Contributor Author

mtulio commented Sep 10, 2024

@mtulio: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-aws-e2e 4626a6a link false /test pull-cluster-api-provider-aws-e2e
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

I am seeing a lot of CloudFormation activity to provision the required environment, but I can't tell whether it is related.

/retest

@mtulio
Contributor Author

mtulio commented Sep 10, 2024

@nrb e2e passing now! 🎉

@nrb
Contributor

nrb commented Sep 10, 2024

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nrb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 10, 2024
@rvanderp3

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 10, 2024
@mtulio
Contributor Author

mtulio commented Sep 10, 2024

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 10, 2024
@k8s-ci-robot k8s-ci-robot merged commit 3f3ce56 into kubernetes-sigs:main Sep 10, 2024
23 checks passed
@mtulio mtulio deleted the OCPBUGS-36293-fix-byoip-eip branch September 10, 2024 14:28
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prevent leaking EIP when creating machines with BYO IPv4 Pool
7 participants