
Switch to C3 machine types for prow build cluster #6525

Open · wants to merge 1 commit into main

Conversation

@upodroid (Member) commented Mar 5, 2024

/cc @ameukam @BenTheElder @xmudrii

We should get better performance out of these instances.

C3 instance types are a bit tricky. Our options are:

  • c3d-highmem-8-lssd, which provides 8 cores, 64 GB of RAM and a single local SSD
  • c3-standard-8-lssd, which provides 8 cores, 32 GB of RAM and two local SSDs

I'm in favour of offering 7 cores and 28GB to all our prowjobs across the board. This setting guarantees each job a dedicated node for testing; the remaining core is reserved for us to run our own workloads (daemonsets, etc.) on.
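
As a rough illustration of the budget behind those numbers (a minimal sketch; the 1-core / 4 GB reservation for system overhead and daemonsets is an assumption, not a measured figure):

```python
# Minimal sketch of the "7 cores / 28GB per prowjob" budget, assuming a
# c3-standard-8-lssd shape (8 vCPU / 32 GB) and that roughly 1 vCPU / 4 GB is
# held back for the kubelet, system reservations and our daemonsets
# (assumed numbers, not measured).
MACHINE_CPU, MACHINE_MEM_GB = 8, 32   # c3-standard-8-lssd
RESERVED_CPU, RESERVED_MEM_GB = 1, 4  # system + daemonsets (assumption)

JOB_CPU = MACHINE_CPU - RESERVED_CPU            # 7 cores offered to every prowjob
JOB_MEM_GB = MACHINE_MEM_GB - RESERVED_MEM_GB   # 28 GB offered to every prowjob

# Two such requests cannot be bin-packed onto one node, so each prowjob
# effectively gets a dedicated node.
assert 2 * JOB_CPU > MACHINE_CPU and 2 * JOB_MEM_GB > MACHINE_MEM_GB
print(f"default prowjob request: cpu={JOB_CPU}, memory={JOB_MEM_GB}Gi")
```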

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 5, 2024
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: upodroid

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/infra Infrastructure management, infrastructure design, code in infra/ area/prow Setting up or working with prow in general, prow.k8s.io, prow build clusters approved Indicates a PR has been approved by an approver from all required OWNERS files. area/terraform Terraform modules, testing them, writing more of them, code in infra/gcp/clusters/ sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Mar 5, 2024
@upodroid upodroid changed the title from "Switch to N2 machine types" to "Switch to N2 machine types for prow build cluster" Mar 5, 2024
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 5, 2024
@BenTheElder (Member) commented Mar 5, 2024

> I also used a custom type to insert 2 extra cores for all the extra daemonsets we need to run.

We really shouldn't need 2 cores of daemonsets ...?
/hold

We should be able to tighten this up; discarding 20% of the node as overhead is excessive.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 5, 2024
@BenTheElder (Member):

Maybe we can import from the existing metrics collection?

@upodroid (Member, Author) commented Mar 5, 2024

According to the calculations here:
We could adopt c3-standard-8-lssd right now and disable calico. I'm not sure why we need a network policy enabled on the build cluster. The capacity being used by calico would then be available for node-exporter.

@BenTheElder (Member):

> [...] and disable calico. I'm not sure why we need a network policy enabled on the build cluster. The capacity being used by calico would be available for node-exporter

That sounds good to me (I haven't reviewed the machine types yet). I don't recall a discussion about enabling network policy and I don't think we need it; getting 400m of CPU back gives us quite a bit of room to fit other daemonsets. cc @ameukam

@BenTheElder (Member):

Just checked: k8s-prow-builds in google.com (cluster: default) has calico network policy disabled.

@upodroid (Member, Author) commented Mar 5, 2024

One more question: are we OK with nodes having 32 GB of memory instead of 64 GB?

@BenTheElder (Member) commented Mar 5, 2024

EDIT: jinx
c3-standard-8-lssd has lower memory per core (32 GB for 8 cores) versus n1-highmem-8 (52 GB for 8 cores).

I'm not sure whether that's a problem for any of our workloads. I think we might have a few that currently request a somewhat lopsided memory:CPU ratio.

@upodroid upodroid added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Mar 5, 2024
@upodroid (Member, Author) commented Mar 5, 2024

Apply plan:

  1. Apply the 1st commit using Terraform
  2. Cordon and drain the old nodepool in the console
  3. Apply the 2nd commit

This will be done after lgtm is applied on the PR.
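
For illustration, a rough sketch of what those three steps could look like scripted (the Terraform directory and the old nodepool label value below are hypothetical placeholders, not the actual names):

```python
# Rough sketch of the rollout plan above; the Terraform directory and the
# nodepool label value are hypothetical placeholders.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

TF_DIR = "infra/gcp/terraform/k8s-infra-prow-build"        # hypothetical path
OLD_POOL_SELECTOR = "cloud.google.com/gke-nodepool=pool5"  # hypothetical pool name

# 1. Apply the 1st commit (adds the new node pool).
run(["terraform", f"-chdir={TF_DIR}", "apply"])

# 2. Cordon and drain the old node pool so new jobs land on the new pool.
nodes = subprocess.check_output(
    ["kubectl", "get", "nodes", "-l", OLD_POOL_SELECTOR, "-o", "name"],
    text=True,
).split()
for node in nodes:
    run(["kubectl", "cordon", node])
for node in nodes:
    run(["kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data"])

# 3. Apply the 2nd commit (removes the old node pool).
run(["terraform", f"-chdir={TF_DIR}", "apply"])
```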

@upodroid upodroid changed the title from "Switch to N2 machine types for prow build cluster" to "Switch to C3 machine types for prow build cluster" Mar 5, 2024
@BenTheElder (Member):

I think we should probably query the jobs to see if any request 30+ GB or a larger memory:core ratio than C3; if so, we can do N2?
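
A sketch of that kind of query, assuming a local checkout of the prow job configs (the glob pattern and config layout below are assumptions) and treating roughly 4 GB per core (28Gi over 7 cores) as the C3 ratio:

```python
# Rough sketch, not a real audit script: scan prowjob configs for containers
# whose memory request exceeds what a c3-standard-8-lssd node can offer
# (~28Gi) or whose memory:cpu ratio exceeds ~4 GB per core.
# The glob pattern and the config layout are assumptions about a local checkout.
import glob

import yaml  # PyYAML

def cpu_to_cores(value):
    s = str(value)
    return float(s[:-1]) / 1000 if s.endswith("m") else float(s)

def mem_to_gib(value):
    s = str(value).upper()
    units = {"KI": 2**10, "MI": 2**20, "GI": 2**30, "K": 1e3, "M": 1e6, "G": 1e9}
    for suffix, mult in units.items():
        if s.endswith(suffix):
            return float(s[: -len(suffix)]) * mult / 2**30
    return float(s) / 2**30  # plain bytes

for path in glob.glob("config/jobs/**/*.yaml", recursive=True):  # assumed layout
    with open(path) as f:
        doc = yaml.safe_load(f) or {}
    for section in ("periodics", "presubmits", "postsubmits"):
        jobs = doc.get(section) or []
        # presubmits/postsubmits are keyed by repo; periodics are a flat list
        if isinstance(jobs, dict):
            jobs = [j for repo_jobs in jobs.values() for j in repo_jobs or []]
        for job in jobs:
            for c in (job.get("spec") or {}).get("containers", []):
                req = (c.get("resources") or {}).get("requests", {})
                if "memory" not in req:
                    continue
                mem = mem_to_gib(req["memory"])
                cpu = cpu_to_cores(req.get("cpu", 1))
                if mem > 28 or mem / cpu > 4:
                    print(f"{path}: {job.get('name')} cpu={cpu:g} memory={mem:.0f}Gi")
```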

@ameukam (Member) commented Mar 6, 2024

> [...] and disable calico. I'm not sure why we need a network policy enabled on the build cluster. The capacity being used by calico would be available for node-exporter
>
> That sounds good to me (have not reviewed machine types yet). I don't recall a discussion about enabling network policy and I don't think we need it; 400m CPU is quite a bit back to fit other daemonsets. cc @ameukam

Network Policy was enabled by default at the time the cluster was created. We never took the opportunity to disable it. +1 for the disablement.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 7, 2024
@upodroid (Member, Author) commented Mar 7, 2024

Instead of tweaking the prowjobs, let's go with N2 instances (Ice Lake platform), keep the same machine type, and merge that change this weekend.

@k8s-triage-robot:

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 5, 2024
@k8s-triage-robot:

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 5, 2024
@upodroid upodroid removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 5, 2024
@k8s-ci-robot k8s-ci-robot added area/infra/gcp Issues or PRs related to Kubernetes GCP infrastructure size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 2, 2024

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 1, 2024