Add alerts to catch Knative TestGrid pods not running #1066

Open
michelle192837 opened this issue Oct 12, 2022 · 1 comment

The Knative TestGrid pods are stuck in CrashLoopBackOff because of a permissions issue when reading the config, e.g.:

jsonPayload: {
error: "observe config: can't read "gs://knative-own-testgrid/config": open: Get "https://storage.googleapis.com/knative-own-testgrid/config": compute: Received 403 `Unable to generate access token; IAM returned 403 Forbidden: The caller does not have permission
This error could be caused by a missing IAM policy binding on the target IAM service account.
For more information, refer to the Workload Identity documentation:
	https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to

`"
file: "cmd/summarizer/main.go:151"
func: "main.main"
level: "error"
msg: "Could not summarize"
}

I ran https://github.com/GoogleCloudPlatform/testgrid/blob/master/cluster/bind-service-accounts.sh to see whether any of the service accounts (SAs) needed to be re-bound, and the answer was yes:

./bind-service-accounts.sh
Service accounts:
./canary/api.yaml:    iam.gke.io/gcp-service-account: [email protected]
./canary/api.yaml:  namespace: testgrid-canary
./canary/api.yaml:      serviceAccountName: api
./canary/config_merger.yaml:    iam.gke.io/gcp-service-account: [email protected]
./canary/config_merger.yaml:  namespace: testgrid-canary
./canary/config_merger.yaml:      serviceAccountName: config-merger
./canary/monitoring.yaml:  namespace: testgrid-canary
./canary/summarizer.yaml:    iam.gke.io/gcp-service-account: [email protected]
./canary/summarizer.yaml:  namespace: testgrid-canary
./canary/summarizer.yaml:      serviceAccountName: summarizer
./canary/tabulator.yaml:    iam.gke.io/gcp-service-account: [email protected]
./canary/tabulator.yaml:  namespace: testgrid-canary
./canary/tabulator.yaml:      serviceAccountName: tabulator
./canary/updater.yaml:    iam.gke.io/gcp-service-account: [email protected]
./canary/updater.yaml:  namespace: testgrid-canary
./canary/updater.yaml:      serviceAccountName: updater
./prod/config_merger.yaml:    iam.gke.io/gcp-service-account: [email protected]
./prod/config_merger.yaml:  namespace: testgrid
./prod/config_merger.yaml:      serviceAccountName: config-merger
./prod/knative/summarizer.yaml:    iam.gke.io/gcp-service-account: [email protected]
./prod/knative/summarizer.yaml:  namespace: knative
./prod/knative/summarizer.yaml:      serviceAccountName: summarizer
./prod/knative/tabulator.yaml:    iam.gke.io/gcp-service-account: [email protected]
./prod/knative/tabulator.yaml:  namespace: knative
./prod/knative/tabulator.yaml:      serviceAccountName: tabulator
./prod/knative/updater.yaml:    iam.gke.io/gcp-service-account: [email protected]
./prod/knative/updater.yaml:  namespace: knative
./prod/knative/updater.yaml:      serviceAccountName: updater
./prod/monitoring.yaml:  namespace: testgrid
./prod/README.md:1. Bind the service account(s) for the component in the `testgrid-canary` namespace:
./prod/README.md:1. Bind the service account(s) for the component in the `testgrid` namespace:
./prod/summarizer.yaml:    iam.gke.io/gcp-service-account: [email protected]
./prod/summarizer.yaml:  namespace: testgrid
./prod/summarizer.yaml:      serviceAccountName: summarizer
./prod/tabulator.yaml:    iam.gke.io/gcp-service-account: [email protected]
./prod/tabulator.yaml:  namespace: testgrid
./prod/tabulator.yaml:      serviceAccountName: tabulator
./prod/updater.yaml:    iam.gke.io/gcp-service-account: [email protected]
./prod/updater.yaml:  namespace: testgrid
./prod/updater.yaml:      serviceAccountName: updater
./setup.sh:echo -n 'testgrid namespace: ' >&2
NOOP: testgrid-canary/config-merger has workloadIdentityUser access to [email protected]
NOOP: testgrid-canary/summarizer has workloadIdentityUser access to [email protected]
NOOP: testgrid-canary/tabulator has workloadIdentityUser access to [email protected]
NOOP: testgrid-canary/updater has workloadIdentityUser access to [email protected]
serviceAccount:knative-tests.svc.id.goog[test-pods/testgrid-updater] in serviceAccount:k8s-testgrid.svc.id.goog[knative/summarizer]
Grant serviceAccount:k8s-testgrid.svc.id.goog[knative/summarizer] roles/iam.workloadIdentityUser access to [email protected]? [y/N] y
+ /usr/bin/gcloud iam service-accounts --project knative-tests add-iam-policy-binding [email protected] --role roles/iam.workloadIdentityUser --member 'serviceAccount:k8s-testgrid.svc.id.goog[knative/summarizer]'
Updated IAM policy for serviceAccount [[email protected]].
bindings:
- members:
  - serviceAccount:k8s-testgrid.svc.id.goog[knative/summarizer]
  - serviceAccount:knative-tests.svc.id.goog[test-pods/testgrid-updater]
  role: roles/iam.workloadIdentityUser
etag: BwXq2u1cNwo=
version: 1
DONE: gave knative/summarizer workloadIdentityUser access to [email protected]
serviceAccount:knative-tests.svc.id.goog[test-pods/testgrid-updater] in serviceAccount:k8s-testgrid.svc.id.goog[knative/tabulator]
Grant serviceAccount:k8s-testgrid.svc.id.goog[knative/tabulator] roles/iam.workloadIdentityUser access to [email protected]? [y/N] y
+ /usr/bin/gcloud iam service-accounts --project knative-tests add-iam-policy-binding [email protected] --role roles/iam.workloadIdentityUser --member 'serviceAccount:k8s-testgrid.svc.id.goog[knative/tabulator]'
Updated IAM policy for serviceAccount [[email protected]].
bindings:
- members:
  - serviceAccount:k8s-testgrid.svc.id.goog[knative/summarizer]
  - serviceAccount:k8s-testgrid.svc.id.goog[knative/tabulator]
  - serviceAccount:knative-tests.svc.id.goog[test-pods/testgrid-updater]
  role: roles/iam.workloadIdentityUser
etag: BwXq2u2Rpkc=
version: 1
DONE: gave knative/tabulator workloadIdentityUser access to [email protected]
serviceAccount:knative-tests.svc.id.goog[test-pods/testgrid-updater] in serviceAccount:k8s-testgrid.svc.id.goog[knative/updater]
Grant serviceAccount:k8s-testgrid.svc.id.goog[knative/updater] roles/iam.workloadIdentityUser access to [email protected]? [y/N] y
+ /usr/bin/gcloud iam service-accounts --project knative-tests add-iam-policy-binding [email protected] --role roles/iam.workloadIdentityUser --member 'serviceAccount:k8s-testgrid.svc.id.goog[knative/updater]'
Updated IAM policy for serviceAccount [[email protected]].
bindings:
- members:
  - serviceAccount:k8s-testgrid.svc.id.goog[knative/summarizer]
  - serviceAccount:k8s-testgrid.svc.id.goog[knative/tabulator]
  - serviceAccount:k8s-testgrid.svc.id.goog[knative/updater]
  - serviceAccount:knative-tests.svc.id.goog[test-pods/testgrid-updater]
  role: roles/iam.workloadIdentityUser
etag: BwXq2u4Lseg=
version: 1
DONE: gave knative/updater workloadIdentityUser access to [email protected]
NOOP: testgrid/config-merger has workloadIdentityUser access to [email protected]
NOOP: testgrid/summarizer has workloadIdentityUser access to [email protected]
NOOP: testgrid/tabulator has workloadIdentityUser access to [email protected]
NOOP: testgrid/updater has workloadIdentityUser access to [email protected]
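
For anyone reading the output above: each NOOP/Grant decision comes down to whether the Workload Identity member string for a Kubernetes service account appears under `roles/iam.workloadIdentityUser` in the GCP service account's IAM policy. Below is a minimal Python sketch of that check. It is not the logic of bind-service-accounts.sh itself; the function names are hypothetical, and the policy is assumed to be a dict shaped like the `bindings` output printed by gcloud above.

```python
WORKLOAD_IDENTITY_ROLE = "roles/iam.workloadIdentityUser"


def workload_identity_member(project, namespace, ksa):
    """Build the IAM member string for a Kubernetes service account.

    Format: serviceAccount:PROJECT.svc.id.goog[NAMESPACE/KSA_NAME]
    """
    return f"serviceAccount:{project}.svc.id.goog[{namespace}/{ksa}]"


def has_binding(policy, member, role=WORKLOAD_IDENTITY_ROLE):
    """Return True if `member` is listed under `role` in the policy dict."""
    return any(
        binding.get("role") == role and member in binding.get("members", [])
        for binding in policy.get("bindings", [])
    )


# Policy as it looked before the first grant in the output above
# (only the test-pods/testgrid-updater member was bound).
policy_before = {
    "bindings": [
        {
            "members": [
                "serviceAccount:knative-tests.svc.id.goog[test-pods/testgrid-updater]",
            ],
            "role": "roles/iam.workloadIdentityUser",
        }
    ]
}

member = workload_identity_member("k8s-testgrid", "knative", "summarizer")
print(member)
print(has_binding(policy_before, member))  # False: binding was missing

# Simulate what add-iam-policy-binding does to the policy.
policy_after = {
    "bindings": [
        {
            "members": policy_before["bindings"][0]["members"] + [member],
            "role": "roles/iam.workloadIdentityUser",
        }
    ]
}
print(has_binding(policy_after, member))  # True: binding is now present
```

A missing member here corresponds to the 403 "missing IAM policy binding" error in the log, since the pod cannot obtain an access token for the GCP service account.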

michelle192837 commented Oct 12, 2022

It looks like the pods are able to start now! Remaining tasks:

  • Wait for the pods to catch up on updates, to verify that the problem is fixed.
  • Add alerts to catch pods stuck in CrashLoopBackOff for too long.
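
One possible shape for the alert condition, as a hedged sketch only: the Python below scans a pod list (the JSON structure returned by `kubectl get pods -o json`) for containers waiting with reason `CrashLoopBackOff`. Using a restart-count threshold as a stand-in for "too long" is an assumption, as are the function name, the threshold of 5, and the sample pod names. A real alert would more likely live in the cluster's monitoring stack.

```python
def crashlooping_containers(pod_list, min_restarts=5):
    """Return (namespace, pod, container, restarts) for crash-looping containers.

    `pod_list` is a dict shaped like `kubectl get pods -o json` output.
    `min_restarts` is a hypothetical proxy for "stuck for too long".
    """
    found = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for cs in statuses:
            waiting = cs.get("state", {}).get("waiting") or {}
            restarts = cs.get("restartCount", 0)
            if waiting.get("reason") == "CrashLoopBackOff" and restarts >= min_restarts:
                found.append(
                    (meta.get("namespace"), meta.get("name"), cs.get("name"), restarts)
                )
    return found


# Hypothetical sample data: one crash-looping pod and one healthy pod.
sample = {
    "items": [
        {
            "metadata": {"namespace": "knative", "name": "summarizer-abc123"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "summarizer",
                        "restartCount": 12,
                        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                    }
                ]
            },
        },
        {
            "metadata": {"namespace": "knative", "name": "updater-def456"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "updater",
                        "restartCount": 0,
                        "state": {"running": {"startedAt": "2022-10-12T00:00:00Z"}},
                    }
                ]
            },
        },
    ]
}

for namespace, pod, container, restarts in crashlooping_containers(sample):
    print(f"ALERT: {namespace}/{pod} container {container} restarted {restarts} times")
```

The threshold keeps a pod that crashes once during a rollout from firing the alert, at the cost of reacting more slowly to a genuine outage like this one.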

chases2 changed the title from "Knative TestGrid pods not running" to "Add alerts to catch Knative TestGrid pods not running" on Nov 29, 2022