Tabulator runs into GCS object rate-limiting right at startup #1036

Open
chases2 opened this issue Jun 17, 2022 · 5 comments
Labels
bug (Something isn't working), help wanted (Extra attention is needed), kind/oncall-hotlist (Categorizes issue or PR as tracked by test-infra oncall)

Comments

Collaborator

chases2 commented Jun 17, 2022

When the Tabulator first boots, it issues a lot of writes in quick succession. When several of those writes target the same object, we get blocks of these errors (about six per object) for a little while:

jsonPayload: {
  config: "gs://k8s-testgrid/config"
  dashboard: "sig-node-containerd"
  file: "pkg/tabulator/tabstate.go:273"
  func: "github.com/GoogleCloudPlatform/testgrid/pkg/tabulator.Update.func5"
  group: "ci-cos-containerd-node-e2e"
  level: "error"
  msg: "write: client.Upload(gs://k8s-testgrid/tabs/sig-node-containerd/image-validation-node-e2e): close: googleapi: Error 429: The rate of change requests to the object k8s-testgrid/tabs/sig-node-containerd/image-validation-node-e2e exceeds the rate limit. Please reduce the rate of create, update, and delete requests., rateLimitExceeded"
  tab: "image-validation-node-e2e"
}

Then things tend to stabilize, but this burst of errors occurs pretty regularly whenever the Tabulator starts up.

Additional synchronization may be needed to prevent two (or six!) writing goroutines from trying to update exactly the same file at the same time.
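As a rough illustration of what that synchronization could look like, here is a minimal Go sketch that serializes writers per object path with a keyed mutex. The names `keyedMutex` and `maxConcurrentWriters` are hypothetical and not part of the TestGrid code; the upload itself is left as a comment.

```go
package main

import (
	"fmt"
	"sync"
)

// entry is one per-key lock plus a count of goroutines using or waiting on it.
type entry struct {
	mu   sync.Mutex
	refs int
}

// keyedMutex serializes work per key (here: per GCS object path).
// Hypothetical helper for illustration only.
type keyedMutex struct {
	mu    sync.Mutex
	locks map[string]*entry
}

func newKeyedMutex() *keyedMutex {
	return &keyedMutex{locks: map[string]*entry{}}
}

// Lock blocks until the caller holds the lock for key, then returns
// the function that releases it.
func (k *keyedMutex) Lock(key string) (unlock func()) {
	k.mu.Lock()
	e, ok := k.locks[key]
	if !ok {
		e = &entry{}
		k.locks[key] = e
	}
	e.refs++
	k.mu.Unlock()

	e.mu.Lock()
	return func() {
		e.mu.Unlock()
		k.mu.Lock()
		e.refs--
		if e.refs == 0 {
			delete(k.locks, key) // drop idle entries so the map does not grow forever
		}
		k.mu.Unlock()
	}
}

// maxConcurrentWriters starts n goroutines that all "write" the same path and
// reports the highest number that were inside the write section at once.
func maxConcurrentWriters(n int, path string) int {
	km := newKeyedMutex()
	var wg sync.WaitGroup
	var statMu sync.Mutex
	active, peak := 0, 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			unlock := km.Lock(path)
			defer unlock()

			statMu.Lock()
			active++
			if active > peak {
				peak = active
			}
			statMu.Unlock()

			// A real writer would perform the upload for this path here.

			statMu.Lock()
			active--
			statMu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	path := "gs://k8s-testgrid/tabs/sig-node-containerd/image-validation-node-e2e"
	fmt.Println(maxConcurrentWriters(6, path)) // prints 1
}
```

Note that a lock like this only keeps writers from overlapping. Six writes in a row to one object can still exceed a per-object rate limit, so coalescing queued writes or spacing them out per object would probably be needed as well.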

chases2 added the bug, help wanted, and kind/oncall-hotlist labels on Jun 17, 2022
chases2 self-assigned this on Sep 23, 2022
Collaborator Author

chases2 commented Sep 24, 2022

Simply setting concurrency to match production (128), or much higher, didn't reproduce the error. I plan to check the pubsub functionality next; I suspect that's the only practical way to get a bunch of writes queued to the same object.

Contributor

listx commented Oct 20, 2022

Any update here, @chases2?

Collaborator Author

chases2 commented Nov 28, 2022

This issue doesn't seem to be occurring much, but the Tabulator is crashing pretty frequently without logging any errors.
image

I'm not confident calling this "fixed" until the Tabulator is behaving well and running continuously for at least a few hours at a time.

Collaborator Author

chases2 commented Nov 29, 2022

Moved the crash issue to #1089. Once that's resolved, I'll check again for the initial symptom of this bug (a lot of rateLimitExceeded errors at startup).

Collaborator Author

chases2 commented Dec 22, 2022

We still get blocks of these errors, though they now show up occasionally throughout the runtime rather than all at once. Not sure how much of an issue this is. Considering downgrading them to "warning" for the time being.
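For reference, a downgrade like that could be as small as picking the log level from the HTTP status code. This is only a sketch under assumptions: `apiError` and `levelFor` are hypothetical names, and `apiError` stands in for the real `*googleapi.Error` (which carries a `Code` field) so the example needs only the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// apiError is a hypothetical stand-in for an API error that carries an
// HTTP status code, used here so the sketch has no external dependencies.
type apiError struct {
	Code int
	Msg  string
}

func (e *apiError) Error() string {
	return fmt.Sprintf("googleapi: Error %d: %s", e.Code, e.Msg)
}

// levelFor returns "warning" when err is, or wraps, a rate-limit error
// (HTTP 429) and "error" for anything else.
func levelFor(err error) string {
	var ae *apiError
	if errors.As(err, &ae) && ae.Code == http.StatusTooManyRequests {
		return "warning"
	}
	return "error"
}

func main() {
	inner := &apiError{Code: http.StatusTooManyRequests, Msg: "rateLimitExceeded"}
	wrapped := fmt.Errorf("write: upload: close: %w", inner)
	fmt.Println(levelFor(wrapped))                      // prints "warning"
	fmt.Println(levelFor(errors.New("connection lost"))) // prints "error"
}
```

Using `errors.As` means the check still works when the 429 is wrapped several layers deep, as it is in the log message above.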
