-
Notifications
You must be signed in to change notification settings - Fork 233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add remote persistent worker support #787
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM from BuildBuddy side.
CI failed due to a Github outage yesterday, but I think a rebase + push should retry it.
examples/persistent_worker/README.md
Outdated
export BUILDBUDDY_CONTAINER_USER=... # GitHub user name | ||
export BUILDBUDDY_CONTAINER_PASSWORD=... # GitHub access token |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just want to note that this is optional if the container image you are using is publicly downloadable.
remote_execution_properties = { | ||
"OSFamily": "Linux", | ||
"container-image": image, | ||
"workload-isolation-type": "podman", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: don't need to set isolation type specifically. We(BuildBuddy) may want to change the default isolation type underneath while maintaining backward compatibility. (In fact, we did recently stopped using podman as default isolation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to set it this way because I got a credentials error on image download with the default.
Thanks for the PR! This looks good. We started working on support for this internally but unfortunately it doesn't match the bazel spec exactly. For reasons that aren't entirely clear to me we can't attach the 'worker key' to the platform so it's attached elsewhere on the action. I can't see why we can't support both APIs though and have a default 'bazel mode' for this behavior. The main other difference is that on our end we construct an RE::Action for the worker, upload it and use the digest of that action as the 'worker key', instead of what I assume is requiring that the worker args are a prefix of the action args (in which case the 'worker key' doesn't really seem to matter?). Similarly we can support both in 'bazel mode' though. Just a heads up that there is likely to be some churn around this at some point and I'm slightly concerned we may not have an easy time testing that this does the right thing in all edge cases for 'bazel mode', but I suppose we can deal with that when we get to it. If it's easy, it would be nice to have a github action for testing the remote example. I'm not sure how difficult that is. |
That's great to hear! Yes, I was hoping for something along those lines.
In the Bazel version the worker key is used to associate a given action with a potentially already running worker instance on a remote executor node. But, it is not directly tied to any kind of previously uploaded blob. Bazel calculates a digest of the worker command and its inputs and uses that as a worker key. In this PR I went for the same approach.
That makes sense. I'll look into how to test this on the CI. |
I noticed that the example did not use the I will continue looking into ways to test this feature on CI. |
f179b29
to
f8c20ef
Compare
Just to give a heads up on expected progress here. I'm on leave for the next two weeks and will get back to this when I'm back. I've started making the setup independent on Nix, so that it is easier to integrate with CI here. That's already working locally. |
Hi @aherrmann, @KapJI is looking into setting up a build buddy account. |
d9c7b9b
to
e4e6667
Compare
@christolliday @KapJI I've rebased this PR and added the changes to make it independent of Nix so we no longer require a custom worker image. I've also added a CI test for the persistent worker examples, the test requires a GitHub secret named |
I added |
@KapJI Thank you! That sounds great, I don't think it should require any additional configuration. I'll test and debug the CI setup and let you know if I run into anything. |
One issue I'm encountering is that repository secrets are not exposed to GH actions runs that are initiated from forks (as is the case with this PR) for security reasons (see here). Here's what I've done now:
@KapJI could I ask you to trigger a CI run of this PR from within the Buck2 repo to test the remote execution cases? (After convincing yourself that this PR doesn't do anything dodgy with the token). |
@KapJI has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@aherrmann It's running on #800 |
@KapJI Thanks! Unfortunately, it looks like the token is still not available:
Is |
@aherrmann Sorry for the late reply. I fixed the secret and now it seems to run: https://github.com/facebook/buck2/actions/runs/12300843378/job/34330007436?pr=800
On #800 I also rebased PR to the latest version where anyhow is removed and applied some code formatting. |
62b3f8a
to
1ecb88d
Compare
@KapJI Thank you! I've updated this PR to capture the rebase and formatting fixes. I looked into the CI failure on #800. The issue seems to be that the actions are cached and don't report local or persistent worker execution as expected. I had tested this script with a read-only BuildBuddy token, but it looks like Buck2 CI has a read+write token. I've addressed this issue by adding a "cache buster". An env-var input to the action that is changed each time the test runs. I also tried the I've tested these changes against a read+write BuildBuddy token as well, and the tests passed. Hopefully they'll pass on Buck2 CI as well now. Please let me know if there is anything else I should address before this PR is ready for merge. |
Yes, random cache buster env var is a better choice. |
The linter errors on CI seem unrelated to the changes introduced by this PR. |
I'm merging fixes for failing lints right now. Once those are merged, let's rebase and make sure CI is passing. |
`WorkerRunInfo` has two fields `worker` and `exe` for the persistent worker command or the non-worker mode command respectively. This commit changes the remote persistent worker example to use these appropriately to distinguish between worker and non-worker mode execution.
The Buck2 Rust implementation needs to use the worker command instead of the non-worker command for remote persistent worker execution mode.
We want to use a Buck2 managed hermetic Python toolchain. However, Python binaries generated by such a toolchain are sensitive to `PWD` and don't work in other working directories than the repository root. Unfortunately, `genrule` changes directory and is hence incompatible with such Python binaries. This commit defines a dedicated rule to invoke protoc without changing the working directory.
The previous setup used Nix to provide a Python toolchain and packages in a reproducible fashion. However, this requires dedicated remote worker images with the Nix store paths pre-populated which complicates the setup for testing on Buck2 CI. Using a hermetic Python toolchain avoids these issues and works on the standard remote execution image.
Remove the old genrule targets to generate the Python gRPC/protobuf bindings.
Without Nix it is no longer required to use a custom remote worker image.
It's useful to test multiple builds in parallel to catch potential issues related parallel worker requests. However, we also don't need to be excessive in the number of tests to not unnecessarily waste resources.
Requires a repository secret to be set up for the BuildBuddy API key named `BUILDBUDDY_API_KEY`.
The test wants to make sure that the actions are executed correct using either the remote persistent worker or running as individual actions on the remote execution system. Caching interferes with this test. This injects a cache-silo-key that changes each time to force a re-run of the action.
19e80bf
to
28975cb
Compare
@KapJI I've rebased on main. |
Closes #776.
Implements support for persistent workers in remote builds using the Bazel remote execution protocol and the approach documented in the Bazel remote persistent workers proposal:
https://github.com/bazelbuild/proposals/blob/main/designs/2021-03-06-remote-persistent-workers.md
Includes an example setup that works with
The Bazel remote persistent worker protocol includes an automatic fallback in cases where the remote execution system does not yet support persistent workers. To that end actions take the shape
The remote execution system separates worker arguments on the command-line from request arguments in the response file and adds the
--persistent_worker
flag.The demo worker included in the example in this PR distinguishes between Buck2 worker, Bazel remote worker, and one-shot modes depending on whether Buck2's
WORKER_SOCKET
, Bazel's--persistent_worker
flag, or neither is set.The example includes a README with detailed instructions how to test this feature.