
How calculate-docker-image works in PyTorch CI

Huy Do edited this page Jan 8, 2025 · 8 revisions

What is this about?

The GitHub composite action calculate-docker-image is a fairly complex GitHub Action (GHA), and it probably tries to do more than it should. It is used throughout PyTorch CI wherever a Docker image is involved, for example in the PyTorch _linux_build and _linux_test workflows, the Docker build workflow, or Nova Linux jobs.

At a high level, the action does two main things:

  1. Given a docker image name, it checks for the availability of the image on our AWS ECR 308535385114.dkr.ecr.us-east-1.amazonaws.com.
  2. When the requested image is not available, the GHA will try to build the Docker image so that the workflow calling it has the image to continue what it's doing.

The actual implementation of the GHA, however, is more complex, with a set of parameters that interact with each other in a not-so-straightforward way. So when it works, it's great. But when there are issues, it's not easy to debug.

How calculate-docker-image really works

Let's start with the list of parameters of the GHA. There are 7 of them at the moment. The trio docker-image-name, docker-build-dir, and docker-registry is used to check for the image on ECR, while the other three amigos, always-rebuild, push, and force-push, control the build process. The last parameter, working-directory, plays a smaller role: it is only used by the Nova workflow to point to the checked-out repository.

1. Check for the image on ECR

  1. docker-image-name. This is the name of the Docker image that the GHA is looking for.
  2. docker-build-dir. This is the directory where the Docker build script lives. The convention is to use the .ci/docker directory, as PyTorch and ExecuTorch do. The GHA looks for a build.sh script under this directory to trigger the Docker build process.
  3. docker-registry. This defaults to 308535385114.dkr.ecr.us-east-1.amazonaws.com, at least until we decide to move the ECR somewhere else, such as the LF AWS account.

These parameters are used to check if the requested Docker image exists. Simple, right? Not so fast. Let's take a look at how the check is performed.

---
title: How calculate-docker-image checks for a Docker image
---
flowchart TD
   B@{ shape: circle, label: "Start" }
     --> check-short-name@{shape: diamond, label: "Is short name?"}

   check-short-name
      -->|y| short-name@{ shape: lean-r, label: "#36;#123;DOCKER_IMAGE_NAME#125;, e.g. pytorch-linux-focal-linter" }
      --> compute-short-form-tag[**Compute the docker image tag** using git rev-parse HEAD:#34;#36;#123;DOCKER_BUILD_DIR#125;#34;. The tag depends on the content of DOCKER_BUILD_DIR. When files in that directory are updated, a new tag is generated signifying that a new Docker image is needed#185;#178;]
      --> full-name@{ shape: lean-r, label: "#36;#123;DOCKER_REGISTRY#125;/#36;#123;REPO_NAME#125;/#36;#123;DOCKER_IMAGE_NAME#125;:#36;#123;DOCKER_TAG#125;, e.g. 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-linter:88e25063afb45411a0d16539a1335bd864b6f2be" }

    check-short-name
      -->|n| full-name

    full-name
      --> login-ecr[Login to ECR]
      --> check-image-exist@{shape: diamond, label: "Does it exist?"}
    
    check-image-exist
      -->|y| E@{ shape: circle, label: "Stop" }

    check-image-exist
      -->|n| should-build-image@{shape: diamond, label: "PUSH is set?"}

    should-build-image
      -->|y| build-image[[Build the image]]

    should-build-image
      -->|n| wait((Wait up to 90 minutes#179;#8308;))
 
    wait
      --> should-build-image-2@{shape: diamond, label: "Is it there yet?"}

    should-build-image-2
      -->|y| E

    should-build-image-2
      -->|n| build-image

    build-image
      --> E
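The name-resolution half of the flowchart can be sketched in Python. This is a minimal illustration, not the action's actual code: the function names are hypothetical, the `pytorch` default for REPO_NAME is taken from the example in the flowchart, and the rule that a name already starting with the registry counts as a full name is an assumption about how the short-name check works.

```python
import subprocess

# Default registry named in this page.
DEFAULT_REGISTRY = "308535385114.dkr.ecr.us-east-1.amazonaws.com"


def compute_docker_tag(build_dir: str, repo_root: str = ".") -> str:
    """Return the git tree hash of build_dir at HEAD.

    Mirrors: git rev-parse HEAD:"${DOCKER_BUILD_DIR}"
    The hash changes only when tracked content under build_dir changes.
    """
    result = subprocess.run(
        ["git", "rev-parse", f"HEAD:{build_dir}"],
        cwd=repo_root,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def resolve_full_image_name(
    image_name: str,
    tag: str,
    repo_name: str = "pytorch",  # assumption: taken from the example above
    registry: str = DEFAULT_REGISTRY,
) -> str:
    """Expand a short image name into registry/repo/name:tag."""
    # Assumption: a name that already starts with the registry is "full"
    # and is used as is. The real action may test this differently.
    if image_name.startswith(registry + "/"):
        return image_name
    return f"{registry}/{repo_name}/{image_name}:{tag}"
```

In a real checkout the tag would come from `compute_docker_tag(".ci/docker")`. Because the tag is a git tree hash, only files tracked under the build directory influence it.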

Footnotes:

  1. Using a bash script to copy files into the .ci/docker directory is an anti-pattern: the copied files are not tracked in git, so they do not change the tag, which leads to stale tags and other nasty problems. It's a very easy mistake to make because Docker build only accepts files in the build context (https://docs.docker.com/build/concepts/context), a.k.a. the .ci/docker folder. So people wrongly assume that they can just copy any files they need there before the build starts by tweaking the .ci/docker/build.sh script.
  2. The GHA requires a full checkout because it performs a check against the merge base (for a PR) or the parent commit (for a trunk commit). Maybe there is a way to get rid of this check, but it's not that critical because PyTorch performs a full checkout in CI anyway.
  3. The 90-minute wait was a recent change prompted by https://github.com/pytorch/pytorch/issues/141885. PyTorch Docker images have grown to the point that they can no longer be built on a linux.2xlarge runner: the build fails with either an OOM error or a timeout. So https://github.com/pytorch/test-infra/pull/6013 made it so that when a new Docker image is needed, all the build jobs running on linux.2xlarge wait up to 90 minutes for the dedicated Docker build job, running on a larger linux.12xlarge, to get the image ready. Once it's there, PyTorch build jobs continue as usual.
  4. Building the new Docker image on a dedicated Docker build job also addresses the rate limit issue with docker.io, which is covered in the next section.
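The wait-then-build branch described in footnote 3 can be illustrated with a small polling loop. This is a sketch under stated assumptions: the function names are hypothetical, the 60-second polling interval is a guess rather than the action's real value, and `image_exists` stands in for whatever ECR lookup the action performs. Only the 90-minute budget comes from the text above.

```python
import time
from typing import Callable

WAIT_TIMEOUT_S = 90 * 60  # the 90-minute budget described above
POLL_INTERVAL_S = 60      # assumption: the real polling interval may differ


def wait_for_image(
    image_exists: Callable[[], bool],
    timeout_s: float = WAIT_TIMEOUT_S,
    poll_interval_s: float = POLL_INTERVAL_S,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until image_exists() is true or the timeout elapses.

    Returns True if the image showed up in time, False otherwise.
    """
    deadline = clock() + timeout_s
    while True:
        if image_exists():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


def next_step_after_check(
    image_exists: Callable[[], bool], push: bool, **wait_options
) -> str:
    """Return "stop" if the image can be used as is, else "build"."""
    if image_exists():
        return "stop"
    if push:
        # Jobs that push (the dedicated Docker build) build right away.
        return "build"
    # Other jobs wait for the dedicated build job, then fall back to building.
    return "stop" if wait_for_image(image_exists, **wait_options) else "build"
```

Passing `sleep` and `clock` as parameters is only there so the loop can be exercised without actually waiting 90 minutes.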

2. Build the image

  1. always-rebuild. As its name implies, if this is set, the above check will be skipped and the image will always be built.
  2. push. If this is set, the GHA will upload the image to ECR ONLY WHEN it doesn't exist.
  3. force-push. If this is set and if push is also set, the action will always upload the image to ECR.
---
title: How calculate-docker-image builds a new Docker image
---
flowchart TD
  B@{ shape: circle, label: "Start" }
    --> check-image[[Check for the image on ECR ]]
    --> check-image-exist@{shape: diamond, label: "Does it exist?"}

  check-image-exist
    -->|y| always-rebuild@{shape: diamond, label: "Always rebuild?"}

  always-rebuild
    -->|n| E@{ shape: circle, label: "Stop" }

  always-rebuild
    -->|y| login-docker[Login to docker.io#185;]
    --> build-image[Build docker image by calling build.sh in .ci/docker]

  check-image-exist
    -->|n| login-docker

  build-image
    --> should-push-image@{shape: diamond, label: "PUSH is set?"}

  should-push-image
    -->|n| E

  should-push-image
    -->|y| check-image-exist-2@{shape: diamond, label: "Check if the image is there on ECR again?"}
    -->|n| push-image[Push the image to ECR]
    -->E

  check-image-exist-2
    -->|y| should-force-push-image@{shape: diamond, label: "FORCE_PUSH is set?"}
    -->|n| E

  should-force-push-image
    -->|y| push-image

Straightforward, eh?

Footnotes:

  1. Logging in to docker.io is needed because the base Docker image usually lives there. If the base image comes from elsewhere, for example quay.io, we might need to log in there too, but that's not implemented at the moment. Logging in to docker.io is done both at the runner level, using the post-installation script, and at the workflow level, as a step in the GHA. The credential is stored in AWS Secrets Manager and is accessible only by the runner. It's a read-only credential.
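The interaction between always-rebuild, push, and force-push in the flowchart can be condensed into a single decision function. This is a hedged sketch: `plan_build` and the step labels are hypothetical names used for illustration, and the function only models the decisions, not the real Docker or ECR calls.

```python
from typing import List, Optional


def plan_build(
    image_exists: bool,
    always_rebuild: bool = False,
    push: bool = False,
    force_push: bool = False,
    exists_after_build: Optional[bool] = None,
) -> List[str]:
    """Return the ordered steps the build flowchart would take.

    image_exists is the result of the initial ECR check.
    exists_after_build is the result of the second ECR check made just
    before pushing; it defaults to image_exists.
    """
    steps: List[str] = []
    if image_exists and not always_rebuild:
        return steps  # image is already there and no rebuild was requested

    steps += ["login to docker.io", "build"]
    if not push:
        return steps  # build locally only, never upload

    if exists_after_build is None:
        exists_after_build = image_exists
    # Push when the image is still missing, or unconditionally on force-push.
    if not exists_after_build or force_push:
        steps.append("push")
    return steps
```

For example, `plan_build(True, always_rebuild=True, push=True)` rebuilds the image but skips the upload, because the image already exists on ECR and force-push is not set.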

How PyTorch CI uses calculate-docker-image

  1. As part of the _linux_build and _linux_test workflows. The requested Docker image is guaranteed to be there before these workflows pull it locally. The push parameter is not set here, so they won't push anything to ECR.
  2. As part of the Docker build workflows. These are the dedicated workflows that build all Docker images used by PyTorch CI. They set the always-rebuild and push parameters, so we know for sure that the Docker image will be built, and pushed to ECR if it's not there yet. The always-rebuild parameter is there to ensure that the image is rebuilt periodically in trunk, so that any failures surface early.
  3. As part of the CD workflow to build manywheel images. The same principle applies.

Testing calculate-docker-image changes

  1. Push a non-fork PR to test-infra with the change, e.g. https://github.com/pytorch/test-infra/pull/6013
  2. Create a test PR on PyTorch that uses the branch from step 1 to trigger the workflows there, e.g. https://github.com/pytorch/pytorch/pull/142177/commits/cc45015329fd579a9bbc4e75a5676fb66f17d604