Initial setup of Payu environment #1
…d payu environment
…ctions
- workflows/pull_request.yml: Split up setup and move build base image to separate job
- workflows/get_changed_env.yml: Remove deleted environment from matrix
- Update workflows to source environment variables from install_config.sh
…se launcher script
… base
Removed the modified micromamba, as at this stage we might not need compatibility with nb_conda_kernels
Just some more quick details on how it's been tested. I manually ran all the
Setup command
bash << 'EOF'
set -e
REPO_PATH=/home/189/jb4202/model-release-condaenv
export ADMIN_DIR="/g/data/tm70/jb4202/tmp-conda/admin/conda_containers"
export CONDA_BASE="/g/data/tm70/jb4202/tmp-conda/prerelease"
export APPS_USERS_GROUP="tm70"
export APPS_OWNERS_GROUP="tm70"
source "$REPO_PATH/scripts/install_config.sh"
source "$REPO_PATH/scripts/functions.sh"
mkdir -p "${ADMIN_DIR}" "${JOB_LOG_DIR}" "${BUILD_STAGE_DIR}"
set_admin_perms "${ADMIN_DIR}" "${JOB_LOG_DIR}" "${BUILD_STAGE_DIR}"
echo "${ADMIN_DIR}" "${CONDA_BASE}" "${JOB_LOG_DIR}"
echo "Finished setup!"
EOF
Build command
bash << 'EOF'
set -e
REPO_PATH=/home/189/jb4202/model-release-condaenv
export SCRIPT_DIR="$REPO_PATH/scripts"
export CONDA_ENVIRONMENT="payu"
export ADMIN_DIR="/g/data/tm70/jb4202/tmp-conda/admin/conda_containers"
export CONDA_BASE="/g/data/tm70/jb4202/tmp-conda/prerelease"
export APPS_USERS_GROUP="tm70"
export APPS_OWNERS_GROUP="tm70"
PROJECT="tm70"
STORAGE="gdata/tm70"
source "${SCRIPT_DIR}"/install_config.sh
cd "${JOB_LOG_DIR}"
qsub -N build_"${CONDA_ENVIRONMENT}" -lncpus=1,mem=20GB,walltime=2:00:00,jobfs=50GB,storage="${STORAGE}" \
-v SCRIPT_DIR,CONDA_ENVIRONMENT,ADMIN_DIR,CONDA_BASE,APPS_USERS_GROUP,APPS_OWNERS_GROUP \
-P "${PROJECT}" -q copyq -Wblock=true -Wumask=037 \
"${SCRIPT_DIR}"/build.sh
echo "Finished Build!"
EOF
Test command
bash << 'EOF'
set -e
REPO_PATH=/home/189/jb4202/model-release-condaenv
export SCRIPT_DIR="$REPO_PATH/scripts"
export CONDA_ENVIRONMENT="payu"
export ADMIN_DIR="/g/data/tm70/jb4202/tmp-conda/admin/conda_containers"
export CONDA_BASE="/g/data/tm70/jb4202/tmp-conda/prerelease"
export APPS_USERS_GROUP="tm70"
export APPS_OWNERS_GROUP="tm70"
PROJECT="tm70"
STORAGE="gdata/tm70"
source "${SCRIPT_DIR}"/install_config.sh
cd "${JOB_LOG_DIR}"
qsub -N test_"${CONDA_ENVIRONMENT}" -lncpus=4,mem=20GB,walltime=0:20:00,jobfs=50GB,storage="${STORAGE}" \
-v SCRIPT_DIR,CONDA_ENVIRONMENT,ADMIN_DIR,CONDA_BASE,APPS_USERS_GROUP,APPS_OWNERS_GROUP \
-P "${PROJECT}" -Wblock=true -Wumask=037 \
"${SCRIPT_DIR}"/test.sh
echo "Finished Test!"
EOF
Deploy command
bash << 'EOF'
set -e
REPO_PATH=/home/189/jb4202/model-release-condaenv
export SCRIPT_DIR="$REPO_PATH/scripts"
export CONDA_ENVIRONMENT="payu"
export ADMIN_DIR="/g/data/tm70/jb4202/tmp-conda/admin/conda_containers"
export CONDA_BASE="/g/data/tm70/jb4202/tmp-conda/prerelease"
export APPS_USERS_GROUP="tm70"
export APPS_OWNERS_GROUP="tm70"
source "${SCRIPT_DIR}"/install_config.sh
"${SCRIPT_DIR}"/deploy.sh
echo "Finished Deploy!"
EOF
Once everything was deployed, I tested the modules by manually running the configuration repro tests (instructions here: https://github.com/ACCESS-NRI/model-config-tests/?tab=readme-ov-file#how-to-run-pytests-manually-on-nci), with module load
With the workflows, in
One thing that should be edited if deployed to Gadi, should be the
… matrix
- Fixed matrix to include changed environments whose names are substrings of others (e.g. payu and payu-dev)
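To illustrate the kind of substring problem this commit refers to, here is a minimal sketch of picking environment names out of a list of changed paths by comparing the whole directory component rather than a prefix. The helper name changed_envs and the environments/NAME/... layout read from stdin are assumptions for illustration; this is not necessarily how get_changed_env.yml does it.

```shell
#!/bin/sh
# Hypothetical helper: print the environment names touched by a list of
# changed paths given on stdin (one path per line).
changed_envs() {
    # Split each path on "/", keep only paths of the form environments/NAME/...,
    # and print NAME as a whole path component. Comparing the whole component
    # means "payu" and "payu-dev" are never confused with each other.
    awk -F/ '$1 == "environments" && NF >= 3 { print $2 }' | sort -u
}

# Example input: two environments changed, plus one unrelated script.
printf '%s\n' \
    'environments/payu/config.sh' \
    'environments/payu-dev/build_inner.sh' \
    'scripts/build.sh' | changed_envs
```

With this input both payu and payu-dev are reported, each exactly once, and the change under scripts/ is ignored.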
This is to reduce the number of signoffs required for pull request and deploy jobs, so it's just once per modified environment
I've been testing the CI/CD workflows on a separate test organisation repository (https://github.com/jbcv-test-org/test-repository/actions). This includes:
Some new code changes:
I'm not a fan of the verbosity of this. Can we just
I'd recommend
LGTM. It is a pretty damn complicated set of interconnected scripts etc. Give it a burl and see how it runs ... but of course I still have questions.
scripts/install_config.sh
 export CONDA_TEMP_PATH="${PBS_JOBFS:-${CONDA_TEMP_PATH}}"
 export SCRIPT_DIR="${SCRIPT_DIR:-$PWD}"

-export SCRIPT_SUBDIR="apps/cms_conda_scripts"
+export SCRIPT_SUBDIR="apps/conda_scripts"
Is this a path in the container? If so it would be good to have a comment to that effect. I get quite lost with all the paths etc.
So this script directory sits outside the container, and contains the environment launcher scripts (one for every file on $PATH inside the squashfs environment) that launch a container and run commands inside the containerised environment. I've added some brief documentation to this file in b4d57b9
environments/payu/build_inner.sh
This is not the same as the payu-dev version. Is there a reason for that?
Yes, so payu-dev has a pip install of payu, so the payu entry point scripts (e.g. payu-run, payu-collate) have incorrect python shebang headers (they point to the location on /jobfs/ where the environments were built). Technically payu/build_inner.sh could be changed to be the same as payu-dev/build_inner.sh, as changing the python headers should have no effect.
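As a rough sketch of the kind of shebang fix being discussed (not necessarily what either build_inner.sh actually does), the entry point scripts in a bin directory could be repointed at a chosen interpreter like this. The function name fix_shebangs and both arguments are hypothetical, and the in-place edit assumes GNU sed and an interpreter path without "|" or "&" characters.

```shell
#!/bin/sh
# Hypothetical helper: point the shebang of every Python entry-point script
# in a bin directory at a chosen interpreter path.
fix_shebangs() {
    bin_dir=$1      # directory holding entry points such as payu-run
    new_python=$2   # interpreter path the scripts should use once deployed
    for script in "$bin_dir"/*; do
        # Skip anything that is not a regular file, and skip symlinks so that
        # sed -i does not replace a link with a copy of its target.
        [ -f "$script" ] || continue
        [ -L "$script" ] && continue
        first_line=$(head -n 1 "$script")
        case "$first_line" in
            '#!'*python*)
                # Rewrite only line 1, and only when it already names python.
                sed -i "1s|^#!.*|#!${new_python}|" "$script"
                ;;
        esac
    done
}
```

For a pip-installed payu-run whose first line points at a python under /jobfs/, calling fix_shebangs on its bin directory replaces just that first line. Scripts with other interpreters (for example #!/bin/sh) are left untouched.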
Yeah I agree that it is overly verbose - I was just naming it
To use
- Add general modulepath config overrides to payu environment config file (environments/payu/config.sh)
- Update payu deploy script to use these modulepaths
- Add a MODULE_VERSION to use in the general build scripts (separate from FULLENV, which is the name of the environment in the container and squashfs files)
- Extend common modulefile to support modulefile names $ENVIRONMENT/$VERSION, as well as conda/$ENVIRONMENT-$VERSION
To setup use
I've kept the general modulename configuration (e.g.
I've removed
I think once there's a new version of payu (1.1.6?), it would be great to release it to prerelease for testing. I've re-done some configuration checks with the latest changes to confirm that, at least for an ACCESS-ESM1.5 configuration, the containerised payu reproduces a model run.
I've noticed the deploy rsync command changes to ACLs and permissions on the
In testing on a separate directory, I tried to have similar ACLs and permissions to
The above seems to allow ACLs and permissions to be inherited from pre-existing
It shouldn't be possible to install anything in the conda environment as it's a squashfs file so it should be read only.
Does anyone have a preference between stripping out ACLs and permissions completely and using pre-existing ACLs, manually adding more restrictive ACLs, or breaking up the rsyncs to preserve ACLs and permissions only on directories that are relevant to the conda installs?
This is to avoid ACLs and permissions of these directories being changed
Approve the change to APPS_OWNER over APPS_OWNERS_GROUP, given tm70_ci is probably going to be the point of contact for this stuff.
This PR has some initial work setting up containerized squashfs conda environments for payu, using the work done by Dale for hh5's analysis conda containerised environments (https://github.com/coecms/cms-conda-singularity). As a quick overview this PR:
main payu branch on Github).
What's cool with the cms-conda-singularity scripts is that they already work out of the box with building Python virtual environments on top of the squashfs conda environments. This is useful for the Repro CI tests run using payu in model-config-tests (https://github.com/ACCESS-NRI/model-config-tests/). So I was able to use virtual environments to run the reproducibility tests for an ACCESS-OM2 configuration (tag: release-1deg_jra55_ryf-2.0) and an ACCESS-ESM1.5 configuration (tag: release-historical+concentrations-1.1), using payu and payu-dev as base conda environments, and everything passed.
I've manually run the scripts in the workflows for building, testing and deploying environments (latest installs are in /g/data/tm70/jb4202/tmp-conda/). I am holding off running any CI deployment to Gadi workflows until installation paths and variables are finalised.
Notes:
Base installation paths need to be in /g/data/:
Initially, I was running into errors when manually running the build scripts, with directories not existing and the squashfs image not being set up correctly. The reason was that I was using /scratch directories as base directories (e.g. CONDA_BASE), rather than /g/data/. The build scripts assume that the base directories where the environments will eventually be deployed start with /g.
Pip installed packages:
Existing payu development environments install payu from the main branch. Pip-installed packages had incorrect shebang headers pointing to a directory on /jobfs/ where the environment was initially built. There is already an issue for this: "Issue with deployment of pip installed python packages with command line tools" (ACCESS-Analysis-Conda#78). I used Romain's solution here: https://github.com/ACCESS-NRI/MED-condaenv/blob/2c0f730b54cfa6a19b6df4300f8dd27cf3b877d0/environments/esmvaltool/build_inner.sh#L9
Payu PBS qsub calls:
Payu submits jobs similar to qsub -- path/to/env/python path/to/env/payu-run (when running the command payu run). This path/to/env/python would point to a Python executable only accessible inside the container. Each of the environment commands in the container has a corresponding script outside the container (a symlink to launcher.sh) that launches the container and then runs the command inside it. When testing the conda_concept/analysis modules in /g/data/hh5/public/modules/, I noticed that running the launcher python script with a payu command gives a sys.executable that points back to the launcher python script. So running /g/data/hh5/public/apps/cms_conda_scripts/analysis3-24.04.d/bin/python /g/data/hh5/public/apps/cms_conda/envs/analysis3-24.04/bin/payu run would pass the launcher python script along to subsequent payu qsub submits. So for a somewhat hacky fix, I modified the Python shebang for the payu command to use the outside Python launcher script. (Why does sys.executable point to the Python launcher script? I think because launcher.sh preserves the original argv[0] by using exec -a, e.g. exec -a /path/to/outside/python /path/to/inner-env/python /path/to/inner-env/payu-run)
An alternative solution to the above would be to modify the payu source code to add the launcher script to the qsub commands. E.g.
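A minimal sketch of that idea, written in shell to match the scripts in this repository rather than as a patch to payu's Python source. The function name build_run_command is hypothetical. It only shows how the command that follows qsub -- could be prefixed with LAUNCHER_SCRIPT when that variable is defined, and left unchanged when it is not.

```shell
#!/bin/sh
# Hypothetical sketch: build_run_command is an illustrative name, not an
# existing payu or PBS feature.
build_run_command() {
    python_exe=$1    # interpreter path, e.g. path/to/env/python
    payu_script=$2   # entry point to run, e.g. path/to/env/payu-run
    if [ -n "${LAUNCHER_SCRIPT:-}" ]; then
        # A launcher is defined: start the container first, then run the
        # interpreter and entry point inside it.
        printf '%s %s %s\n' "$LAUNCHER_SCRIPT" "$python_exe" "$payu_script"
    else
        # No launcher defined: the command is the same as without a container.
        printf '%s %s\n' "$python_exe" "$payu_script"
    fi
}
```

The output of build_run_command would then take the place of the path/to/env/python path/to/env/payu-run part of the qsub call.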
This approach hard-codes a custom environment variable into payu, though it might make it easier for others to run payu inside a container, as they would only need the LAUNCHER_SCRIPT environment variable to be defined. However, I am not sure how to guarantee this variable points to the correct script, i.e. the one that launches the container holding the payu environment.
After chatting with Aidan, another solution would be if (when) Payu ends up using HPCPY (https://github.com/ACCESS-NRI/hpcpy) and payu had a templated script that runs qsub calls. The build scripts in this repository could modify that template to add in the launcher script. There are also existing override command scripts in this repository, so there probably is another solution to this problem. In the meantime, while I am testing, I'm using the modified shebang header for payu commands as it doesn't require changes to payu.
Github Environment Variables:
@aidanheerdegen suggested moving the project-specific installation paths to Github, where they can be set via Github Environment Variables. This is so paths can be changed without modifying the source code. Initially, I moved just ADMIN_DIR (base directory for logs and staging environment tar files) and CONDA_BASE (base directory which will contain the apps/ and modules/ subdirectories). As the paths may also affect other configuration settings, e.g. the project and storage flags passed to build qsub calls, and the groups used for configuring file permissions of admin and deployed directories (APPS_USERS_GROUP and APPS_OWNERS_GROUP), I moved those to Github Variables as well.
Proposed Github Variable settings for Gadi environment:
- CONDA_BASE: /g/data/vk83/prerelease (the directory that contains apps/ and modules/ subdirectories)
- ADMIN_DIR: /g/data/vk83/admin/conda_containers/prerelease (directory to store staging and log files, tar files of conda environments, and backups of old environment squashfs files)
- APPS_USERS_GROUP: vk83 (Permissions of read and execute for files installed to apps and modules)
- APPS_OWNERS_GROUP: vk83_w ? (Read/write/executable permissions for installed files)
- PROJECT: tm70 (Build and test PBS jobs project)
- STORAGE: gdata/vk83 (Build and test PBS jobs storage directives)
- secrets.REPO_PATH: ? (This is the path where all this repository is rsynced to and all the scripts are run from)
The above settings, install_config.sh settings, and the current conda environments would add the following to /g/data/vk83/prerelease/:
So loading the modules would be
I've named the micromamba install directory base_conda and module name container_container so it does not clash with existing conda/ directories in vk83.
Issues: (TODO: split off into separate Github Issues)
payu-dev environment #2)
environment/config.sh that removed "openssh-clients", "openssh-server" and "openssh" from the environment, and include an outside "ssh" command. The cms documentation for the conda environments (https://climate-cms.org/cms-wiki/resources/resources-conda-setup.html#technical-details) has "As a part of the installation process, the openssh packages are removed from the conda installation, which forces use of the system ssh and, more importantly, its configuration." So I am wondering if I will accidentally break something by removing those.
Setup, Build and Test jobs. As the settings for the Gadi environment require reviewers, this will require many signoffs in a Pull Request. This is fine for the testing stage, as we can run through the logs and manually check things between each step, but it might be unnecessary later on. Could we move the jobs into one job so it only requires one sign off to deploy to Gadi?