This a container to run SLEAP jobs on the University of Washington Hyak cluster. You can use this container to run SLEAP training and prediction jobs on Hyak in a GPU-accelerated environment. s, which provide reproducible environments that can run anywhere and be shared with other researchers.
Before running this container, you'll need the following:
- A Linux, macOS, or Windows machine
- An SSH client (usually included with Linux and macOS, and available for Windows through the built-in SSH client on Windows 10+, WSL2 or Cygwin).
- Hyak Klone access with compute resources
Follow the instructions below to set up your machine correctly:
If you are using Linux, OpenSSH is probably installed already -- if not, you can install it via apt-get install openssh-client
on Debian/Ubuntu or yum install openssh-clients
on RHEL/CentOS/Rocky/Fedora. To open a terminal window, search for "Terminal" in your desktop environment's application launcher.
If you're on macOS, OpenSSH will already be installed. To open a terminal window, open /Applications/Utilities/Terminal.app
or search for "Terminal" in Launchpad or Spotlight.
On Windows 10+, you can use the built-in SSH client. You may also install a SSH client through WSL2 or Cygwin (not recommended, needs additional setup). See the links for instructions on how to install these. You can start a terminal window by searching for "Terminal" in the Start menu.
Before you are allowed to connect to a compute node where your SLEAP job will be running, you must add your SSH public key to the authorized keys on the login node of the Hyak Klone cluster.
If you don't, you will receive an error like this when you try to connect to the compute node:
Permission denied (publickey,gssapi-keyex,gssapi-with-mic)
To set this up quickly on Linux, macOS, or Windows (WSL2/Cygwin), open a new terminal window on your machine and enter the following 2 commands before you try again. Replace your-uw-netid
with your UW NetID:
[ ! -r ~/.ssh/id_rsa ] && ssh-keygen -t rsa -b 4096 -N '' -C "your-uw-netid@uw.edu" -f ~/.ssh/id_rsa
ssh-copy-id -o StrictHostKeyChecking=no -i ~/.ssh/id_rsa your-uw-netid@klone.hyak.uw.edu
See the Hyak documentation for more information.
Apptainer containers can take up several gigabytes of space each. By default, Apptainer will store cached containers in your home directory (~
), under ~/.cache/apptainer
. However, because home directory space on Hyak is limited to 10 GiB per user, you may want to set up a different cache directory.
We advise setting up a cache directory under the /tmp
directory or in the scrubbed directory, under /gscratch/scrubbed/your-uw-netid
. To set this up, first connect to klone.hyak.uw.edu
via SSH:
ssh your-uw-netid@klone.hyak.uw.edu # Replace your-uw-netid with your UW NetID
Once you're logged in, create a directory for the cache and set the APPTAINER_CACHEDIR
environment variable to point to it:
mkdir -p "/gscratch/scrubbed/$USER/apptainer-cache" && export APPTAINER_CACHEDIR="/gscratch/scrubbed/$USER/apptainer-cache"
Finally, add the following line to your ~/.bashrc
file (or ~/.zshrc
if you use ZSH) to retain this setting across multiple logins:
echo "export APPTAINER_CACHEDIR=\"/gscratch/scrubbed/$USER/apptainer-cache\"" >> ~/.bashrc
This guide assumes that you are running SLEAP on your own machine, with an open SLEAP project that you are ready to start training on. If you need help creating a SLEAP project, consult the SLEAP documentation.
To start training your model on the cluster, you must first create a training package:
A self-contained training job package contains a .slp file with labeled data and images which will be used for training, as well as .json training configuration file(s). *
You can create a training job package in the sleap-label
GUI by following the Run Training...
option under the Predict
menu:
Set the parameters for your training job (refer to SLEAP documentation if you're not sure), and click Export training job package
once you're done:
Next, you should see a dialog that says, Created training job package.
Click Show Details...
:
The full file path to the training package will be displayed (e.g., /home/me/sleap/my_training_job.zip
). Select and copy this path:
Now you must use the terminal on your computer to upload the training package to the Hyak cluster. You can find instructions on how to set up your terminal to access Hyak here.
Open a terminal window on your computer and enter the following command to copy the training package to your home directory (~
) on the cluster:
scp /home/me/sleap/my_training_job.zip your-uw-netid@klone.hyak.uw.edu: # Replace your-uw-netid with your UW NetID
NOTE: You may need to log in with your UW NetID and two-factor authentication.
Once the file has been copied, log in to the cluster via SSH:
ssh your-uw-netid@klone.hyak.uw.edu # Replace your-uw-netid with your UW NetID
The training package should be located in your home directory on klone
. You can check by running ls
:
ls *.zip # Should display all ZIP files in directory, including `my_training_job.zip`
Unzip the package file to a new directory. Let's call it training_job
:
unzip my_training_job.zip -d training_job
We are almost ready to launch the container. First, though, we need to allocate a job on the cluster. We will use the salloc
command to do this.
The following command will allocate a job on one node with 4 GPUs, 64 GB of memory, and 8 CPUs for 24 hours on the gpu-a40
partition available to the escience
account. You can adjust these parameters as needed. For more information on the salloc
command, see this page and the salloc documentation.
salloc --job-name sleap-train-test \
--account escience \
--partition gpu-a40 \
--gpus 4 \
--ntasks 1 \
--gpus-per-task=4 \
--mem 64G \
--cpus-per-task 4 \
--time 24:00:00
When the allocation is ready, you will automatically connect to the compute node. When you exit this session, the allocation will automatically be released.
Once you are connected to the node, you can verify that the SLEAP container has access to the GPUs by running the following command:
apptainer run --nv --bind /gscratch oras://ghcr.io/maouw/sleap-container:latest python -c "import sleap; sleap.system_summary()"
You should get output that looks something like this:
GPUs: 4/4 available
Device: /physical_device:GPU:0
Available: True
Initalized: False
Memory growth: None
Device: /physical_device:GPU:1
Available: True
Initalized: False
Memory growth: None
Device: /physical_device:GPU:2
Available: True
Initalized: False
Memory growth: None
Device: /physical_device:GPU:3
Available: True
Initalized: False
Memory growth: None
6.40s user 6.24s system 88% cpu 14.277s total
Now, navigate to the directory where you unzipped the training package:
cd ~/training_job
The next step is to launch the container:
apptainer run --nv --bind /gscratch oras://ghcr.io/maouw/sleap-container:latest bash train-script.sh
Apptainer will download the container image from GitHub and launch it on the node. The option --nv
enables Nvidia GPU support. Once the container has launched, it will instruct bash
to run the script train-script.sh
. This script will start the training job.
During training, you will see a lot of output in the terminal. After some time, if training is successful, the last of the output should look something similar to this:
INFO:sleap.nn.evals:Saved predictions: models/231009_165437.centered_instance/labels_pr.train.slp
INFO:sleap.nn.evals:Saved metrics: models/231009_165437.centered_instance/metrics.train.npz
INFO:sleap.nn.evals:OKS mAP: 0.205979
Predicting... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% ETA: 0:00:00 3.3 FPS
INFO:sleap.nn.evals:Saved predictions: models/231009_165437.centered_instance/labels_pr.val.slp
INFO:sleap.nn.evals:Saved metrics: models/231009_165437.centered_instance/metrics.val.npz
INFO:sleap.nn.evals:OKS mAP: 0.064026
229.63s user 44.64s system 77% cpu 5:53.45s total
Once training finishes, you'll see a new directory (or two new directories for top-down training pipeline) containing all the model files SLEAP needs to use for inference:
ls models/
231009_165437.centered_instance 231009_165437.centroid
You can use these model files to run inference on your own computer, or you can run inference on the cluster (consult the SLEAP documentation for more information).
To copy the model files back to your computer, in a terminal where you are logged into klone.hyak.uw.edu
, compress the model directory with zip
:
cd ~/training_job
zip -r trained_models.zip models
Then, in a new terminal window on your own computer, use the scp
command to copy the model files from klone
to your computer:
scp your-uw-netid@klone.hyak.uw.edu:~/training_job/trained_models.zip . # Replace your-uw-netid with your UW NetID
This will copy the file trained_models.zip
to your current directory. You can then unzip the file and use the model files for inference on your own computer. Consult the SLEAP documentation for more information on running inference with a trained model.
Be sure to end your cluster job when you are done! This will free up resources for other users and potentially prevent you from being charged for time you are not using.
To do this, go back to the terminal where you were running SLEAP on the cluster. (If you closed the terminal, you can log back in to the cluster with ssh klone.hyak.uw.edu
.)
If you're still logged in to the compute node, exit:
exit
Cancel the job allocation with the scancel
command:
scancel --me --jobname sleap-train-test
Finally, exit the cluster:
exit
SLEAP well!