Slurm job enters BadConstraints after spot node is preempted #5731
Comments
I've found a reference to an old Slurm mailing list post (https://groups.google.com/g/slurm-users/c/kshbXbqpEIY/m/nJcTyVQiIAAJ) which seems to address the issue. The fix in the e-mail does work for this case:
Per the e-mail, if the job script includes "-N 1", everything works correctly.
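For illustration, a minimal sbatch sketch of that workaround; the partition name and payload command are hypothetical:

```bash
#!/bin/bash
# Pin the job to exactly one node, per the mailing-list workaround.
#SBATCH -N 1
#SBATCH --partition=spot-queue   # hypothetical spot partition name

srun ./my_task                   # hypothetical payload command
```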
I'm seeing the same issue using ParallelCluster 3.9.1 and 3.11.1.
In /var/log/slurmctld.log.1:
and then sometime later on, the node is spun up and works just fine, but the original job is stuck:
Adding --nodes=1 to my submission script did not fix this. Here is my script that gets handed to sbatch:

I can reproduce the failure. Will file a new issue with the details.
@JosephDVL you might want to review issue #6641 (comment) that I filed for my situation where your workaround didn't work. @hanwen-cluster of the parallelcluster team has reproduced the issue. Do you use the
@gwolski Thanks for the ping. We don't use
Required Info:
- config.yaml
- Output of the pcluster describe-cluster command
Bug description and how to reproduce:
When running a cluster with SPOT instances, Slurm jobs can enter a BadConstraints state after a node preemption. This behavior has been observed in both PCluster 2.11.x and 3.7.1.
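For context, a minimal sketch of the relevant portion of a ParallelCluster 3.x config with a spot queue; the queue name, compute resource name, instance type, and counts are hypothetical:

```yaml
# Minimal sketch of a ParallelCluster 3.x queue using spot capacity.
# Queue name, compute resource name, and instance type are hypothetical.
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: spot-queue
      CapacityType: SPOT
      ComputeResources:
        - Name: compute
          InstanceType: c5.xlarge
          MinCount: 0
          MaxCount: 10
```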
We are submitting embarrassingly parallel jobs in an automated fashion such that every job has the same submission process and requirements. Occasionally, a Slurm job will enter a BadConstraints state after a node is preempted. For example:
scontrol doesn't show much difference:
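For reference, the kind of inspection being described here, as it might be run on the head node; the job ID is hypothetical:

```bash
# Compare the stuck job's record against a healthy job's record.
scontrol show job 12345          # hypothetical job ID
# Show the job's state and its scheduling reason (e.g. BadConstraints).
squeue -j 12345 -o "%i %T %r"
```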
Looking at /var/log/slurmctld.log:
The last message repeats each time the scheduler runs, so the job never starts.
Corresponding messages in /var/log/parallelcluster/clustermgtd:
The job was configured in a way that allowed it to start once. However, after a preemption, I can't find a way to clear the BadConstraints state from the job so the scheduler will run it again. The only fix we've been able to get to work is to scancel the job and resubmit a similar one. Meanwhile, newly scheduled jobs are able to get nodes provisioned and run successfully.
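For completeness, the workaround described above as shell commands; the job ID and script name are hypothetical:

```bash
# The only fix we found: cancel the stuck job and resubmit a similar one.
scancel 12345                    # hypothetical job ID of the stuck job
sbatch job.sh                    # hypothetical resubmission script
```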