Slurm job enters BadConstraints after spot node is preempted #5731

Open · JosephDVL opened this issue Oct 2, 2023 · 5 comments

Required Info:

  • AWS ParallelCluster version: 3.7.1
  • Full cluster configuration without any credentials or personal data:
config.yaml

Region: us-east-1
Image:
  Os: alinux2
SharedStorage:
  - Name: custom1
    StorageType: Ebs
    MountDir: shared
    EbsSettings:
      Size: $ebs_volume_size
HeadNode:
  InstanceType: t3.large
  Networking:
    SubnetId: $master_subnet_id
    ElasticIp: false
  Ssh:
    KeyName:
    AllowedIps: 
  LocalStorage:
    RootVolume:
      Size: 40
  CustomActions:
    OnNodeConfigured:
      Script: s3://.../cluster_init3.sh
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 20
  SlurmQueues:
  - Name: queue1
    ComputeResources:
    - Name: default-resource
      SpotPrice: $spot_price
      MaxCount: 200
      Instances:
        - InstanceType: c7i.large
        - InstanceType: c6i.large
        - InstanceType: c5.large
    AllocationStrategy: capacity-optimized
    ComputeSettings:
      LocalStorage:
        RootVolume:
          Size: 40
    CapacityType: SPOT
    CustomActions:
      OnNodeConfigured:
        Script: s3://.../cluster_init3.sh
    Networking:
      SubnetIds:
        - $compute_subnet_id
Monitoring:
  Dashboards:
    CloudWatch:
      Enabled: False

  • Cluster name:
  • Output of pcluster describe-cluster command.
describe-cluster

{
  "creationTime": "2023-09-29T16:15:44.319Z",
  "headNode": {
    "launchTime": "2023-09-29T16:20:20.000Z",
    "instanceId": "i-0f3fde39...",
    "publicIpAddress": "52.23....",
    "instanceType": "t3.large",
    "state": "running",
    "privateIpAddress": "172.31...."
  },
  "version": "3.7.1",
  "clusterConfiguration": {
    "url": "https://parallelcluster-...-v1-do-not-delete.s3.amazonaws.com/parallelcluster/3.7.1/clusters/..."
  },
  "tags": [
    {
      "value": "3.7.1",
      "key": "parallelcluster:version"
    },
    {
      "value": "multi-01",
      "key": "parallelcluster:cluster-name"
    }
  ],
  "cloudFormationStackStatus": "CREATE_COMPLETE",
  "clusterName": "multi-01",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "arn:aws:cloudformation:us-east-1:...:stack/multi-01/",
  "lastUpdatedTime": "2023-09-29T16:15:44.319Z",
  "region": "us-east-1",
  "clusterStatus": "CREATE_COMPLETE",
  "scheduler": {
    "type": "slurm"
  }
}

  • [Optional] Arn of the cluster CloudFormation main stack:

Bug description and how to reproduce:
When running a cluster with SPOT instances, Slurm jobs can enter a BadConstraints state after a node preemption. This behavior has been observed in both ParallelCluster 2.11.x and 3.7.1.

We are submitting embarrassingly parallel jobs in an automated fashion, such that every job has the same submission process and requirements. Occasionally, a Slurm job will enter a BadConstraints state after a node is preempted. For example:

# squeue | egrep 24[04]
               240    queue1 both_inn ec2-user PD       0:00      1 (BadConstraints)
               244    queue1 both_inn ec2-user  R 1-23:48:56      1 queue1-dy-default-resource-73

scontrol doesn't show much difference between the stuck job (240) and a still-running job (244):

# scontrol show jobid=240
JobId=240 JobName=both_inner_loop
   UserId=ec2-user(1000) GroupId=ec2-user(1000) MCS_label=N/A
   Priority=0 Nice=0 Account=(null) QOS=(null)
   JobState=PENDING Reason=BadConstraints FailedNode=queue1-dy-default-resource-142 Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2023-09-29T21:26:18 EligibleTime=2023-09-29T21:26:18
   AccrueTime=2023-09-29T21:26:18
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-10-02T19:38:53 Scheduler=Main
   Partition=queue1 AllocNode:Sid=ip-172-31-17-218:14496
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=queue1-dy-default-resource-142
   BatchHost=queue1-dy-default-resource-142
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=3891M,node=1,billing=1
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/shared/
   WorkDir=/shared/
   StdErr=/shared/
   StdIn=/dev/null
   StdOut=/shared/
   Power=

# scontrol show jobid=244
JobId=244 JobName=both_inner_loop
   UserId=ec2-user(1000) GroupId=ec2-user(1000) MCS_label=N/A
   Priority=4294901516 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=1-23:49:15 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2023-09-29T21:26:33 EligibleTime=2023-09-29T21:26:33
   AccrueTime=2023-09-29T21:26:33
   StartTime=2023-09-30T21:40:16 EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-09-30T21:40:16 Scheduler=Main
   Partition=queue1 AllocNode:Sid=ip-172-31-17-218:14496
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=queue1-dy-default-resource-73
   BatchHost=queue1-dy-default-resource-73
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=3891M,node=1,billing=1
   AllocTRES=cpu=2,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/shared/
   WorkDir=/shared/
   StdErr=/shared/
   StdIn=/dev/null
   StdOut=/shared/
   Power=

Looking at /var/log/slurmctld.log:

# egrep 24[04] slurmctld.log
[2023-09-29T21:26:18.710] _slurm_rpc_submit_batch_job: JobId=240 InitPrio=4294901520 usec=698
[2023-09-29T21:26:33.724] _slurm_rpc_submit_batch_job: JobId=244 InitPrio=4294901516 usec=581
[2023-09-30T20:50:10.734] sched: Allocate JobId=240 NodeList=queue1-dy-default-resource-142 #CPUs=2 Partition=queue1
[2023-09-30T21:40:16.250] sched: Allocate JobId=244 NodeList=queue1-dy-default-resource-73 #CPUs=2 Partition=queue1
[2023-10-01T01:36:13.873] requeue job JobId=240 due to failure of node queue1-dy-default-resource-142
[2023-10-01T01:40:06.870] cleanup_completing: JobId=240 completion process took 218 seconds
[2023-10-01T01:40:51.948] _pick_best_nodes: JobId=240 never runnable in partition queue1
[2023-10-01T01:40:51.948] sched: schedule: JobId=240 non-runnable: Requested node configuration is not available
[2023-10-01T07:23:53.189] _pick_best_nodes: JobId=240 never runnable in partition queue1
[2023-10-01T07:23:53.189] sched: schedule: JobId=240 non-runnable: Requested node configuration is not available

The last message repeats on every scheduling cycle, so the job never runs.

The corresponding messages in /var/log/parallelcluster/clustermgtd:

2023-10-01 01:36:13,811 - [slurm_plugin.slurm_resources:is_backing_instance_valid] - WARNING - Node state check: no corresponding instance in EC2 for node queue1-dy-default-resource-142(172.31.36.119), node state: ALLOCATED+CLOUD
2023-10-01 01:36:13,814 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Found the following unhealthy dynamic nodes: (x1) ['queue1-dy-default-resource-142(172.31.36.119)']

The job was clearly configured in a way that allowed it to run, since it started once. However, after the preemption, I can't find a way to clear the BadConstraints state so the scheduler will consider the job again. The only fix we've been able to get to work is to scancel the job and resubmit a similar one. Meanwhile, newly submitted jobs get nodes provisioned and run successfully.
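For context, the submission pattern is roughly equivalent to the sketch below; the script path, loop count, and exact sbatch flags are illustrative rather than our actual tooling. The relevant detail is that each job requests a single task/CPU and does not pass -N/--nodes:

#!/bin/bash
# Illustrative sketch only: automated submission of identical,
# embarrassingly parallel single-task jobs (script path is hypothetical).
# Note that no -N/--nodes flag is passed, which turns out to be relevant
# to the workaround discussed in the comments below.
for i in $(seq 1 50); do
    sbatch --partition=queue1 \
           --job-name=both_inner_loop \
           --ntasks=1 \
           --cpus-per-task=1 \
           /shared/run_inner_loop.sh "$i"
done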

JosephDVL added the 3.x label on Oct 2, 2023
@JosephDVL (Author) commented:

I've found an old Slurm mailing list post (https://groups.google.com/g/slurm-users/c/kshbXbqpEIY/m/nJcTyVQiIAAJ) that seems to address the issue, and the fix from the e-mail does work for this case:

scontrol update jobid=240 NumNodes=1-1

Per the e-mail, if the job script includes “-N 1” everything works correctly.
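For anyone hitting this repeatedly, a minimal sketch of automating that workaround might look like the following; it assumes every affected job is a single-node job, as ours are:

#!/bin/bash
# Sketch: find jobs pending with reason BadConstraints and apply the
# NumNodes=1-1 workaround from the mailing list post above.
# Assumes all affected jobs are single-node jobs.
squeue --noheader --states=PENDING --format="%i %r" |
while read -r jobid reason; do
    if [ "$reason" = "BadConstraints" ]; then
        echo "Resetting node count on job $jobid"
        scontrol update jobid="$jobid" NumNodes=1-1
    fi
done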

@gwolski commented Dec 16, 2024

I'm seeing the same issue using ParallelCluster 3.9.1 and 3.11.1.

$ squeue -u <user_redacted>
CLUSTER          PARTITION       JOBID            STATE       USER     NAME                 TIME NODES CPUS MIN_MEMORY FEATURES        DEPENDENCY LICENSES NODELIST(REASON)                   
tsi4             sp-16-gb        104266           PENDING     <user_redacted> tvrun                0:00     1    1          0 (null)          (null)     (null)   (BadConstraints)                   
tsi4             sp-16-gb        99638            PENDING     <user_redacted> tvrun                0:00     1    1          0 (null)          (null)     (null)   (BadConstraints)                   
tsi4             sp-16-gb        99636            PENDING     <user_redacted> tvrun                0:00     1    1          0 (null)          (null)     (null)   (BadConstraints)                   
tsi4             sp-16-gb        99629            PENDING     <user_redacted> tvrun                0:00     1    1          0 (null)          (null)     (null)   (BadConstraints) 
$ scontrol show job 104266
JobId=104266 JobName=tvrun
   UserId=<user_redacted>(XXXXX) GroupId=users(1000) MCS_label=N/A
   Priority=0 Nice=0 Account=(null) QOS=(null)
   JobState=PENDING Reason=BadConstraints Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2024-12-13T18:10:27 EligibleTime=2024-12-13T18:10:27
   AccrueTime=2024-12-13T18:10:27
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-12-16T14:31:57 Scheduler=Main
   Partition=sp-16-gb AllocNode:Sid=wssim0:3365371
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=sp-r7i-l-dy-sp-16-gb-1-cores-8
   BatchHost=sp-r7i-l-dy-sp-16-gb-1-cores-8
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=15564M,node=1,billing=1
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=<directory_redacted>/sim/tvrun_sbatch.sh
   WorkDir=<directory_redacted>/sim
   StdErr=<directory_redacted>/tvrun-104266.err
   StdIn=/dev/null
   StdOut=<directory_redacted>/tvrun-104266.out
   Power=
   TresPerTask=cpu:1
$

In /var/log/slurmctld.log.1:

[2024-12-13T18:10:27.911] _slurm_rpc_submit_batch_job: JobId=104266 InitPrio=4294835991 usec=641
[2024-12-13T18:10:28.000] sched: Allocate JobId=104266 NodeList=sp-r7i-l-dy-sp-16-gb-1-cores-8 #CPUs=1 Partition=sp-16-gb
[2024-12-13T18:10:30.464] sched: _slurm_rpc_allocate_resources JobId=104267 NodeList=od-r7a-m-dy-od-8-gb-1-cores-3 usec=575
[2024-12-13T18:10:40.013] agent/is_node_resp: node:sp-r7i-l-dy-sp-16-gb-1-cores-8 RPC:REQUEST_BATCH_JOB_LAUNCH : Communication connection failure
[2024-12-13T18:10:40.013] agent/is_node_resp: node:sp-r7i-l-dy-sp-16-gb-1-cores-8 RPC:REQUEST_LAUNCH_PROLOG : Communication connection failure
[2024-12-13T18:10:40.284] _job_complete: JobId=104266 WEXITSTATUS 0
[2024-12-13T18:10:40.284] _job_complete: requeue JobId=104266 per user/system request
[2024-12-13T18:10:40.284] _job_complete: JobId=104266 done
[2024-12-13T18:10:44.681] update_node: node sp-r7i-l-dy-sp-16-gb-1-cores-8 reason set to: Scheduler health check failed
[2024-12-13T18:10:44.681] powering down node sp-r7i-l-dy-sp-16-gb-1-cores-8
[2024-12-13T18:10:44.681] update_node: node sp-r7i-l-dy-sp-16-gb-1-cores-9 reason set to: Scheduler health check failed
[2024-12-13T18:10:44.681] powering down node sp-r7i-l-dy-sp-16-gb-1-cores-9
[2024-12-13T18:10:48.000] _pick_best_nodes: JobId=104266 never runnable in partition sp-16-gb
[2024-12-13T18:10:48.000] sched: schedule: JobId=104266 non-runnable: Requested node configuration is not available
[2024-12-13T18:10:48.221] _job_complete: JobId=104262 WEXITSTATUS 0
[2024-12-13T18:10:48.221] _job_complete: JobId=104262 done
[2024-12-13T18:10:56.066] _job_complete: JobId=104264 WEXITSTATUS 0
[2024-12-13T18:10:56.066] _job_complete: JobId=104264 done
[2024-12-13T18:11:02.001] POWER: power_save: suspending nodes sp-r7i-l-dy-sp-16-gb-1-cores-[8-9]

Sometime later, the node is spun up again and works just fine, but the original job stays stuck:

$ egrep "104266|sp-r7i-l-dy-sp-16-gb-1-cores-8" slurmctld.log.1
[2024-12-13T18:10:27.911] _slurm_rpc_submit_batch_job: JobId=104266 InitPrio=4294835991 usec=641
[2024-12-13T18:10:28.000] sched: Allocate JobId=104266 NodeList=sp-r7i-l-dy-sp-16-gb-1-cores-8 #CPUs=1 Partition=sp-16-gb
[2024-12-13T18:10:40.013] agent/is_node_resp: node:sp-r7i-l-dy-sp-16-gb-1-cores-8 RPC:REQUEST_BATCH_JOB_LAUNCH : Communication connection failure
[2024-12-13T18:10:40.013] agent/is_node_resp: node:sp-r7i-l-dy-sp-16-gb-1-cores-8 RPC:REQUEST_LAUNCH_PROLOG : Communication connection failure
[2024-12-13T18:10:40.284] _job_complete: JobId=104266 WEXITSTATUS 0
[2024-12-13T18:10:40.284] _job_complete: requeue JobId=104266 per user/system request
[2024-12-13T18:10:40.284] _job_complete: JobId=104266 done
[2024-12-13T18:10:44.681] update_node: node sp-r7i-l-dy-sp-16-gb-1-cores-8 reason set to: Scheduler health check failed
[2024-12-13T18:10:44.681] powering down node sp-r7i-l-dy-sp-16-gb-1-cores-8
[2024-12-13T18:10:48.000] _pick_best_nodes: JobId=104266 never runnable in partition sp-16-gb
[2024-12-13T18:10:48.000] sched: schedule: JobId=104266 non-runnable: Requested node configuration is not available
[2024-12-13T18:12:38.001] _pick_best_nodes: JobId=104266 never runnable in partition sp-16-gb
[2024-12-13T18:12:38.001] sched: schedule: JobId=104266 non-runnable: Requested node configuration is not available
[2024-12-13T18:13:38.001] _pick_best_nodes: JobId=104266 never runnable in partition sp-16-gb
[2024-12-13T18:13:38.001] sched: schedule: JobId=104266 non-runnable: Requested node configuration is not available
[2024-12-13T18:14:38.001] _pick_best_nodes: JobId=104266 never runnable in partition sp-16-gb
[2024-12-13T18:14:38.001] sched: schedule: JobId=104266 non-runnable: Requested node configuration is not available
[2024-12-13T18:16:39.001] _pick_best_nodes: JobId=104266 never runnable in partition sp-16-gb
[2024-12-13T18:16:39.001] sched: schedule: JobId=104266 non-runnable: Requested node configuration is not available
[2024-12-13T18:17:38.001] _pick_best_nodes: JobId=104266 never runnable in partition sp-16-gb
[2024-12-13T18:17:38.001] sched: schedule: JobId=104266 non-runnable: Requested node configuration is not available
...
These messages just repeat indefinitely.

I will try --nodes=1, but we shouldn't have to do this. I never had this problem with earlier versions of ParallelCluster, e.g. 3.2.

@gwolski commented Jan 1, 2025

Adding --nodes=1 to my submission script did not fix this. Here is the script that gets handed to sbatch:
#!/bin/bash
#SBATCH --exclusive
#SBATCH --job-name=spotrestart3
#SBATCH --cpus-per-task=1
#SBATCH --partition=sp-16-gb-1-cores
#SBATCH --output=spotrestart3-%j.out
#SBATCH --error=spotrestart3-%j.err
#SBATCH --nodes=1
echo running on $(hostname)
/usr/bin/time sleep 36000
sleep 2

I can reproduce the failure. Will file a new issue with the details.
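In case it helps anyone else test, one way to approximate a preemption (not necessarily how the new issue will reproduce it) is to terminate the EC2 instance backing a busy compute node from the head node. A rough sketch, assuming the AWS CLI and permission to terminate instances are available there; the node name is just an example:

#!/bin/bash
# Rough sketch: kill the instance behind a running compute node so Slurm
# requeues the job, as in the logs above. This only approximates a spot
# interruption (no two-minute notice is delivered).
NODE=sp-r7i-l-dy-sp-16-gb-1-cores-8   # example; use the node your test job is on

# ParallelCluster sets NodeAddr to the node's private IP once it is up.
IP=$(scontrol show node "$NODE" | tr ' ' '\n' | awk -F= '$1 == "NodeAddr" {print $2}')

# Look up the backing instance by private IP, then terminate it.
INSTANCE_ID=$(aws ec2 describe-instances \
  --filters "Name=private-ip-address,Values=$IP" "Name=instance-state-name,Values=running" \
  --query 'Reservations[0].Instances[0].InstanceId' --output text)
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"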

@gwolski commented Jan 17, 2025

@JosephDVL you might want to review issue #6641 (comment), which I filed for my situation where your workaround didn't work. @hanwen-cluster of the ParallelCluster team has reproduced the issue. Do you use the EnableMemoryBasedScheduling: true option?

@JosephDVL (Author) commented:

@gwolski Thanks for the ping. We don't use EnableMemoryBasedScheduling: true, but I'll monitor the issue you linked.
