Manage Slurm#

Slurm manages workloads that require intensive computation or data processing with defined start and end times. Slurm uses a controller, typically installed on an administrative node or virtual machine separate from your ClusterWareAI™ head node, to monitor cluster resources and assign work to compute nodes. Compute nodes are added to Slurm either dynamically (default) or statically.

See Configure Job Schedulers to install and configure the optional Slurm package on a controller node to integrate Slurm with the ClusterWareAI software.

Monitor Slurm Status#

You can view the Slurm status on the controller by running:

slurm-cw.setup status

Start and stop the Slurm service cluster-wide by running:

slurm-cw.setup cluster-stop
slurm-cw.setup cluster-start

Add Slurm Nodes#

Dynamic Slurm nodes are automatically added after rebooting with the appropriate boot configuration and Slurm image (diskless nodes) or after Slurm is installed (diskful nodes).

For static Slurm nodes, you can add nodes in either of two ways:

- Run the following command:
  ```
  slurm-cw.setup update-nodes <nodes>
  ```
  Where <nodes> is replaced by one of:
  - All “up” nodes: --up
  - A list of nodes: -i n[0-2]
  - An attribute expression: -s 'attributes[_boot_config]=="DefaultBoot"'
- Directly edit the /etc/slurm/slurm.conf config file. See Work with Slurm Configuration File for details.
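The three node-selection forms listed above can be written out as complete command lines. The following is an illustrative sketch only: the `run` helper, the `DRY_RUN` variable, and the `add_*` function names are hypothetical and are not part of the ClusterWareAI software. They exist so that a command can be previewed before it is executed.

```shell
# Hypothetical helper: prints the command when DRY_RUN=1, otherwise executes it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Add every node that is currently "up".
add_up_nodes() { run slurm-cw.setup update-nodes --up; }

# Add an explicit list of nodes, for example 'n[0-2]'.
add_node_list() { run slurm-cw.setup update-nodes -i "$1"; }

# Add nodes that match an attribute expression.
add_by_selector() { run slurm-cw.setup update-nodes -s "$1"; }
```

With `DRY_RUN=1` set, `add_node_list 'n[0-2]'` prints `slurm-cw.setup update-nodes -i n[0-2]` instead of running it.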

Note

With configless Slurm, the Slurm image does NOT need to be reconfigured after new static nodes are added. Slurm automatically forwards the new information to the slurmd daemons on the nodes.

Remove Dynamic Slurm Nodes#

After a dynamic node is added to Slurm (that is, Slurm has recorded the node name), the node is not automatically removed from Slurm, even if the node is removed or changed within the ClusterWareAI software. To remove a dynamic node, use the scontrol delete command.

For example, to remove node n0:

scontrol delete node n0.cluster.local

Basic Slurm Commands#

The following sections describe a non-exhaustive set of common Slurm commands. Refer to the Slurm documentation for additional command details.

Tip

Depending on your Slurm configuration, some node, partition, queue, or other information may be hidden. If Slurm is configured with privacy settings, Slurm only shows information for the nodes you have a job running on.

sinfo#

sinfo shows information about cluster compute nodes, including node state.

Tip

Node states may be abbreviated. For example, ALLOCATED can also be ALLOC.

Common states include:

- ALLOCATED: All of the node's available resources are in use by one or more user jobs. A node in this state is active and fully utilized.
- COMPLETING: A node enters this state at the end of a job and leaves it when all of the job's processes have terminated. Nodes can get stuck in this state, particularly if there is a problem with the node or with another node that is part of the same job.
- DOWN: The node is unavailable for use. Typically, a node is in this state because an administrator or Slurm has deemed it unhealthy. In most Slurm configurations, a node leaves this state only when an administrator explicitly returns it to service. Slurm can be configured to move nodes from this state back to an available state automatically, but doing so often leads to job loops and is generally not advised.
- DRAINED: The node is unavailable at an administrator's request. Nodes enter this state from the DRAINING state once their jobs have completed, meaning the drain has finished.
- DRAINING: The node is transitioning to DRAINED but is still running previously assigned jobs. Slurm does not start new jobs on the node while the current jobs finish.
- IDLE: The node is ready to accept jobs.
- MAINT: The node is in a reservation that has a maintenance flag. A node in this state does not accept jobs from users, but an administrator could submit test jobs to the node through the associated reservation.
- MIXED: Some, but not all, of the node's processors have jobs running. Another job could run on the node in parallel with the current job.
- RESERVED: The node is in a reservation. A job may or may not be running on the node; the node is set aside for use under specific criteria, typically by a specific user, when they are ready. Jobs from other users are not scheduled on this node.
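To see how many nodes are in each of the states above, the sinfo output can be tallied. This is a minimal sketch, assuming sinfo is run with `-h` (no header), `-N` (one line per node), and the format string `"%N %T"` (node name and full state name). The function name `count_node_states` is introduced here for illustration.

```shell
# Reads "<node> <state>" lines on stdin and prints "<state> <count>" per state.
count_node_states() {
    awk 'NF >= 2 { count[$2]++ } END { for (s in count) print s, count[s] }' | sort
}

# Usage on a cluster where Slurm is available. "sort -u" drops duplicate
# lines for nodes that belong to more than one partition:
#   sinfo -h -N -o "%N %T" | sort -u | count_node_states
```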

squeue#

squeue shows a list of jobs in the queue. Because Slurm manages resources over time, a job can be in one of several states, and a job that cannot start yet has a reason associated with it. squeue shows both.

squeue has many options, but its default output shows the following columns:

- JOBID: An incrementing integer that identifies each new job.
- PARTITION: The partition in which the job runs. A job can only run in a single partition, which prevents it from spanning dissimilar hardware or potentially segregated networks that share unified storage. Users can submit a job to multiple partitions, but the job ultimately runs in only one of them.
- NAME: A user-supplied name, primarily used to identify the job instead of the job ID.
- USER: The user that submitted the job.
- ST: Abbreviation for state. Typical states are PENDING (PD), RUNNING (R), SUSPENDED (S), COMPLETING (CG), and COMPLETED (CD).
- TIME: How long the job has been running or waiting in the queue, depending on state.
- NODES: Number of nodes the user requested.
- NODELIST(REASON): Either the list of nodes on which the job is currently running or the top reason why the job is not currently running. See the Slurm documentation for a list of potential reasons.
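When only the jobs that are waiting matter, the state and reason columns described above can be filtered. This is a sketch under stated assumptions: squeue is run with `-h` (no header) and the format string `"%i %t %r"` (job ID, compact state, reason), and `pending_jobs` is a hypothetical helper name.

```shell
# Reads "<jobid> <state> <reason>" lines on stdin and prints
# "<jobid>: <reason>" for each pending (PD) job.
pending_jobs() {
    awk '$2 == "PD" { id = $1; sub(/^[^ ]+ +[^ ]+ +/, ""); print id ": " $0 }'
}

# Usage on a cluster where Slurm is available:
#   squeue -h -o "%i %t %r" | pending_jobs
```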

srun#

srun runs a command immediately if nodes are available to run a job. srun is a useful command for an administrator when debugging Slurm. See Troubleshooting Slurm for examples.

sbatch#

sbatch submits a script to the queue system. See Troubleshooting Slurm for examples.

scontrol#

scontrol is the primary command for a Slurm administrator. The command modifies the current Slurm configuration or state, including any node, partition, reservation, or the overall configuration.
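Two common node-state changes are draining a node and returning a node to service, both done with `scontrol update`. The wrapper below is an illustrative sketch: the function names and the `SCONTROL` override variable are hypothetical and exist only so that the commands can be previewed (for example, with `SCONTROL=echo`).

```shell
# SCONTROL can be overridden to preview commands; defaults to scontrol.
SCONTROL=${SCONTROL:-scontrol}

# Drain a node. $1 = node name, $2 = reason (Slurm requires a reason when draining).
drain_node() {
    "$SCONTROL" update NodeName="$1" State=DRAIN Reason="$2"
}

# Return a DOWN or DRAINED node to service. $1 = node name.
resume_node() {
    "$SCONTROL" update NodeName="$1" State=RESUME
}
```

For example, `resume_node n0.cluster.local` runs `scontrol update NodeName=n0.cluster.local State=RESUME`.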

Restarting Slurm#

If any services on the controller (slurmctld, slurmdbd, and munge) or on the compute nodes (slurmd and munge) are not running, use systemctl to start the individual service. Alternatively, use the following commands:

To restart Slurm cluster-wide: slurm-cw.setup cluster-restart

To restart Slurm on the controller: slurm-cw.setup restart

To restart Slurm on nodes: slurm-cw.setup restart-nodes

Note

Starting or restarting does not affect the Slurm image.
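Before restarting, it can help to see which of the services named above are actually stopped. This is a minimal sketch that uses `systemctl is-active`. The `check_services` function and the `SYSTEMCTL` override variable are hypothetical names introduced for illustration.

```shell
# SYSTEMCTL can be overridden (for example, for testing); defaults to systemctl.
SYSTEMCTL=${SYSTEMCTL:-systemctl}

# Prints the state of each named service; returns non-zero if any is not active.
check_services() {
    rc=0
    for svc in "$@"; do
        if "$SYSTEMCTL" is-active --quiet "$svc"; then
            echo "$svc: active"
        else
            echo "$svc: NOT active"
            rc=1
        fi
    done
    return "$rc"
}

# On the controller:   check_services slurmctld slurmdbd munge
# On a compute node:   check_services slurmd munge
```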

Troubleshooting Slurm#

The ClusterWareAI software shows scheduler data in the compute node attributes. See Workload Management for information about monitoring and troubleshooting schedulers using ClusterWareAI tools and commands.

To troubleshoot outside of the ClusterWareAI software, start by running the following command:

scontrol ping

If the command returns UP for one or more controllers, then you can infer:

- The node used to run the command either has a valid slurm.conf file or the SRV records pointing to the controller are functioning.
- The slurmctld controller is functioning and accepting RPC commands.
- The hostnames of the nodes that host the primary and backup controllers, which appear in the command output.

If the command does not return, the problem could be one of the following:

- There is no slurm.conf file or there are no SRV records pointing to the controller.
- There is a network connectivity issue between the node and the controller.
- The slurmctld service may be stopped on the controller nodes.
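Because a failing scontrol ping may hang instead of returning, a script can bound the wait and then inspect the output. This sketch assumes that each controller line in the output contains the word UP or DOWN; the exact wording can vary between Slurm versions. The `controller_is_up` name and the 15-second limit are illustrative choices.

```shell
# Reads `scontrol ping` output on stdin; succeeds if any controller reports UP.
controller_is_up() {
    grep -qw 'UP'
}

# Usage on a node where Slurm is available (timeout is from GNU coreutils):
#   if timeout 15 scontrol ping | controller_is_up; then
#       echo "controller reachable"
#   else
#       echo "no controller responded"
#   fi
```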

Use the following command to submit a simple job to the scheduler:

srun hostname

If the command is successful, then you can infer:

- The slurmctld controller is functioning, accepted your request, and scheduled the job.
- At least one node in the cluster is functioning.
- Your user has sufficient permissions to submit Slurm jobs.

If the command does not return or fails in any way, the cause could be one of the following:

- There is a problem with the Slurm configuration.
- No nodes are available to run the job.
- Your user is not allowed to submit jobs.

A next step could be running a larger srun command specifying the number of nodes required for the job. For example:

srun -N4 hostname

Here, -N4 requests four nodes for the job. The output of this command should show four distinct hostnames of compute nodes in the cluster.
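Checking for four distinct hostnames by eye is easy to get wrong on larger runs, so the count can be automated. This is a minimal sketch; `count_distinct_hosts` is a hypothetical helper name.

```shell
# Reads hostnames on stdin and prints the number of distinct names.
count_distinct_hosts() {
    sort -u | wc -l | tr -d ' '
}

# Usage on a cluster where Slurm is available:
#   [ "$(srun -N4 hostname | count_distinct_hosts)" -eq 4 ] && echo "4 nodes responded"
```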

Running sinfo displays the partitions and their current status, along with the number of nodes in each state within those partitions. Ideally, some nodes are idle and can be used for debugging. An asterisk (*) appended to a node state means the node is failing to communicate with the controller, so in a healthy cluster no nodes are in a state such as idle*, alloc*, or drain*. (An asterisk after a partition name has a different meaning: it marks the default partition.) If a partition shows UP and has available nodes, the partition is likely functioning correctly and is ready to accept jobs. Use the srun hostname command to confirm.
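Nodes that are not communicating with the controller can be listed by filtering for the asterisk suffix. This sketch assumes sinfo is run with `-h`, `-N`, and the format string `"%N %t"` (node name and compact state). `unresponsive_nodes` is a hypothetical helper name.

```shell
# Reads "<node> <state>" lines on stdin and prints each node whose state ends in "*".
unresponsive_nodes() {
    awk '$2 ~ /\*$/ { print $1 }' | sort -u
}

# Usage on a cluster where Slurm is available:
#   sinfo -h -N -o "%N %t" | unresponsive_nodes
```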

To submit a test batch job to the cluster:

Write a sample job file, for example test.sh, and mark it as executable. A sample job can be as simple as the following:

#!/bin/bash
env       # Print the job submission environment
sleep 300 # Sleep for 5 minutes

Submit the job by running:
```
sbatch test.sh
```
The command should report a successful submission and print the job ID.
Check the status of the job by running:
```
squeue
```
The output should list your submitted job and its status. On an idle cluster, the job should be accepted immediately upon submission and show R (running) in the ST column. After the 5-minute sleep, the job output is available in a file named slurm-<jobid>.out, which Slurm writes by default to the directory from which the job was submitted. For the sample job, this file contains the output of the env command.
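The submit, check, and read-output steps above can be tied together by capturing the job ID. This is a minimal sketch, assuming sbatch prints its default `Submitted batch job <id>` message. `job_id_from_sbatch` is a hypothetical helper name.

```shell
# Extracts the numeric job ID from sbatch's "Submitted batch job <id>" message.
job_id_from_sbatch() {
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*/\1/p'
}

# Usage on a cluster where Slurm is available:
#   jobid=$(sbatch test.sh | job_id_from_sbatch)
#   squeue -j "$jobid"          # show only this job
#   cat "slurm-${jobid}.out"    # default output file, once the job has finished
```

As an alternative to parsing the message, sbatch has a `--parsable` option that prints the job ID in a machine-readable form.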