Configure Job Schedulers#

The ClusterWareAI™ software supports integration with the Slurm and OpenPBS job schedulers. Job schedulers are used to manage and run workloads requested by multiple users. The job scheduler grants each user a portion of the cluster resources for a given time to run their job.

Slurm is designed for batch workloads, such as AI training, scientific simulations, or data analysis tasks. Slurm is fast and easy to use when a workload has a defined ending and when you know the scope of resources required for the job. Slurm can also plan for future resource needs and schedule jobs based on availability in the future, which optimizes cluster usage over time.

The ClusterWareAI software also supports integration with Kubernetes. Kubernetes excels at scaling applications to handle high traffic or large datasets for workloads that run continuously. Kubernetes could be a good choice for AI inference, for example, whereas Slurm could be a good choice for AI training. See Configure Kubernetes for details.

The default ClusterWareAI installation for RHEL/CentOS includes the Slurm and OpenPBS (RHEL/CentOS 8 only) packages. It is recommended to install these optional packages on a separate administrative node or scheduler server from the ClusterWareAI head node. While the two schedulers can coexist on the same server, only one can be enabled and executing on the server at any time.

Note

The ClusterWareAI software no longer ships with PBS TORQUE. However, PBS TORQUE is available in older ClusterWareAI EL7 packages, which are no longer updated. While PBS TORQUE is no longer tested, you should be able to run it with your ClusterWareAI cluster.

Job Scheduler Configuration Prerequisites#

Complete the following steps before installing and configuring Slurm or OpenPBS to work with the ClusterWareAI platform.

Resolve Job Scheduler Hostname#

All nodes in the job scheduler cluster must be able to resolve hostnames of all other nodes as well as the scheduler server hostname. The ClusterWareAI platform provides a DNS server in the clusterware-dnsmasq package, as discussed in Node Name Resolution. This dnsmasq resolves all compute node hostnames.

Add the job scheduler's hostname to /etc/hosts on the head node(s) so that dnsmasq can resolve it.
Restart the clusterware-dnsmasq service after editing /etc/hosts by running:
```
sudo systemctl restart clusterware-dnsmasq
```
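Before continuing, it can help to confirm that the names actually resolve. The following is a minimal sketch and not part of the ClusterWareAI tooling: the function name `unresolved_hosts`, the `LOOKUP` override, and the example hostname `sched01` are illustrative assumptions. It relies on the standard `getent hosts` lookup, which uses the resolver configuration of the machine it runs on.

```shell
# Print each hostname that cannot be resolved; print nothing if all resolve.
# LOOKUP (hypothetical override) names the lookup command; default: getent hosts.
unresolved_hosts() {
    lookup="${LOOKUP:-getent hosts}"
    for host in "$@"; do
        # $lookup is intentionally unquoted so "getent hosts" splits into words.
        $lookup "$host" >/dev/null 2>&1 || echo "$host"
    done
}
```

For example, running `unresolved_hosts sched01 n0 n1 n2 n3` on a compute node and on the scheduler server lists any name that still needs an /etc/hosts entry or DNS record.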

Create Job Scheduler Image#

Installing and configuring a job scheduler requires making changes to the compute node software. When using image-based compute nodes, clone the DefaultImage to create a new image so that the DefaultImage remains an untouched, known-functional baseline.

For example, to set up nodes n0 through n3:

Clone the default image:

cw-imgctl -i DefaultImage clone name=jobschedImage

Clone the default boot configuration and add the new image to the new boot configuration:
```
cw-bootctl -i DefaultBoot clone name=jobschedBoot image=jobschedImage
```
Set the boot configuration with the new image on nodes n0-n3:
```
cw-nodectl -i n[0-3] set _boot_config=jobschedBoot
```

When these nodes reboot after all the Slurm- or OpenPBS-specific setup steps are complete, they will use the jobschedBoot and jobschedImage.
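The three commands above can be collected into a small helper so that the same procedure is repeatable for other image names or node ranges. This is only a sketch: the function names `run` and `prepare_jobsched_nodes` and the `DRY_RUN` variable are hypothetical, and only the `cw-imgctl`, `cw-bootctl`, and `cw-nodectl` invocations shown earlier are used.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Clone the default image and boot configuration, then assign the new boot
# configuration to the given nodes. Stops at the first command that fails.
prepare_jobsched_nodes() {
    image="$1"; boot="$2"; nodes="$3"
    run cw-imgctl -i DefaultImage clone "name=$image" &&
    run cw-bootctl -i DefaultBoot clone "name=$boot" "image=$image" &&
    run cw-nodectl -i "$nodes" set "_boot_config=$boot"
}
```

Running `DRY_RUN=1 prepare_jobsched_nodes jobschedImage jobschedBoot 'n[0-3]'` prints the commands for review without changing the cluster. Quote the node range so the shell does not treat the brackets as a glob pattern.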

The following sections describe the installation and configuration of each job scheduler type.

Configure Slurm#

Tip

See https://slurm.schedmd.com/faq.html#torque for useful information about how to transition from OpenPBS or PBS TORQUE to Slurm.

The default ClusterWareAI Slurm configuration is configless and uses dynamic Slurm nodes. This reduces the admin effort needed when updating the list of compute nodes. See https://slurm.schedmd.com/configless_slurm.html and https://slurm.schedmd.com/dynamic_nodes.html for more information.

Alternatively, you can configure ClusterWareAI and Slurm to use static nodes or a combination of static and dynamic nodes.

Dynamic Slurm Nodes (default): When new nodes are added to a ClusterWareAI cluster and booted with a Slurm image, they are automatically added to Slurm. Dynamic nodes are not automatically removed from Slurm scontrol, even if the node is removed or changed within the ClusterWareAI platform.
Static Slurm Nodes: Static nodes need to be manually configured to be added to Slurm. Static Slurm nodes were the default prior to the ClusterWareAI 13.0 release.
Mix of Dynamic and Static Slurm Nodes: You can use a mix of dynamic and static nodes. Dynamic and static nodes can use the same Slurm image.

Configless Slurm is enabled by the "SlurmctldParameters=enable_configless" setting in /etc/slurm/slurm.conf. A DNS SRV record called slurmctld_primary is also created. To see the details about the SRV record, run:

cw-clusterctl hosts -i slurmctld_primary ls -l

For clusters with a backup Slurm controller, create a slurmctld_backup DNS SRV record:

cw-clusterctl --hidden hosts create name=slurmctld_backup port=6817 \
    service=slurmctld domain=cluster.local target=backuphostname \
    type=srvrec priority=20
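
To confirm what configless clients will see, the SRV records can also be queried over DNS. The sketch below assumes that `dig` is installed, that the records are published under the standard configless Slurm name `_slurmctld._tcp` in the `cluster.local` domain used above, and that the resolver being queried is the ClusterWareAI dnsmasq. The helper name `srv_by_priority` is hypothetical. `dig +short` prints each SRV record as `priority weight port target`, and the record with the lowest priority value is tried first.

```shell
# Read "priority weight port target" lines (dig +short SRV output) on stdin
# and print "target:port", ordered from most to least preferred.
srv_by_priority() {
    sort -n -k1,1 | awk 'NF == 4 { sub(/\.$/, "", $4); print $4 ":" $3 }'
}

# Example (not run here):
#   dig +short SRV _slurmctld._tcp.cluster.local | srv_by_priority
```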

Install Slurm#

Complete the job scheduler configuration prerequisites.
Install Slurm software on the job scheduler controller.
- For RHEL/CentOS 8:
```
sudo dnf install slurm-cw --enablerepo=cw* --enablerepo=powertools
```
- For RHEL/CentOS 9 and 10:
```
sudo dnf install slurm-cw --enablerepo=cw* --enablerepo=crb
```
Note

An additional RPM package, slurm-cw-slurmrestd, is available. See https://slurm.schedmd.com/slurmrestd.html for details. The slurm-cw-slurmrestd package is not installed by default. To install the package, run dnf --enablerepo=cw* install slurm-cw-slurmrestd.

Configure either dynamic (default) or static Slurm nodes.
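
The only difference between the two install commands is the extra distribution repository: `powertools` on EL8 and `crb` on EL9 and EL10. A script that targets several releases could select it from the `VERSION_ID` field of `/etc/os-release`, as in this sketch. The function name `extra_repo_for` is hypothetical, and the `dnf` line is shown as a comment because it is only meant to be run on the job scheduler controller.

```shell
# Map an os-release VERSION_ID (for example "8.10" or "9.4") to the extra
# repository named in the install commands above.
extra_repo_for() {
    case "${1%%.*}" in
        8)    echo powertools ;;
        9|10) echo crb ;;
        *)    echo "unsupported release: $1" >&2; return 1 ;;
    esac
}

# Example (not run here):
#   . /etc/os-release
#   sudo dnf install slurm-cw --enablerepo=cw* --enablerepo="$(extra_repo_for "$VERSION_ID")"
```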

Configure with Dynamic Slurm Nodes#

ClusterWareAI with Slurm uses dynamic Slurm nodes by default.

Use the slurm-cw.setup helper script to complete the initialization and install the Slurm RPMs on the controller. You must have ClusterWareAI administrator permissions to run this command.
```
slurm-cw.setup init
```
init generates /etc/slurm/slurm.conf, /etc/slurm/cgroup.conf, and /etc/slurm/slurmdbd.conf, starts munge, slurmctld, mariadb, and slurmdbd, and restarts slurmctld.
For diskless nodes only: Set up the boot configuration and Slurm image and apply them to the compute nodes.
1. Update the image you created during the prerequisite steps to include Slurm installation and configuration details:
```
slurm-cw.setup update-image <slurm image>
```
  Where <slurm image> is replaced by the name of the image file you created during the prerequisite steps.
2. Reboot the compute nodes for the image changes to take effect:
```
cw-nodectl -i <node list> reboot
```
  Where <node list> is replaced by a list of nodes. For example, n[0-3,14,17-22].
  
  After reboot, the nodes with the Slurm image applied automatically join as dynamic Slurm nodes.
For diskful nodes only: Install Slurm on the nodes and reboot. The nodes will automatically join as dynamic Slurm nodes. One way to install Slurm on diskful nodes is to add them as static nodes and then remove them from the slurm.conf file after initialization.
Check the Slurm status to ensure all expected nodes are listed:
```
slurm-cw.setup status
```
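
If the status output is long, a direct comparison against the list of nodes you expect can be easier to read. The following sketch uses the standard Slurm command `sinfo -N -h -o '%N'`, which prints one node name per line. The helper name `missing_nodes` is hypothetical and the node names are examples.

```shell
# Read the node names Slurm knows about on stdin (one per line) and print
# every expected node, given as arguments, that is not among them.
missing_nodes() {
    known="$(cat)"
    for node in "$@"; do
        # -x matches whole lines only, so "n1" does not match "n10".
        printf '%s\n' "$known" | grep -qxF -- "$node" || echo "$node"
    done
}

# Example (not run here):
#   sinfo -N -h -o '%N' | missing_nodes n0 n1 n2 n3
```

An empty result means every expected node is registered with Slurm.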

To prevent a node from being added as a dynamic Slurm node, set the _slurmd=NoDynamic reserved attribute. Setting the _slurmd reserved attribute does not impact static Slurm nodes. For example, to set it on node n1:

cw-nodectl -i n1 set _slurmd=NoDynamic

Configure with Static Slurm Nodes#

Unlike with dynamic Slurm nodes, static Slurm nodes need to be added to Slurm explicitly.

Use the slurm-cw.setup helper script to complete the initialization, install the Slurm RPMs on the controller, and run slurmd on specified nodes. You must have ClusterWareAI administrator permissions to run this command.
```
slurm-cw.setup init <nodes>
```
Where <nodes> is replaced by:
- All “up” nodes: --up
- A list of nodes: -i n[0-2]
- An expression attribute: -s 'attributes[_boot_config]=="DefaultBoot"'
init generates /etc/slurm/slurm.conf, /etc/slurm/cgroup.conf, and /etc/slurm/slurmdbd.conf, starts munge, slurmctld, mariadb, and slurmdbd, and restarts slurmctld. Next, init tries to install slurm-cw-node on the selected live nodes. After that installation succeeds, the slurm-cw.setup script starts slurmd on the selected live nodes and those nodes are added to /etc/slurm/slurm.conf as static Slurm nodes.
For diskless nodes only: Reboot the nodes with the Slurm image.

Note

These steps are not required for diskful nodes as Slurm is installed directly on the disk via slurm-cw.setup init.
1. Update the image you created during the prerequisite steps to include Slurm installation and configuration details:
```
slurm-cw.setup update-image <slurm image>
```
  Where <slurm image> is replaced by the name of the image file you created during the prerequisite steps.
2. Reboot the compute nodes for the image changes to take effect:
```
cw-nodectl -i <node list> reboot
```
  Where <node list> is replaced by a list of nodes. For example, n[0-3,14,17-22].
Check the Slurm status to ensure all expected nodes are listed:
```
slurm-cw.setup status
```

Configure Auto Remediation Service (ARS) with Slurm#

If you are enabling ARS, complete the following steps:

Copy valid slurm.conf and munge.key files from the Slurm scheduler controller to all head nodes as /etc/slurm/slurm.conf and /etc/munge/munge.key.
Open the firewall for each head node IP address on the Slurm scheduler controllers (primary and secondary).
Find the hostnames of all slurmctld hosts defined in the SlurmctldHost= lines of /etc/slurm/slurm.conf.
Add the IP addresses and hostnames of the slurmctld hosts to /etc/hosts on all head nodes using the following format:
```
<IP Address> <hostname>
```
For example:
```
10.110.1.1 slurmcontrol-primary
10.110.1.2 slurmcontrol-secondary
```
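
The hostnames in the SlurmctldHost= lines can be pulled out with a short `sed` expression instead of by eye. This is a sketch under two assumptions: the keyword is spelled exactly `SlurmctldHost` in the file, and any address is given in the optional `hostname(address)` form that slurm.conf allows. The function name `slurmctld_hostnames` is hypothetical.

```shell
# Print the hostname from every SlurmctldHost= line of a slurm.conf file.
# An optional "(address)" suffix and trailing comments are dropped;
# commented-out lines are ignored.
slurmctld_hostnames() {
    sed -n 's/^[[:space:]]*SlurmctldHost=\([^([:space:]#]*\).*/\1/p' "$1"
}

# Example (not run here), executed on the Slurm scheduler controller to
# print lines in "<IP Address> <hostname>" form:
#   for h in $(slurmctld_hostnames /etc/slurm/slurm.conf); do getent hosts "$h"; done
```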

Modify the ARS state machine container quadlet on all head nodes.

On a RHEL or Rocky 9 head node, create the drop-in slurm.conf file for the ARS state machine container quadlet:

sudo cp /etc/containers/systemd/ars-state-machine.container.d/slurm.conf.example \
     /etc/containers/systemd/ars-state-machine.container.d/slurm.conf

On a RHEL or Rocky 8 head node:

Copy the following lines from the /etc/containers/systemd/ars-state-machine.container.d/slurm.conf.example file:

Volume=/etc/munge/munge.key:/run/secrets/munge.key:ro,z
Volume=/etc/slurm/slurm.conf:/run/secrets/slurm.conf:ro,z

Add the copied lines to the /etc/containers/systemd/ars-state-machine.container file. For example:

[Unit]
Description=ARS State Machine

[Container]
Image=cw-embedded-registry.internal/ars-state-machine
Volume=/opt/scyld/clusterware/workspace/sys-ars-settings.ini:/root/.scyldcw/settings.ini:ro,z
Volume=/etc/munge/munge.key:/run/secrets/munge.key:ro,z
Volume=/etc/slurm/slurm.conf:/run/secrets/slurm.conf:ro,z
Pull=missing

[Service]
Restart=always
RestartSec=30s

[Install]
WantedBy=multi-user.target
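
Before reloading, a quick check that both mounts are present can catch a partial copy. The sketch below simply searches for the two Volume= lines exactly as they are written above. The function name `has_slurm_mounts` is hypothetical.

```shell
# Succeed only if both Slurm-related Volume= lines appear in the given file(s).
has_slurm_mounts() {
    grep -qxF 'Volume=/etc/munge/munge.key:/run/secrets/munge.key:ro,z' "$@" &&
    grep -qxF 'Volume=/etc/slurm/slurm.conf:/run/secrets/slurm.conf:ro,z' "$@"
}

# Example (not run here):
#   has_slurm_mounts /etc/containers/systemd/ars-state-machine.container || echo "mounts missing"
```

On a head node that uses the drop-in file instead, pass the path of the drop-in slurm.conf as the argument.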

Reload the daemon and restart the ARS state machine service:

sudo systemctl daemon-reload
sudo systemctl restart ars-state-machine

Follow the steps in Add Workload Scheduler to ARS-monitored Compute Nodes to complete Slurm configuration with ARS.

Note

If you stop using Slurm, remove the drop-in slurm.conf file or the inserted lines from the ARS state machine quadlet container on all head nodes, then reload and restart the ARS state machine service by running:

sudo rm -f /etc/containers/systemd/ars-state-machine.container.d/slurm.conf
sudo systemctl daemon-reload
sudo systemctl restart ars-state-machine

Work with Slurm#

When a node boots, the ClusterWareAI script brings it up as a static Slurm node if it is configured in slurm.conf and as a dynamic Slurm node if it is not. If, however, the _slurmd reserved attribute is set to NoDynamic, ClusterWareAI does not attempt to bring the node up as a dynamic Slurm node. Setting the _slurmd reserved attribute to NoDynamic has no impact on static Slurm nodes.

You can view the Slurm status on the server and compute nodes by running:

slurm-cw.setup status

Start and stop the Slurm service cluster-wide by running:

slurm-cw.setup cluster-stop
slurm-cw.setup cluster-start

Slurm User Access#

Slurm executable commands and libraries are installed in /opt/scyld/slurm/. The Slurm controller configuration can be found in /etc/slurm/slurm.conf and each configless node caches a copy of that slurm.conf file in /var/spool/slurmd/conf-cache/.

You can inject users into the compute node image using the sync-uids script. You can inject all users, a selected list of users, or a single user. For example, inject the single user janedoe:

/opt/scyld/clusterware-tools/bin/sync-uids \
              -i slurmImage --create-homes \
              --users janedoe --sync-key janedoe=/home/janedoe/.ssh/id_rsa.pub

See Configure Administrator Authentication and /opt/scyld/clusterware-tools/bin/sync-uids -h for details.

Each Slurm user must set up the PATH and LD_LIBRARY_PATH environment variables to properly access the Slurm commands. This is done automatically for users who log in when Slurm is running via the /etc/profile.d/cw.slurm.sh script. Alternatively, each Slurm user can manually execute module load Slurm or can add that command line to (for example) the user's ~/.bash_profile or ~/.bashrc.

Work with Slurm Configuration File#

After initialization, you can manually edit the Slurm configuration file /etc/slurm/slurm.conf to add or remove static Slurm nodes.

You can also use the slurm.conf file to add or remove partitions to set up alternative queues for nodes.
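
Before editing, it can be useful to list the node and partition definitions that are currently active. A minimal sketch, assuming the definitions start with the standard slurm.conf keywords `NodeName=` and `PartitionName=`; the function name `show_nodes_and_partitions` is hypothetical.

```shell
# Print the NodeName= and PartitionName= definitions from a slurm.conf file,
# prefixed with their line numbers. Commented-out lines are ignored.
show_nodes_and_partitions() {
    grep -nE '^[[:space:]]*(NodeName|PartitionName)=' "$1"
}

# Example (not run here):
#   show_nodes_and_partitions /etc/slurm/slurm.conf
```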

You can generate a new Slurm configuration file for specified nodes without reconfiguring the database or controller. This is rarely needed because it resets the Slurm configuration.

slurm-cw.setup reconfigure <nodes>

Where <nodes> is replaced by:

All “up” nodes: --up
A list of nodes: -i n[0-2]
An expression attribute: -s 'attributes[_boot_config]=="DefaultBoot"'
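
Because reconfigure resets the Slurm configuration, it may be worth keeping a copy of the current file first. This is a minimal sketch, assuming a timestamped copy next to the original is acceptable. The helper name `backup_conf` and the backup naming scheme are hypothetical and are not part of slurm-cw.setup.

```shell
# Copy a file to "<file>.bak.<YYYYmmddHHMMSS>", preserving its mode and
# timestamps, and print the name of the backup on success.
backup_conf() {
    backup="$1.bak.$(date +%Y%m%d%H%M%S)"
    cp -p -- "$1" "$backup" && echo "$backup"
}

# Example (not run here; requires write access to /etc/slurm):
#   backup_conf /etc/slurm/slurm.conf && slurm-cw.setup reconfigure --up
```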

Manage Slurm#

See Manage Slurm for details about adding and removing nodes, troubleshooting Slurm, and common Slurm commands.

Configure OpenPBS#

OpenPBS is only available for RHEL/CentOS 8 clusters.

See Job Schedulers for general job scheduler information and configuration guidelines. See https://www.openpbs.org for OpenPBS documentation.

First, install the OpenPBS software on the job scheduler server:

sudo dnf install openpbs-scyld --enablerepo=cw* --enablerepo=scyld*

Use a helper script to complete the initialization and set up the job scheduler and config file in the compute node image(s).

Note

The openpbs-scyld.setup script performs the init, reconfigure, and update-nodes actions (described below) by default against all up nodes. Those actions optionally accept a node-specific argument using the syntax [--ids|-i <NODES>] or a group-specific argument using [--ids|-i %<GROUP>]. See Attribute Groups for details.

openpbs-scyld.setup init                      # default to all 'up' nodes
openpbs-scyld.setup update-image openpbsImage # for permanence in the image

Reboot the compute nodes to bring them into active management by OpenPBS. Check the OpenPBS status:

openpbs-scyld.setup status

# If the OpenPBS daemon is not executing, then:
openpbs-scyld.setup cluster-restart

# And check the status again

This cluster-restart is a manual one-time setup that doesn't affect the openpbsImage. The update-image is necessary for persistence across compute node reboots.

Generate new OpenPBS-specific config files with:

openpbs-scyld.setup reconfigure      # default to all 'up' nodes

Add nodes by executing:

openpbs-scyld.setup update-nodes     # default to all 'up' nodes

or add or remove nodes by executing qmgr.

Any such changes must be added to openpbsImage by reexecuting:

openpbs-scyld.setup update-image openpbsImage

and then either reboot all the compute nodes with that updated image, or additionally execute:

openpbs-scyld.setup cluster-restart

to manually push the changes to the up nodes without requiring a reboot.
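
The persist-then-push sequence described above can be wrapped in one function so that the image update is never forgotten. This is only a sketch: the function name `openpbs_apply_changes` and the `DRY_RUN` variable are hypothetical, and the only commands issued are the `openpbs-scyld.setup update-image` and `cluster-restart` actions already shown.

```shell
# Persist scheduler changes into the image, then push them to the up nodes.
# With DRY_RUN=1 the commands are printed instead of executed.
openpbs_apply_changes() {
    image="$1"
    for step in "update-image $image" "cluster-restart"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "openpbs-scyld.setup $step"
        else
            # $step is intentionally unquoted so it splits into arguments.
            openpbs-scyld.setup $step || return 1
        fi
    done
}
```

Running `DRY_RUN=1 openpbs_apply_changes openpbsImage` prints the two commands for review. The sketch assumes the image name contains no spaces.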

Inject users into the compute node image using the sync-uids script. The administrator can inject all users, or a selected list of users, or a single user. For example, inject the single user janedoe:

/opt/scyld/clusterware-tools/bin/sync-uids \
              -i openpbsImage --create-homes \
              --users janedoe --sync-key janedoe=/home/janedoe/.ssh/id_rsa.pub

See Configure Administrator Authentication and /opt/scyld/clusterware-tools/bin/sync-uids -h for details.

To view the OpenPBS status on the server and compute nodes:

openpbs-scyld.setup status

The OpenPBS service can also be started and stopped cluster-wide with:

openpbs-scyld.setup cluster-stop
openpbs-scyld.setup cluster-start

OpenPBS executable commands and libraries are installed in /opt/scyld/openpbs/. Each OpenPBS user must set up the PATH and LD_LIBRARY_PATH environment variables to properly access the OpenPBS commands. This is done automatically for users who log in when OpenPBS is running via the /etc/profile.d/scyld.openpbs.sh script. Alternatively, each OpenPBS user can manually execute module load openpbs or can add that command line to (for example) the user's ~/.bash_profile or ~/.bashrc.