Configure ClusterWareAI Health Monitoring System and Auto Remediation Service

The ClusterWareAI™ Health Monitoring System (CHMS) and Auto Remediation Service (ARS) are installed and enabled on all head nodes by default when you run the cw-install script. However, additional compute node configuration is required before you can monitor and automatically remediate compute nodes in your cluster.

Note

If your cluster is in an air-gapped environment, contact Penguin Computing before you begin ARS configuration.

Enable CHMS on Compute Nodes

You can enable compute node health checks with or without enabling ARS. A set of default health checks is available, or you can create your own.

Add health checks to the compute nodes you want to monitor using the _ars_checks reserved attribute. See Add Health Checks to Nodes for details.

Tip

Many of the default health checks have an associated label used to group checks that are commonly run together. For example, the gpu-nvidia label has health checks specific to NVIDIA GPUs.

Enable ARS on Compute Nodes

Important

Enabling automated remediation on administration nodes, such as a cluster login node, is not recommended. Some of the automated remediation solutions can take the node offline and, if set on an administration node, could interrupt cluster availability.

Enable ARS on the nodes that you want the ClusterWareAI software to automatically remediate if health checks fail.

1. Confirm that all compute nodes have a valid power URI configuration. See Compute Node Power Control for details.
2. Set the _ars_state reserved attribute to new_node.
3. After the attribute is set, the compute nodes should power on (if previously powered off), move to the Provisioning state, and start running assigned health checks. You can check the node state via the ARS State Machine page in the ClusterWareAI GUI or using the --ars argument:
```
[admin@head1 ~]$ cw-nodectl -i n[0-15] status --ars
n[0-1] available
n[2-15] provisioning
```
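
When scripting around this check, it can help to test whether every node has reached the Available state. The following is a minimal sketch and not part of the ClusterWareAI tooling: `ars_all_available` is a hypothetical helper name, and it assumes each output line has the form `<nodes> <state>`, as in the example above.

```shell
# Succeeds only if every non-empty status line reports the "available" state.
# Assumption: each line is "<node spec> <state>", with the state as the last field.
ars_all_available() {
    awk 'NF { seen = 1; if ($NF != "available") bad = 1 }
         END { if (seen && !bad) exit 0; exit 1 }'
}

# Possible usage on a head node (requires a configured cluster):
#   cw-nodectl -i n[0-15] status --ars | ars_all_available && echo "all nodes available"
```

Empty input is treated as a failure so that a command that prints nothing is not mistaken for a healthy cluster.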

Add Workload Scheduler to ARS-monitored Compute Nodes

If you are using a workload scheduler, such as Slurm or Kubernetes, ARS communicates with the workload scheduler to drain jobs from compute nodes if a health check fails. Once the health issue is resolved, ARS communicates with the workload scheduler again so the node can resume work.

To allow ARS to communicate with your workload scheduler:

If you are using Slurm, complete the configuration steps in Configure Auto Remediation Service (ARS) with Slurm.
If you are using Kubernetes:
1. Match the node name used by Kubernetes to the hostname and domain used by the ClusterWareAI software. See Configure Kubernetes with the Node Package on Operating System or Configure Kubernetes with the ClusterWareAI Container Registry for configuration steps.
2. If your compute nodes use the clusterware-node container on a RHEL-derived OS with SELinux in enforcing mode, the container may need additional SELinux permissions. Contact Penguin Computing for assistance.
Use the _ars_scheduler reserved attribute to specify the workload scheduler managing each node. Values include:
- none: no workload scheduler is used for the compute node. Use this if the ClusterWareAI software does not need to send commands to a workload scheduler before moving the node between ARS states.
- slurm: Slurm is used to schedule workloads. If the node fails a health check and requires remediation, the ClusterWareAI software moves the node to the Draining state and sends a drain request to Slurm. When Slurm responds that the drain succeeded, the node moves to the Drained state. When the remediation succeeds and the node moves from Provisioning to Available, the ClusterWareAI software sends a resume request to Slurm.
- kube:<provider UID>: Kubernetes is used to schedule workloads. You can have one or more kubevirt providers, differentiated by provider UID. If the node fails a health check and requires remediation, the ClusterWareAI software moves the node to the Draining state and sends a drain request to the kubevirt provider. When the drain succeeds, the node moves to the Drained state. When the remediation succeeds and the node moves from Provisioning to Available, the ClusterWareAI software sends a resume request to the kubevirt provider.
You can have multiple scheduler types in your cluster, but each compute node can have only one scheduler assigned via the attribute. Leaving the attribute unset on a node with checks assigned is the same as setting the value to none.
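
As an illustration of the values described above, the following sketch builds the `cw-nodectl ... set` command for a single node and rejects values outside the three listed forms. It prints the command instead of running it so that you can review it first. The function name `ars_scheduler_cmd` is hypothetical, and the sketch assumes that none, slurm, and kube:<provider UID> are the only values in use on your cluster; the provider UID itself is not validated.

```shell
# Print (not run) the cw-nodectl command that assigns a scheduler to a node.
# Accepts "none", "slurm", or "kube:" followed by a non-empty provider UID.
ars_scheduler_cmd() {
    node=$1
    sched=$2
    case "$sched" in
        none|slurm|kube:?*) ;;
        *) echo "unsupported _ars_scheduler value: $sched" >&2; return 1 ;;
    esac
    printf 'cw-nodectl -i %s set _ars_scheduler=%s\n' "$node" "$sched"
}

ars_scheduler_cmd n0 slurm
# prints: cw-nodectl -i n0 set _ars_scheduler=slurm
```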

Tip

In most clusters, multiple compute nodes use the same workload scheduler, so you can set the _ars_scheduler reserved attribute in an attribute group and add the nodes to that group.

Test CHMS and ARS Configuration

Use the ArsPanic default health check to test your CHMS and ARS configuration to make sure all required services are running and that compute nodes are moving through the ARS state map appropriately.

On a test compute node with a valid power URI configuration:

Add the ArsPanic health check, enable ARS, and specify the workload scheduler:

```
cw-nodectl -i n0 set _ars_checks=ArsPanic _ars_state=new_node _ars_scheduler=none
```

Wait until the node is in the Available state. You can check using the ARS State Machine page of the ClusterWareAI GUI or by running:
```
cw-nodectl -i n0 status --ars
```
Create a panic file on the compute node so that the ArsPanic health check fails:
```
cw-nodectl -i n0 ssh touch /panic
```
Monitor the ARS log files to watch the node move to the Auto Remediation state and reboot to resolve the issue. Logging is available via MQTT or journalctl. See ARS Logging for details.

Alternatively, use the ARS State Machine and ARS Node Details pages of the ClusterWareAI GUI to monitor the node's auto remediation.

Eventually the compute node should return to the Available state after auto remediation completes and the node is re-checked during the Provisioning state. If you encounter errors, there may be an issue with your CHMS or ARS configuration.

After successfully testing your configuration, you can remove the ArsPanic health check and assign appropriate health checks to the node for your production environment.
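
The waiting steps in this test can be scripted instead of checked by hand. The following is a minimal polling sketch, not a ClusterWareAI feature: `wait_for_ars_state` is a hypothetical helper, the retry count and interval are arbitrary, and it assumes that the status command for a single node prints one line whose last field is the state (for example, `n0 available`).

```shell
# Poll until the status command reports the wanted state, or give up.
# usage: wait_for_ars_state <state> <max_tries> <sleep_seconds> <command...>
wait_for_ars_state() {
    want=$1; tries=$2; pause=$3
    shift 3
    while [ "$tries" -gt 0 ]; do
        # Assumption: the state is the last field of the first output line.
        state=$("$@" | awk 'NR == 1 { print $NF }')
        if [ "$state" = "$want" ]; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep "$pause"
        fi
    done
    return 1
}

# Possible usage on a head node (30 tries, 10 seconds apart):
#   wait_for_ars_state available 30 10 cw-nodectl -i n0 status --ars
```

Passing the status command as arguments keeps the helper independent of the cluster, so it can be exercised with a stand-in command before it is pointed at real nodes.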

Example: Configure CHMS and ARS on Slurm Compute Nodes

The following example shows how to set up CHMS and ARS on 10 compute nodes that use Slurm as a job scheduler. The compute nodes have the same bare metal hardware configuration and therefore should run the same set of health checks.

Set power_uri with the appropriate BMC IP address and username/password access credentials on all nodes. For example, on node n0:
```
cw-nodectl -i n0 update power_uri=ipmi:///admin:password@172.45.88.1
```
Create a new attribute group:
```
cw-attribctl create name=ArsSlurmNodes
```
Set the appropriate reserved attributes on the new attribute group. The reserved attributes enable ARS on the nodes, specify Slurm as the workload scheduler, and set a list of health checks to run on the nodes.
```
cw-attribctl -i ArsSlurmNodes set _ars_state=new_node _ars_scheduler=slurm \
  _ars_checks=%default,%slurm,MemoryConfiguration
```
In this example, the health checks are set via labels (default and slurm) and as an individual check (MemoryConfiguration). If a label is later added to or removed from a health check, the list of checks automatically updates on all nodes that use that label.
Join the compute nodes to the attribute group:
```
cw-nodectl -i n[0-9] join ArsSlurmNodes
```

Verify the compute nodes are moving between ARS states:

```
[admin@head1 ~]$ cw-nodectl -i n[0-9] status --ars
n[0-1] available
n[2-7] provisioning
n[8-9] new_node
```
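
For larger node sets, a per-state node count can be easier to read than the raw ranges. The following sketch tallies nodes per ARS state from the status output. It is illustrative only: `ars_state_counts` is a hypothetical helper, and it assumes the `<name>[<low>-<high>] <state>` output format shown in this example. Comma-separated node specifications are not handled.

```shell
# Count nodes per ARS state from "status --ars" output read on stdin.
# Handles ranges such as "n[0-9] state" and single names such as "n3 state".
ars_state_counts() {
    awk '
    NF >= 2 {
        count = 1
        # Extract "low-high" from inside the brackets, if present.
        if (match($1, /\[[0-9]+-[0-9]+\]/)) {
            range = substr($1, RSTART + 1, RLENGTH - 2)
            split(range, b, "-")
            count = b[2] - b[1] + 1
        }
        total[$NF] += count
    }
    END { for (s in total) print s, total[s] }
    ' | sort
}

# Sample input copied from the status output in this example.
ars_state_counts <<'EOF'
n[0-1] available
n[2-7] provisioning
n[8-9] new_node
EOF
```

With the sample input, this prints `available 2`, `new_node 2`, and `provisioning 6`, one per line, which accounts for all 10 nodes.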