Auto Remediation Example

A Slurm-based ClusterWareAI™ compute node is configured to run some of the default health checks and is currently in the Available state of the ARS State Map. The InternetConnectivity health check fails during a scheduled run, and the following automated process begins:

1. The compute node sends the health check failure information to its parent head node via a REST API.
2. The InternetConnectivity health check uses the default flap threshold of 2. After the first failure, the head node takes no action and waits to see whether the check fails again. After the second failure, the head node instructs Slurm to drain the node as soon as the active job finishes.
3. The node completes the job and moves through the Draining and Drained states.
4. After the node is fully drained, it moves to the Auto Remediation state, where the ClusterWareAI software reviews the health check failure and evaluates potential remedies. In this case, the selected remedy is to restart the node.
5. The ClusterWareAI software restarts the node and moves it to the Provisioning state to be re-tested.
6. The health checks pass, and the node returns to the Available state, where it can pick up a new job.
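
The sequence above can be read as a small state machine. The following Python sketch is an illustration only, not ClusterWareAI code: the state names and the flap threshold of 2 come from the example, while the `Node` class and the `report_failure`, `finish_job`, and `remediate` functions are hypothetical names invented for this sketch. It also assumes that failures are simply counted per health check, which the example does not specify.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    # State names taken from the example above.
    AVAILABLE = "Available"
    DRAINING = "Draining"
    DRAINED = "Drained"
    AUTO_REMEDIATION = "Auto Remediation"
    PROVISIONING = "Provisioning"


FLAP_THRESHOLD = 2  # default value cited in the example


@dataclass
class Node:
    # Hypothetical model of a compute node, for illustration only.
    name: str
    state: State = State.AVAILABLE
    failures: dict = field(default_factory=dict)  # check name -> failure count
    history: list = field(default_factory=list)   # states entered, in order

    def move_to(self, new_state):
        self.state = new_state
        self.history.append(new_state)


def report_failure(node, check):
    """Record one failed run; begin draining once the threshold is reached."""
    node.failures[check] = node.failures.get(check, 0) + 1
    if node.state is State.AVAILABLE and node.failures[check] >= FLAP_THRESHOLD:
        node.move_to(State.DRAINING)
        return True
    return False  # below the threshold: wait for another failure


def finish_job(node):
    """The active job ends, so a draining node becomes drained, then remediated."""
    if node.state is State.DRAINING:
        node.move_to(State.DRAINED)
        node.move_to(State.AUTO_REMEDIATION)


def remediate(node, checks_pass_after_restart=True):
    """Restart the node (the remedy chosen in the example) and re-test it."""
    if node.state is not State.AUTO_REMEDIATION:
        return
    node.move_to(State.PROVISIONING)
    if checks_pass_after_restart:
        node.failures.clear()
        node.move_to(State.AVAILABLE)
    # The example does not describe what happens if the re-test fails,
    # so this sketch leaves the node in Provisioning in that case.


node = Node("n0")
report_failure(node, "InternetConnectivity")  # first failure: no action yet
report_failure(node, "InternetConnectivity")  # second failure: drain
finish_job(node)
remediate(node)
print([s.value for s in node.history])
# ['Draining', 'Drained', 'Auto Remediation', 'Provisioning', 'Available']
```

In this sketch a single failure leaves the node in the Available state, which mirrors the purpose of the flap threshold: one isolated failure does not take a node out of service.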

Only a single error was detected in this simple example. If more than one health check fails, ClusterWareAI considers multiple remedies. It evaluates each remedy for the severity of the issue, the impact on cluster operation, and the confidence that it will resolve all of the node's health problems. See Monitor Node Remediation and Review Work Queue to learn more.
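
To make the three criteria concrete, here is a minimal Python sketch of one way candidate remedies could be compared. It is a toy model built on assumptions, not the ClusterWareAI algorithm: the `Remedy` class, the `pick_remedy` function, the scoring formula, the numeric values, the `SecondCheck` health check, and the "restart network service" remedy are all hypothetical. Only the InternetConnectivity check and the restart-node remedy appear in the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Remedy:
    # Hypothetical description of a candidate remedy.
    name: str
    resolves: frozenset  # health checks this remedy is expected to fix
    impact: float        # 0 = no disruption to the cluster, 1 = highly disruptive
    confidence: float    # 0..1 estimate that the remedy works


def pick_remedy(failed, severity, remedies):
    """Return the best-scoring remedy, or None if there are no candidates."""
    total = sum(severity[check] for check in failed)

    def score(remedy):
        # Severity-weighted share of the failed checks this remedy addresses.
        addressed = sum(severity[check] for check in failed & remedy.resolves)
        coverage = addressed / total if total else 0.0
        # Assumed formula: favor high coverage and confidence, low impact.
        return coverage * remedy.confidence * (1.0 - remedy.impact)

    return max(remedies, key=score, default=None)


failed = frozenset({"InternetConnectivity", "SecondCheck"})
severity = {"InternetConnectivity": 2, "SecondCheck": 1}  # assumed weights
candidates = [
    Remedy("restart network service", frozenset({"InternetConnectivity"}),
           impact=0.1, confidence=0.6),
    Remedy("restart node", frozenset({"InternetConnectivity", "SecondCheck"}),
           impact=0.3, confidence=0.8),
]
best = pick_remedy(failed, severity, candidates)
print(best.name)  # restart node
```

With these made-up numbers, restarting the node wins even though it is more disruptive, because it is the only candidate that addresses every failed check.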