Auto Remediation Service (ARS)

The ClusterWareAI™ software uses the Auto Remediation Service (ARS) to apply automated remedies to common cluster node problems, such as GPU errors, out-of-memory errors, or mount issues. ARS integrates with the ClusterWareAI Health Monitoring System (CHMS), which includes default health checks to identify common node issues. ARS identifies the most statistically advisable solution based on the issue or issues reported by the health checks and attempts to resolve the problem without human intervention. Depending on the issue, the remedy can be as simple as running a command or rebooting the node. In many cases the issue is fully resolved after the remedy is applied, which reduces overhead for the cluster administrator and improves hardware uptime and cluster performance. If the automated remedy cannot solve the node's problem, for example a hardware failure that requires replacement, the node is moved into a Work Queue for the cluster administrator's attention.
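The flow described above (health checks report issues, a remedy is attempted, and unresolved nodes go to a Work Queue) can be illustrated with a minimal conceptual sketch. This is not the ARS implementation or any ClusterWareAI API: `Node`, `remediate`, the issue names, and the remedy functions are all hypothetical, and a plain lookup table stands in for the way ARS actually selects a remedy.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical model for illustration only; not a ClusterWareAI API.
@dataclass
class Node:
    name: str
    issues: List[str] = field(default_factory=list)  # issues reported by health checks


def remediate(node: Node,
              remedies: Dict[str, Callable[[Node], bool]],
              work_queue: List[str]) -> bool:
    """Try one remedy per reported issue; queue the node if any issue remains."""
    unresolved = []
    for issue in node.issues:
        remedy = remedies.get(issue)
        # A remedy returns True when the issue is resolved after it runs.
        if remedy is None or not remedy(node):
            unresolved.append(issue)
    node.issues = unresolved
    if unresolved:
        work_queue.append(node.name)  # administrator attention required
        return False
    return True


# Hypothetical remedies: a lookup table stands in for ARS's remedy selection.
remedies = {
    "mount_issue": lambda node: True,   # e.g. a remount command succeeds
    "gpu_error": lambda node: False,    # e.g. a hardware fault a reboot cannot fix
}

work_queue: List[str] = []
healthy = Node("n0", ["mount_issue"])
broken = Node("n1", ["gpu_error"])

print(remediate(healthy, remedies, work_queue))  # True
print(remediate(broken, remedies, work_queue))   # False
print(work_queue)                                # ['n1']
```

In this sketch, a node whose issues are all resolved returns to service without operator involvement, while a node with any remaining issue is recorded for manual follow-up, mirroring the Work Queue behavior described above.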

ARS is installed by default with the ClusterWareAI software, but requires some manual configuration to fully enable automated node health tracking and remediation. After initial configuration, all compute nodes with the correct attributes are monitored by ARS. See Configure ClusterWareAI Health Monitoring System and Auto Remediation Service for details.

Note

Enabling automated remediation on infrastructure nodes, such as a cluster login node, is not recommended. Some of the automated remediation solutions can take the node offline and, if enabled on an infrastructure node, could interrupt cluster availability. You can still run health checks on infrastructure nodes and monitor logging to identify and manually resolve issues. See Health Check Logging for details.

Use ARS for visibility into real-time cluster operational status as well as historical logs of automated actions. Charts showing recent ARS trends and events are available in the ClusterWareAI GUI on the ARS Overview Page with additional details on other pages.

ARS integrates with the Slurm and Kubernetes workload schedulers.