ClusterWareAI Health Monitoring System (CHMS)

Cluster administrators routinely need to identify compute nodes that are out of compliance and take action to resolve the underlying issues. The ClusterWareAI™ software includes a set of default health checks as part of the ClusterWareAI Health Monitoring System (CHMS). These checks monitor compute nodes for common issues such as system errors, zombie tasks, and network connectivity problems, and alert you so the issues can be resolved quickly. You can also create health checks specific to your cluster configuration and add them to the ClusterWareAI software. In addition to running health checks, you can enable the auto remediation service (ARS) and the remediation state machine (RSM). When all three are used together, CHMS identifies compute nodes with health issues, RSM drains those nodes and removes them from active use, ARS attempts to fix the issues automatically, and RSM then reprovisions the nodes and returns them to the active cluster.
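
The division of work between CHMS, RSM, and ARS can be pictured as a small state machine. The following Python sketch is purely illustrative and is not ClusterWareAI code: the state names, event names, and the `step` and `run_cycle` helpers are all hypothetical, chosen only to mirror the order of operations described above. It also assumes the remediation attempt succeeds; a failed attempt is not modeled.

```python
from enum import Enum, auto


class NodeState(Enum):
    """Hypothetical node states; not the names used by ClusterWareAI."""
    ACTIVE = auto()      # node is serving work in the cluster
    FLAGGED = auto()     # a health check reported a problem
    DRAINED = auto()     # node has been removed from active use
    REMEDIATED = auto()  # an automatic fix has been applied


# Hypothetical transition table: (current state, event) -> next state.
# The comment on each row names the component the text says is responsible.
TRANSITIONS = {
    (NodeState.ACTIVE, "health_check_failed"): NodeState.FLAGGED,  # CHMS
    (NodeState.FLAGGED, "drain"): NodeState.DRAINED,               # RSM
    (NodeState.DRAINED, "remediate"): NodeState.REMEDIATED,        # ARS
    (NodeState.REMEDIATED, "reprovision"): NodeState.ACTIVE,       # RSM
}


def step(state, event):
    """Return the next state, or raise ValueError for an invalid event."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not valid in state {state.name}"
        ) from None


def run_cycle(start=NodeState.ACTIVE):
    """Walk one full detect, drain, remediate, reprovision cycle."""
    events = ["health_check_failed", "drain", "remediate", "reprovision"]
    path = [start]
    for event in events:
        path.append(step(path[-1], event))
    return path


if __name__ == "__main__":
    print(" -> ".join(state.name for state in run_cycle()))
```

Keeping the transitions in a single table makes the ordering explicit: in this sketch a node cannot be remediated before it is drained, which reflects the sequence described in the paragraph above.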