Monitor Node Remediation and Review Work Queue

The ideal node issue remediation scenario is described in Auto Remediation Example. If the applied remedy does not fully resolve the issue, however, the ClusterWareAI™ software moves the compute node to the Work Queue state, where it requires administrator action. To see which nodes are in the Work Queue state, check the ARS Overview Page or ARS State Machine Page of the ClusterWareAI GUI, or run:

cw-nodectl --selector '_ars_state=="work_queue"' ls

Review the journalctl or MQTT logs to determine which remedy was executed. Then attempt the other recommended solutions manually, or run additional tests on the node to identify the root cause of the issue.

For example, to view the log, including which remedy was selected for a compute node, run the following on the parent head node:

journalctl --unit ars-auto-remediation
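On a busy cluster the unit log can be long, so it may help to narrow it to one node. The following is a minimal sketch, not part of ClusterWareAI: `node_log` is a hypothetical helper name, and it assumes each log entry mentions the node as `node <name>`, as in the sample output in this section. The `--since` and `--no-pager` options are standard journalctl options.

```shell
#!/bin/sh
# Hypothetical helper (not a ClusterWareAI command): keep only the log
# lines that mention a given node. Reads log text on stdin, so it works
# with live journalctl output or with a saved log file.
node_log() {
    # -F matches the text literally; -w keeps "n1" from also matching "n10".
    grep -wF -- "node $1"
}

# Example usage on the parent head node (assumes a node named n1):
#   journalctl --unit ars-auto-remediation --since "1 hour ago" --no-pager | node_log n1
```

Reading from stdin keeps the helper independent of where the log text comes from.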

If multiple health checks fail, the ClusterWareAI software evaluates multiple remedies for impact, severity, and confidence. The log shows output similar to the following:

Jan 29 20:08:41 head1 ars-auto-remedi[2659212]: INFO [ars_auto_remediation.condition_detectors]
    [condition_detectors.py:get_failing_checks:217] Following checks are failing
      for node n1: ['check_ping','check_slurmd','check_mount_availability']
Jan 29 20:08:41 head1 ars-auto-remediation[2659212]: INFO [ars_auto_remediation.autoremediation]
    [autoremediation.py:_choose_suggestion:387] Selected remediation suggestion:
      node_reprovision with confidence 0.9 for node n1
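To summarize entries like these, the relevant fields can be pulled out with `sed`. This is a sketch under stated assumptions: the function names `failing_checks` and `selected_remedy` are hypothetical, and the patterns assume that each journal entry arrives on a single line (the sample above is wrapped for display) with the wording shown in the sample. If the message format differs in your release, the patterns need adjusting.

```shell
#!/bin/sh
# Hypothetical helpers (not ClusterWareAI commands). Both read log text on
# stdin and assume one journal entry per line, worded as in the sample.

# Print "<node> <comma-separated failing checks>" for each matching entry.
failing_checks() {
    sed -n 's/.*Following checks are failing *for node \([^:]*\): *\[\(.*\)\].*/\1 \2/p' |
        tr -d "'"    # drop the quotes around each check name
}

# Print "<node> <remedy> <confidence>" for each selected remedy.
selected_remedy() {
    sed -n 's/.*Selected remediation suggestion: *\([^ ]*\) with confidence \([0-9.]*\) for node \([^ ]*\).*/\3 \1 \2/p'
}

# Example usage on the parent head node:
#   journalctl --unit ars-auto-remediation --no-pager | selected_remedy
```

Lines that do not match either pattern are dropped, so the helpers can be run over the full unit log.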

In this case, the ClusterWareAI software reprovisions the node because that remedy has the least impact on the overall cluster and the highest chance of resolving all three issues. If reprovisioning the node does not solve the problem, however, try the other proposed solutions: rebooting the node, restarting Slurm, and mounting filesystems. After attempting the other solutions, move the node to the Provisioning state. See Move Nodes to and from Work Queue for details.
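As a working aid for the manual follow-up, the sketch below prints a suggested next step for each failing check. It only prints text and runs nothing. The function name `suggest_followup` is hypothetical, and the pairing of each check with a step is an assumption drawn from the example above, not documented ClusterWareAI behavior. The `slurmd` service name is likewise an assumption about a typical Slurm compute node.

```shell
#!/bin/sh
# Hypothetical helper (not a ClusterWareAI command): print, but do not run,
# a manual follow-up for each failing check name passed as an argument.
# The check-to-step pairing is an assumption based on the example above.
suggest_followup() {
    for check in "$@"; do
        case "$check" in
            check_slurmd)             echo "$check: systemctl restart slurmd" ;;
            check_mount_availability) echo "$check: mount -a" ;;
            check_ping)               echo "$check: reboot the node" ;;
            *)                        echo "$check: no suggestion, investigate manually" ;;
        esac
    done
}

# Example usage with the checks from the sample log:
#   suggest_followup check_ping check_slurmd check_mount_availability
```

Printing the steps rather than executing them leaves the decision, and the order of attempts, with the administrator.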