Monitor Node Remediation

The ideal node issue remediation scenario is described in Automated Remediation Example. If, however, the automated remediation plan does not fully resolve the issue, the ICE ClusterWare™ software moves the compute node to the Work Queue state.

Review the journalctl logs and determine which remediation plan was executed. Then attempt the other recommended solutions manually, or run additional tests on the node to identify the root cause of the issue.

To view the log entries, including which remediation plan was selected for a node, run the following on the head node where the automated remediation service (ARS) was configured:

journalctl -u ars-auto-remediation
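
On a busy head node this log can be lengthy. The following is a minimal sketch for narrowing the output to recent entries for a single node; the node name n1 and the one-hour window are illustrative, not part of the ARS tooling:

[admin@head1]$ journalctl -u ars-auto-remediation --since "1 hour ago" | grep "node n1"

Adding -f to the journalctl command streams new entries as remediation proceeds.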

For example, if Sensu reports failures for the check_ping, check_slurmd, and check_mount_availability checks, the ClusterWare software evaluates the candidate remediation plans by impact, severity, and confidence. Reviewing the log file shows output similar to the following:

Jan 29 20:08:41 head1 ars-auto-remedi[2659212]: INFO [ars_auto_remediation.condition_detectors]
    [condition_detectors.py:get_failing_checks:217] Following Sensu checks are failing
      for node n1: ['check_ping','check_slurmd','check_mount_availability']
Jan 29 20:08:41 head1 ars-auto-remediation[2659212]: INFO [ars_auto_remediation.autoremediation]
    [autoremediation.py:_choose_suggestion:387] Selected remediation suggestion:
      node_reprovision with confidence 0.9 for node n1

In this case, the ClusterWare software reprovisions the node because that remediation plan has the least impact on the overall cluster and the highest likelihood of resolving all three issues. However, if reprovisioning the node does not solve the problem, a cluster administrator can try the other proposed solutions: rebooting the node, restarting Slurm, and mounting filesystems, as sketched below.
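
The following is a minimal sketch of those manual fallbacks, assuming direct SSH access to the node with root or equivalent privileges, a standard systemd setup, and /etc/fstab-based mounts; your site may instead use a cluster management tool such as scyld-nodectl, and the node name n1 is illustrative:

# Reboot the node
[admin@head1]$ ssh n1 reboot

# Restart the Slurm compute daemon on the node
[admin@head1]$ ssh n1 systemctl restart slurmd

# Mount any filesystems listed in /etc/fstab that are not currently mounted
[admin@head1]$ ssh n1 mount -a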

You can view the list of available checks and their associated remediation plans by running the following on the head node where the automated remediation service was configured:

[admin@head1]$ /opt/scyld/clusterware-ars/env/bin/python3.12 -m ars_auto_remediation.explorer print --format table

 Check                       Severity  Impact  Suggested Plans          Confidence
 check_ping                  1.0       1.0     node_reprovision            0.9
 check_slurmd                0.2       0.1     restart_slurmd              0.85
 check_mount_availability    0.2       0.1     mount_filesystems           0.8
                                               node_reboot                 0.9
 check_configuration         0.5       0.4     run_monolithic_playbook     0.95
 check_system_errors         0.8       0.7     no_op_plan                  0.5
 check_fs_capacity           0.6       0.5     no_op_plan                  0.5
 check_systemd_failed_units  0.5       0.4     no_op_plan                  0.5
 check_ntp                   0.4       0.3     no_op_plan                  0.5
 check_zombie                0.3       0.2     no_op_plan                  0.5
 check_ethlink               0.7       0.6     no_op_plan                  0.5

Note

The health checks and suggested plans shown in this example include the default checks configured by the ClusterWare software as well as some additional checks. See Default Node Health Checks for details.
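
To focus on a single check, the tabular output can be filtered with standard shell tools; a minimal sketch, using a check name from the example above:

[admin@head1]$ /opt/scyld/clusterware-ars/env/bin/python3.12 -m ars_auto_remediation.explorer print --format table | grep check_slurmd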

You can also use an interactive tool to see which solutions are selected when multiple issues occur, by running:

[admin@head1]$ /opt/scyld/clusterware-ars/env/bin/python3.12 -m ars_auto_remediation.explorer test

Checks                               Sorted Suggestions
[X] check_ping                       node_reprovision
[X] check_slurmd                     node_reboot
[X] check_mounts                     restart_slurmd
[ ] check_configuration              mount_filesystems
[ ] check_fs_space
[ ] check_systemd_failed_units
[ ] check_ntp
[ ] check_zombie
[ ] check_ethlink
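
In this example, with check_ping, check_slurmd, and check_mounts selected, node_reprovision is listed first, matching the plan chosen in the log example above; the remaining suggestions are the alternatives an administrator can try manually if reprovisioning does not resolve the issue.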