Monitor Node Remediation
The ideal node issue remediation scenario is described in Automated Remediation Example. If, however, the automated remediation plan does not fully resolve the issue, the ICE ClusterWare™ software moves the compute node to the Work Queue state.
Review the journalctl logs and determine which remediation plan executed. Then, attempt the other recommended solutions manually or run additional tests on the node to identify the root cause of the issue.
View the log files, including which remediation plan was selected for a node, by running the following on the head node where the automated remediation service (ARS) is configured:
journalctl -u ars-auto-remediation
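To narrow the output, you can combine standard journalctl options with grep. For example, to see recent messages that mention a particular node (the "for node n1" text matches the message format shown in the example below and may vary between releases):
journalctl -u ars-auto-remediation --since "1 hour ago" | grep "for node n1"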
For example, if Sensu reports that the check_ping, check_slurmd, and check_mount_availability checks are failing on a node, the ClusterWare software evaluates multiple remediation plans for impact, severity, and confidence. Reviewing the log file shows output similar to the following:
Jan 29 20:08:41 head1 ars-auto-remedi[2659212]: INFO [ars_auto_remediation.condition_detectors]
[condition_detectors.py:get_failing_checks:217] Following Sensu checks are failing
for node n1: ['check_ping','check_slurmd','check_mount_availability']
Jan 29 20:08:41 head1 ars-auto-remediation[2659212]: INFO [ars_auto_remediation.autoremediation]
[autoremediation.py:_choose_suggestion:387] Selected remediation suggestion:
node_reprovision with confidence 0.9 for node n1
In this case, the ClusterWare software reprovisions the node because that remediation plan has the least impact on the overall cluster and the highest chance of resolving all three issues. However, if reprovisioning the node does not solve the problem, a cluster administrator can try the other proposed solutions: rebooting the node, restarting Slurm, and mounting the filesystems.
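If you need to perform those steps by hand, the following is a minimal sketch, assuming SSH access from the head node and that the node name n1, the slurmd service name, and the /etc/fstab entries match your environment:
ssh n1 'systemctl restart slurmd'   # restart the Slurm node daemon
ssh n1 'mount -a'                   # mount any /etc/fstab filesystems that are not currently mounted
ssh n1 'systemctl reboot'           # reboot the node if the lighter-weight steps do not help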
You can view the list of available checks and their associated remediation plans by running the following on the head node where the automated remediation service is configured:
[admin@head1]$ /opt/scyld/clusterware-ars/env/bin/python3.12 -m ars_auto_remediation.explorer print --format table
Check                       Severity  Impact  Suggested Plans          Confidence
check_ping                  1.0       1.0     node_reprovision         0.9
check_slurmd                0.2       0.1     restart_slurmd           0.85
check_mount_availability    0.2       0.1     mount_filesystems        0.8
                                              node_reboot              0.9
check_configuration         0.5       0.4     run_monolithic_playbook  0.95
check_system_errors         0.8       0.7     no_op_plan               0.5
check_fs_capacity           0.6       0.5     no_op_plan               0.5
check_systemd_failed_units  0.5       0.4     no_op_plan               0.5
check_ntp                   0.4       0.3     no_op_plan               0.5
check_zombie                0.3       0.2     no_op_plan               0.5
check_ethlink               0.7       0.6     no_op_plan               0.5
Note
The health checks and suggested plans shown in this example include the default checks configured by the ClusterWare software and some additional checks. See Default Node Health Checks for details.
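To look up the plans associated with a single check, you can filter the same table output with standard tools, for example:
[admin@head1]$ /opt/scyld/clusterware-ars/env/bin/python3.12 -m ars_auto_remediation.explorer print --format table | grep check_slurmd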
You can also use an interactive tool to see the selected solutions for multiple failing checks by running:
[admin@head1]$ /opt/scyld/clusterware-ars/env/bin/python3.12 -m ars_auto_remediation.explorer test
Checks                          Sorted Suggestions
[X] check_ping                  node_reprovision
[X] check_slurmd                node_reboot
[X] check_mounts                restart_slurmd
[ ] check_configuration        mount_filesystems
[ ] check_fs_space
[ ] check_systemd_failed_units
[ ] check_ntp
[ ] check_zombie
[ ] check_ethlink
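After applying a fix manually, you can approximate some of the checks listed above with standard tools before the monitoring checks run again. This is a rough sketch, assuming the node name n1 from the example and SSH access from the head node; the commands only approximate what the corresponding checks measure:
ping -c 3 n1                 # connectivity (check_ping)
ssh n1 'systemctl --failed'  # failed systemd units (check_systemd_failed_units)
ssh n1 'df -h'               # filesystem capacity (check_fs_capacity)
ssh n1 'timedatectl'         # time synchronization (check_ntp)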