Add Health Checks to Nodes
Health checks run on compute nodes that have either the health check name or a health check label added to the node. If you are using a KubeVirt provider or virtual machines for compute nodes, you can use health checks to detect health issues on the associated bare-metal hardware.
Use the _ars_checks reserved attribute to add health checks to a node. Use a percent sign (%) prefix when adding a health check label. The prefix is not needed when adding a health check by name. For example:
To add a single health check to a node:
cw-nodectl -i n0 set _ars_checks=PciDeviceInventory
To add a single health check and a label to a node, where PciDeviceInventory is the health check name and default is the label:
cw-nodectl -i n1 set _ars_checks=PciDeviceInventory,%default
To add a health check label to multiple nodes:
cw-nodectl -i n9-14 set _ars_checks=%gpu
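The prefix rule above can be wrapped in a small shell helper. The following is a minimal sketch, not part of the product tooling: ars_checks_value is a hypothetical function name, and the script only prints the cw-nodectl command so that it can be reviewed before it is run.

```shell
#!/bin/sh
# Hypothetical helper (not part of cw-nodectl): build the value for the
# _ars_checks attribute from health check names and labels.
# Usage: ars_checks_value "<space-separated names>" "<space-separated labels>"
ars_checks_value() {
    value=""
    for name in $1; do
        # Names are added as-is.
        value="${value:+$value,}$name"
    done
    for label in $2; do
        # Labels carry the percent sign prefix.
        value="${value:+$value,}%$label"
    done
    printf '%s\n' "$value"
}

# Print the command instead of running it, so it can be reviewed first.
echo "cw-nodectl -i n1 set _ars_checks=$(ars_checks_value "PciDeviceInventory" "default")"
```

Keeping names and labels in separate arguments means the percent sign is added in one place only, which avoids forgetting it on a label.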
Tip
When assigning health checks to compute nodes, make sure that each health check is appropriate for that node to avoid false failures. For example, if you add the Slurmd health check to a Kubernetes node, the check fails because it cannot detect that slurmd is running. The node eventually ends up in the Work Queue because it cannot be auto-remediated.
When you add a new health check to a node, all health checks assigned to the node run once and then run at their scheduled intervals. The same happens if a health check is added to a label assigned to a node, or if a health check's interval changes and the health check schedule on the node updates. To avoid overloading the compute node during Provisioning, the health checks start after random delays of 0 to 10 seconds.
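To illustrate the idea of a random start delay, the sketch below picks a delay between 0 and 10 seconds in plain shell. It is an illustration only and makes no claim about how the product implements the delay; random_delay is a hypothetical function name, and the script assumes /dev/urandom is available.

```shell
#!/bin/sh
# Illustration only: pick a random start delay between 0 and 10 seconds.
random_delay() {
    # Read two random bytes as one unsigned integer (0 to 65535).
    n=$(od -An -N2 -tu2 /dev/urandom | tr -d ' \n')
    # Reduce to the range 0..10; the small modulo bias is fine for jitter.
    echo $((n % 11))
}

delay=$(random_delay)
echo "A check would start after ${delay}s"
# sleep "$delay"   # uncomment to actually wait before running a check
```

Spreading start times this way keeps several checks from hitting the node in the same instant.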
If you created a custom health check, the health check only runs on compute
nodes that have the custom health check script available in the location specified
in the command field of the health check. See Custom Health Checks for
details.
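Before assigning a custom health check, it can help to confirm on the node that the script is actually present. The following is a minimal sketch under stated assumptions: script_available is a hypothetical function, and the default path is a placeholder that you would replace with the path from the command field of your health check.

```shell
#!/bin/sh
# Hypothetical pre-check: confirm that a custom health check script exists
# and is executable at the path given in the health check's command field.
script_available() {
    [ -f "$1" ] && [ -x "$1" ]
}

# Placeholder path; substitute the path from your health check's command field.
script_path="${1:-/opt/custom-checks/my_check.sh}"

if script_available "$script_path"; then
    echo "OK: $script_path is present and executable"
else
    echo "MISSING: $script_path is not available on $(uname -n)"
fi
```

Run it on each compute node that should execute the custom health check.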
Examples: Update Health Checks on Compute Node
To add more health checks to a node:
Use the cw-nodectl or cw-healthctl tool to view the list of health checks and labels assigned to a node. For example, the node n0 has a single health check assigned:
[admin@head0 ~]$ cw-nodectl -i n0 --fields attributes._ars_checks ls -l
Nodes
  n0
    attributes
      _ars_checks: PciDeviceInventory
Use the cw-healthctl tool to view the list of available health checks:
[admin@head0 ~]$ cw-healthctl ls
Health Checks
  MountAvailability
  NtpSynchronization
  PciDeviceInventory
  SystemdServices
Add another health check to the compute node:
cw-nodectl -i n0 set _ars_checks=PciDeviceInventory,NtpSynchronization
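The command above lists the existing check alongside the new one, which suggests that set replaces the whole _ars_checks value. Assuming that is the case, a small helper can append an entry to the current value without dropping or duplicating anything. This is a sketch only: append_check is a hypothetical function, and the script prints the resulting command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: append a check name or %label to an existing
# comma-separated _ars_checks value, skipping entries already present.
append_check() {
    current=$1
    new=$2
    case ",$current," in
        *",$new,"*) printf '%s\n' "$current" ;;                    # already listed
        *)          printf '%s\n' "${current:+$current,}$new" ;;   # append
    esac
}

current="PciDeviceInventory"   # value currently assigned to the node
updated=$(append_check "$current" "NtpSynchronization")
echo "cw-nodectl -i n0 set _ars_checks=$updated"
```

Wrapping the value in commas before matching keeps a name such as Ntp from matching inside NtpSynchronization.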
To add health checks to a node that has checks assigned via labels:
View the list of health checks and labels assigned to a node:
cw-healthctl list --nodes n1 --show-labels
List all checks with a specific label to make sure the check you want to add isn't already included in the label:
[admin@head0 ~]$ cw-healthctl -i %infiniband ls
Health Checks
  InfiniBandLinkState
  InfiniBandPortErrors
Add the new health check or health check label to the node:
cw-nodectl -i n0 set _ars_checks=%infiniband,MountAvailability,SystemEventLog
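The comparison in the second step can also be scripted. The sketch below checks candidate names against the checks listed for a label; in_label is a hypothetical function, and the label contents are copied by hand from the %infiniband listing above rather than queried from the tool.

```shell
#!/bin/sh
# Hypothetical helper: report whether a check name appears in the list of
# checks that belong to a label.
# Usage: in_label <candidate> <check> [<check> ...]
in_label() {
    candidate=$1
    shift
    for check in "$@"; do
        [ "$check" = "$candidate" ] && return 0
    done
    return 1
}

# Checks listed for the %infiniband label.
label_checks="InfiniBandLinkState InfiniBandPortErrors"

for candidate in InfiniBandLinkState MountAvailability; do
    # $label_checks is intentionally unquoted so that it splits into words.
    if in_label "$candidate" $label_checks; then
        echo "$candidate: already covered by %infiniband"
    else
        echo "$candidate: not in %infiniband, add it by name"
    fi
done
```

Only the names reported as not covered need to be added to _ars_checks next to the label.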