Add Health Checks to Nodes

Health checks run on compute nodes that have either the health check name or a health check label assigned to the node. If you use a KubeVirt provider or virtual machines for compute nodes, you can use health checks to detect health issues on the associated bare metal hardware.

Use the _ars_checks reserved attribute to add health checks to a node. Prefix a health check label with a percent sign (%); a health check name takes no prefix. Separate multiple entries with commas. For example:

  • To add a single health check to a node:

    cw-nodectl -i n0 set _ars_checks=PciDeviceInventory
    
  • To add a single health check and a label to a node, where PciDeviceInventory is the health check name and default is the label:

    cw-nodectl -i n1 set _ars_checks=PciDeviceInventory,%default
    
  • To add a health check label to multiple nodes:

    cw-nodectl -i n9-14 set _ars_checks=%gpu
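
The naming rule above (plain names, %-prefixed labels, comma-separated) can be sketched as a small shell helper. This is only an illustration: build_ars_checks is a hypothetical function, not part of cw-nodectl, and the script prints the command for review instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: builds an _ars_checks value from check names and labels.
# Only the "Name,%label" syntax is taken from the examples above.
build_ars_checks() {
  names="$1"    # space-separated health check names
  labels="$2"   # space-separated labels, written without the % prefix
  out=""
  for item in $names; do
    out="${out:+$out,}$item"      # names are added as-is
  done
  for item in $labels; do
    out="${out:+$out,}%$item"     # labels get the % prefix
  done
  printf '%s\n' "$out"
}

# Print the command rather than executing it, so it can be checked first.
value="$(build_ars_checks "PciDeviceInventory" "default")"
echo "cw-nodectl -i n1 set _ars_checks=$value"
```

With the inputs shown, the printed command matches the second example in the list above.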
    

Tip

When assigning health checks to compute nodes, make sure that each health check is appropriate for the node to avoid false failures. For example, if you add the Slurmd health check to a Kubernetes node, the check fails because it cannot detect that slurmd is running. Because the failure cannot be automatically remediated, the node eventually ends up in the Work Queue.

When you add a new health check to a node, all health checks assigned to the node run once and then run at their scheduled intervals. The same happens when a health check is added to a label assigned to the node, or when a health check's interval changes and the health check schedule on the node updates. To avoid overloading the compute node during provisioning, the health checks start after random delays of 0 to 10 seconds.
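
The idea of a random start delay can be illustrated with a short shell sketch. This is not the product's implementation; random_delay is a hypothetical function, and it assumes a shell that provides the RANDOM variable (for example, bash).

```shell
#!/bin/bash
# Illustration only: pick a whole-second delay in the range 0 to 10 inclusive.
random_delay() {
  echo $(( RANDOM % 11 ))   # RANDOM % 11 yields a value from 0 through 10
}

delay="$(random_delay)"
echo "health check would start after ${delay}s"
# A real scheduler would now wait, for example: sleep "$delay"
```

Spreading the start times this way keeps several checks from starting at the same instant on a node that is still being provisioned.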

A custom health check runs only on compute nodes that have the custom health check script available at the location specified in the command field of the health check. See Custom Health Checks for details.

Examples: Update Health Checks on Compute Node

To add more health checks to a node:

  1. Use the cw-nodectl or cw-healthctl tool to view the list of health checks and labels assigned to a node. For example, the node n0 has a single health check assigned:

    [admin@head0 ~]$ cw-nodectl -i n0 --fields attributes._ars_checks ls -l
    Nodes
      n0
       attributes
         _ars_checks: PciDeviceInventory
    
  2. Use the cw-healthctl tool to view the list of available health checks:

    [admin@head0 ~]$ cw-healthctl ls
    Health Checks
      MountAvailability
      NtpSynchronization
      PciDeviceInventory
      SystemdServices
    
  3. Add another health check to the compute node. The example lists the existing check together with the new one:

    cw-nodectl -i n0 set _ars_checks=PciDeviceInventory,NtpSynchronization
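
The steps above can be sketched as a shell helper that appends a check to the current comma-separated value only if it is not already listed. add_check is a hypothetical function; the current value is typed in by hand from the step 1 output, and the command is printed rather than executed.

```shell
#!/bin/sh
# Hypothetical helper: append a check or %label to an _ars_checks value
# unless that exact entry is already present.
add_check() {
  current="$1"   # existing comma-separated value (may be empty)
  new="$2"       # check name or %label to add
  case ",$current," in
    *",$new,"*) printf '%s\n' "$current" ;;            # already listed: unchanged
    *) printf '%s\n' "${current:+$current,}$new" ;;    # otherwise append
  esac
}

current="PciDeviceInventory"   # copied from the cw-nodectl listing for n0
updated="$(add_check "$current" "NtpSynchronization")"
echo "cw-nodectl -i n0 set _ars_checks=$updated"
```

Wrapping both strings in commas before matching avoids false hits on partial names, so a shorter name never matches inside a longer one.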
    

To add health checks to a node that has checks assigned via labels:

  1. View the list of health checks and labels assigned to a node:

    cw-healthctl list --nodes n1 --show-labels
    
  2. List all checks with a specific label to make sure the check you want to add isn't already included in the label:

    [admin@head0 ~]$ cw-healthctl -i %infiniband ls
    Health Checks
      InfiniBandLinkState
      InfiniBandPortErrors
    
  3. Add the new health check or health check label to the node:

    cw-nodectl -i n1 set _ars_checks=%infiniband,MountAvailability,SystemEventLog
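
The label check in step 2 can be sketched as a shell function that scans a listing for a given check name. listing_has_check is a hypothetical function, and the sample text is copied from the %infiniband output shown above rather than read from a live cw-healthctl call.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named check appears on its own line
# in a "Health Checks" listing read from standard input.
listing_has_check() {
  wanted="$1"
  while read -r name rest; do
    # read trims leading whitespace; the two-word header line never matches
    if [ "$name" = "$wanted" ] && [ -z "$rest" ]; then
      return 0
    fi
  done
  return 1
}

# Sample listing, copied from the %infiniband example output.
sample_listing='Health Checks
  InfiniBandLinkState
  InfiniBandPortErrors'

if printf '%s\n' "$sample_listing" | listing_has_check "MountAvailability"; then
  echo "MountAvailability is already included in the label"
else
  echo "MountAvailability is not in the label; add it by name"
fi
```

In practice the listing would come from the command in step 2 piped into the function, assuming its output keeps the format shown there.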