Modify Health Checks

Use the cw-healthctl tool to modify existing health checks. A common first change to the default health checks is to add labels, so that a set of checks can be run on groups of similar nodes.

For example, to add a label to multiple health checks:

cw-healthctl -i check_nvidia,check_gpu_settings update label=gpu
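
When the list of checks is long, it can help to assemble the comma-separated -i argument in a script and inspect the command before running it. The following is a minimal sketch, not part of cw-healthctl: label_checks_cmd is a hypothetical helper that only prints the command, using the same update label=... form shown above.

```shell
#!/bin/sh
# Sketch: print (do not run) one cw-healthctl command that labels several checks.
# label_checks_cmd is a hypothetical helper, not something cw-healthctl provides.
label_checks_cmd() {
  label="$1"; shift
  # Join the remaining arguments with commas for the -i option.
  checks=$(IFS=,; printf '%s' "$*")
  printf 'cw-healthctl -i %s update label=%s\n' "$checks" "$label"
}

# Prints the same command as the example above.
label_checks_cmd gpu check_nvidia check_gpu_settings
```

Printing the command first lets you confirm the check names before changing anything on the cluster.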

You can also update individual fields of a health check, for example to shorten the interval so that the check runs more often.

  1. Review the current interval for the health check:

    [admin@head0]$ cw-healthctl -i check_porterror ls -l
    Health Checks
      check_porterror
        command: check_porterror.py
        fail_percentage: 15
        fail_streak: 2
        interval: 300
        labels: [infiniband]
    
  2. Update the interval for the check:

    cw-healthctl -i check_porterror up interval=120
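
The two steps above can be combined in a small script that reads the current interval and only proposes an update when the value differs. This is a sketch under assumptions: get_interval is a hypothetical helper, the sample text is copied from the listing in step 1 and may not match the output on your system, and the script prints the update command instead of running it.

```shell
#!/bin/sh
# get_interval is a hypothetical helper: it reads `ls -l` style output on
# stdin and prints the value of the first "interval:" field it finds.
get_interval() {
  awk '$1 == "interval:" { print $2; exit }'
}

# Sample text copied from step 1. On a real system you would pipe the output
# of `cw-healthctl -i check_porterror ls -l` into get_interval instead.
sample='Health Checks
  check_porterror
    command: check_porterror.py
    fail_percentage: 15
    fail_streak: 2
    interval: 300
    labels: [infiniband]'

current=$(printf '%s\n' "$sample" | get_interval)
target=120

if [ "$current" -ne "$target" ]; then
  # Print the update command from step 2 rather than executing it.
  echo "cw-healthctl -i check_porterror up interval=$target"
fi
```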
    

You can also modify a health check using a content file. For example:

cw-healthctl -i check_porterror up --content=@checkporterror.yaml

If you are using ARS, use the cw-remedyctl tool to change the remedies associated with the health check.