Modify Health Checks
Use the cw-healthctl tool to modify existing health checks. A common initial
modification to the default health checks is to add labels so a set of checks
can be run on groups of similar nodes.
For example, to add a label to multiple health checks:
cw-healthctl -i check_nvidia,check_gpu_settings update label=gpu
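When the same label has to go onto a longer list of checks, it can help to assemble the comma-separated `-i` argument in a script. The following is a minimal sketch, not part of the tool: `build_label_cmd` is a hypothetical helper that only prints the command in the form shown above, so it can be reviewed before it is run.

```shell
#!/bin/sh
# Hypothetical helper (not provided by cw-healthctl): print the update
# command for one label and any number of health check names.
build_label_cmd() {
    label=$1
    shift
    # Join the remaining arguments with commas; "$*" uses the first
    # character of IFS as the separator. The subshell keeps IFS local.
    checks=$(IFS=,; printf '%s' "$*")
    printf 'cw-healthctl -i %s update label=%s\n' "$checks" "$label"
}

# Dry run: prints the command instead of executing it.
build_label_cmd gpu check_nvidia check_gpu_settings
```

Printing rather than executing keeps the sketch safe to try on any machine; once the output looks right, the printed line can be pasted into a shell on the head node.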
You may also want to update individual fields of a particular health check, for example to run the check more often.
Review the current interval for the health check:
[admin@head0]$ cw-healthctl -i check_porterror ls -l
Health Checks
  check_porterror
    command: check_porterror.py
    fail_percentage: 15
    fail_streak: 2
    interval: 300
    labels: [infiniband]
Update the interval for the check:
cw-healthctl -i check_porterror up interval=120
You can also modify a health check using a content file. For example:
cw-healthctl -i check_porterror up --content=@checkporterror.yaml
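The format of the content file is not shown here. As a sketch only, assuming the file is YAML and accepts the same field names that `ls -l` prints for the check (`interval`, `fail_streak`, `fail_percentage`), a file such as `checkporterror.yaml` might be created like this. Treat the field layout as an assumption and confirm it against your installation before applying it.

```shell
#!/bin/sh
# Assumption: the content file uses the field names shown by `ls -l`.
# The values below are illustrative, not recommended settings.
cat > checkporterror.yaml <<'EOF'
interval: 120
fail_streak: 2
fail_percentage: 15
EOF

# Review the file before applying it with:
#   cw-healthctl -i check_porterror up --content=@checkporterror.yaml
cat checkporterror.yaml
```

Keeping the settings in a file makes the change easy to review and to keep under version control alongside other cluster configuration.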
If you are using ARS, use the cw-remedyctl tool to change the remedies associated with the health check.