Create Health Checks

The ClusterWareAI™ software ships with a set of default health checks, but you may want additional health checks that run at different intervals or have different labels. To create a health check, use the cw-healthctl tool, the checks API, or the ARS Policy Page in the ClusterWareAI GUI.

You can also create your own health checks with custom health check scripts. See Custom Health Checks for details.

Health Check Fields

Health checks have required and optional fields. Some of the optional fields have default values that are used if you do not set an explicit value when creating the health check.

| Field | Description | Required | Default |
| --- | --- | --- | --- |
| Name | Name of the health check. Names must start with an alphabetic character, not a number. | Yes | N/A |
| Description | Text description for the health check. | No | N/A |
| Command | Health check script and required arguments. | Yes | N/A |
| Interval | How frequently the health check runs (in seconds). | No | 60 |
| Timeout | Time (in seconds) that the script is allowed to run before it times out and emits an error. | No | 600 |
| Labels | Name for a group of checks that should run together on sets of nodes. | No | N/A |
| Fail Streak | Number of consecutive failures allowed before the failure is logged. | No | 2 |
| Fail Percentage | Percentage of the previous checks that can fail before the failure is logged. | No | 15 |

To change the default field values, use cw-healthctl defaults. For example, to update the default timeout:

cw-healthctl defaults timeout=30
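The field rules above can be modeled in a few lines of Python, which may help when preparing check definitions in a script before passing them to cw-healthctl. This is a minimal sketch, not part of the ClusterWareAI software: the HealthCheckSpec class is hypothetical, the default values are copied from the field table, and the name rule is approximated with Python's str.isalpha().

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HealthCheckSpec:
    """Hypothetical local model of a health check (not a ClusterWareAI API)."""

    # Required fields.
    name: str
    command: str
    # Optional fields; defaults are taken from the field table.
    description: Optional[str] = None
    interval: int = 60          # seconds between runs
    timeout: int = 600          # seconds before the script times out
    labels: List[str] = field(default_factory=list)
    fail_streak: int = 2
    fail_percentage: int = 15

    def __post_init__(self) -> None:
        # Names must start with an alphabetic character, not a number.
        if not self.name or not self.name[0].isalpha():
            raise ValueError(f"invalid health check name: {self.name!r}")
        # Command is a required field, so reject an empty value.
        if not self.command:
            raise ValueError("command is required")
```

For example, HealthCheckSpec(name="NTPSyncLong", command="check_ntp.py --max-offset 10") picks up an interval of 60 and a timeout of 600, while a name such as "9check" raises ValueError.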

Create Health Check with Command

Use the cw-healthctl tool to create a health check.

The following example command creates a health check that references the default NTPSync health check script, but uses a longer clock offset threshold:

cw-healthctl create name=NTPSyncLong command="check_ntp.py --max-offset 10"

Because this new health check does not have any labels, it needs to be manually added to compute nodes.

By contrast, the following example command creates a health check that references a custom health check script, overrides some default field values, and assigns two labels:

cw-healthctl create name=SampleCheckGPU command=sample-check-gpu.py \
  interval=120 timeout=10 labels=nvidia,gpu fail_streak=5 fail_percentage=10

If any nodes have the nvidia or gpu label in the _ars_checks reserved attribute, the new health check is added to those nodes the next time they sync and then runs every 120 seconds. See Custom Health Checks for additional details about creating custom health checks and scripts.
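The label behavior described above can be illustrated with a short Python sketch. It assumes that a check applies to a node when the two share at least one label, which is how the "nvidia or gpu" wording reads; the function name and the node names are hypothetical, and the real matching is performed by the ClusterWareAI software itself.

```python
from typing import Dict, Iterable, List


def nodes_receiving_check(
    check_labels: Iterable[str],
    node_labels: Dict[str, List[str]],
) -> List[str]:
    """Return the nodes that share at least one label with the check.

    Assumption: any single shared label is enough for the check to apply.
    A check without labels matches no nodes and must be added manually.
    """
    wanted = set(check_labels)
    return sorted(
        node for node, labels in node_labels.items() if wanted & set(labels)
    )


# Hypothetical nodes and the labels held in their _ars_checks attribute.
example_nodes = {
    "n0": ["nvidia"],
    "n1": ["cpu"],
    "n2": ["gpu", "base"],
}
matched = nodes_receiving_check(["nvidia", "gpu"], example_nodes)
```

Under that assumption, matched contains n0 and n2, and a check with no labels matches nothing.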

Create Health Check with Clone

If you want to create a health check that references an existing health check script, but has some differences from existing checks, you can clone the existing check and make updates. For example, you may want to reference different mount locations for NFS directories based on compute node type or update default script command arguments.

  1. View details about the existing health check:

    [admin@head0 ~]$ cw-healthctl -i NTPSync ls -l
    Health Checks
      NTPSync
        command: "check_ntp.py --max-offset 1"
        fail_percentage: 15
        fail_streak: 2
        interval: 60
        labels: [default, base]
    
  2. Clone the existing health check and update the command to extend the --max-offset:

    cw-healthctl -i NTPSync clone name=NTPSyncLong command="check_ntp.py --max-offset 10"
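Conceptually, a clone copies the fields of the existing check and then applies the values given on the command line. The following Python sketch illustrates that copy-then-override idea only; clone_check is a hypothetical helper, and it assumes that fields you do not override (such as labels) carry over unchanged.

```python
from copy import deepcopy
from typing import Any, Dict


def clone_check(existing: Dict[str, Any], **overrides: Any) -> Dict[str, Any]:
    """Copy an existing check definition and apply field overrides.

    Assumption: a clone keeps every field that is not overridden.
    """
    if "name" not in overrides:
        raise ValueError("a cloned health check needs its own name")
    cloned = deepcopy(existing)   # leave the original definition untouched
    cloned.update(overrides)
    return cloned


# Field values taken from the NTPSync listing in step 1.
ntp_sync = {
    "name": "NTPSync",
    "command": "check_ntp.py --max-offset 1",
    "fail_percentage": 15,
    "fail_streak": 2,
    "interval": 60,
    "labels": ["default", "base"],
}
ntp_sync_long = clone_check(
    ntp_sync,
    name="NTPSyncLong",
    command="check_ntp.py --max-offset 10",
)
```

In this sketch, ntp_sync_long has the new name and command, keeps the interval and labels of NTPSync, and leaves the original dictionary unmodified.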
    

Create Health Check with Content File

If you are creating multiple health checks, it may be easier to use the --content argument and provide a JSON, YAML, or INI file with the details for all health checks.

For example, the following JSON content creates two health checks, SampleCheckGPU and SampleCheckCPU:

[
  {
    "name": "SampleCheckGPU",
    "command": "sample-check-gpu.py",
    "interval": 240,
    "timeout": 10,
    "labels": ["gpu"]
  },
  {
    "name": "SampleCheckCPU",
    "command": "sample-check-cpu.py",
    "interval": 2,
    "timeout": 1,
    "labels": ["cpu", "intel"]
  }
]

Save this content to a file, such as samplechecks.json, and pass it with the --content argument when creating health checks. Both health checks are then added to the ClusterWareAI software:

cw-healthctl create --content=@samplechecks.json
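If you maintain many checks, you may prefer to generate the content file from a script instead of editing JSON by hand, which guards against syntax slips such as trailing commas. The following Python sketch writes the two sample checks to a file; write_content_file is a hypothetical helper, and the field names simply mirror the JSON example above.

```python
import json
from pathlib import Path
from typing import Any, Dict, List

# The two sample checks from the JSON example.
SAMPLE_CHECKS: List[Dict[str, Any]] = [
    {
        "name": "SampleCheckGPU",
        "command": "sample-check-gpu.py",
        "interval": 240,
        "timeout": 10,
        "labels": ["gpu"],
    },
    {
        "name": "SampleCheckCPU",
        "command": "sample-check-cpu.py",
        "interval": 2,
        "timeout": 1,
        "labels": ["cpu", "intel"],
    },
]


def write_content_file(path: str, checks: List[Dict[str, Any]]) -> str:
    """Serialize the checks as JSON, write them to path, and return the text."""
    text = json.dumps(checks, indent=2)
    Path(path).write_text(text + "\n", encoding="utf-8")
    return text
```

Calling write_content_file("samplechecks.json", SAMPLE_CHECKS) produces a file that can be passed to the cw-healthctl create command shown above.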