Custom Health Checks

The ClusterWareAI™ software ships with a set of default health checks. You can create additional health checks to match your cluster environment. For example, if you have specific hardware whose failures you want to track, or if you commonly mount a directory on your compute nodes and want to validate that it is present, you can write checks specific to those needs.

At a high level, use the following steps to create a custom health check and add it to compute nodes:

  1. Create a check script.

  2. Add the check script to compute nodes.

  3. Create a new health check using cw-healthctl, the checks API, or the ClusterWareAI GUI.

  4. Map the new health check to compute nodes.

Create Health Check Script

A health check script must be an executable that can run on the target OS and that returns an exit code conforming to the Nagios plugin conventions: 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. See https://nagios-plugins.org/doc/guidelines.html for details.

Health check scripts are often written in a scripting language such as Bash or Python, but compiled programs also work.
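As a minimal sketch of such a script, assuming the goal is to validate a directory on the compute node, the following Python example maps the result to the Nagios plugin return codes. The function name check_directory, the default path /tmp, and the message wording are illustrative assumptions and are not part of ClusterWareAI.

```python
#!/usr/bin/env python3
"""Hypothetical health check: verify that a directory exists and can be listed."""
import os
import sys

# Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed default; replace with the directory you expect on your compute nodes.
DEFAULT_PATH = "/tmp"


def check_directory(path):
    """Return an (exit_code, message) pair describing the state of path."""
    if not os.path.isdir(path):
        return CRITICAL, f"CRITICAL - {path} is missing or is not a directory"
    try:
        os.listdir(path)
    except PermissionError:
        return WARNING, f"WARNING - {path} exists but cannot be listed"
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN - could not inspect {path}: {exc}"
    return OK, f"OK - {path} is present and readable"


def main(argv):
    """Print a one-line status message and return the matching exit code."""
    path = argv[1] if len(argv) > 1 else DEFAULT_PATH
    code, message = check_directory(path)
    print(message)
    return code


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Keeping the decision logic in a function that returns the code, rather than calling sys.exit directly, makes the check easy to exercise on a workstation before you copy it to compute nodes.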

Add the Health Check Script to Compute Nodes

After creating your health check script, add the script and any dependencies to the compute nodes that should run it. There are a few options for adding the script and dependencies to the nodes:

  • Use cw-modimg to modify an existing image and copy the health check script and any dependencies to a known location in the image. This option is useful if you want to use the same health check on multiple nodes and want the script and dependencies to be maintained after reboot. See the example below and Modifying Images for details.

  • Use cw-nodectl scp to copy the health check script to a known location on applicable compute nodes. This option is useful if you do not want to reboot the nodes to add the custom script. It does not work for Kubernetes worker compute nodes running the containerized ClusterWareAI node agent, because those nodes do not support scp.

  • Use an Ansible playbook to copy the health check script and dependencies to all applicable compute nodes. This option is useful if your script has library dependencies. See Using Ansible for details.

For example, to add a custom health check script and related dependencies to an existing image:

  1. Modify the existing image with chroot:

    cw-modimg -i nodeImage --chroot --overwrite --upload
    
  2. Copy the custom health check script to a known location:

    mkdir -p /health-checks
    cp /check/custom-check.py /health-checks
    # Make sure the script is executable
    chmod +x /health-checks/custom-check.py
    
  3. Install dependencies in the image:

    dnf install python3.14
    
  4. Exit the chroot. The image contents are automatically re-packed and replaced.

  5. Reboot the compute nodes that use the image to apply the changes.

Create Health Check

Create a new health check that references the absolute path to the custom health check script you added to the compute nodes. You can create the health check using the checks API, the cw-healthctl tool, or the ARS Policy Page in the ClusterWareAI GUI.

For example, using a JSON file with the cw-healthctl tool:

  1. Create a JSON file that references the absolute path to your health check script. For example:

    {
     "name": "CustomCheck",
     "command": "/health-checks/custom-check.py",
     "interval": 240,
     "timeout": 10,
     "labels": ["gpu"]
    }
    
  2. Create the health check in ClusterWareAI from the JSON file:

    cw-healthctl create --content=@customcheck.json
    
  3. [Optional] Verify the health check details:

    [admin@head0]$ cw-healthctl -i CustomCheck ls -l
    Health Checks
     CustomCheck
      command: /health-checks/custom-check.py
      interval: 240
      labels: [gpu]
      name: CustomCheck
      timeout: 10
    

See Create Health Checks for additional examples.
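If you generate check definitions rather than writing them by hand, a small helper can catch mistakes before you run cw-healthctl. The following sketch builds the same JSON document shown above. The function name build_check and both validation rules (an absolute command path, and a timeout shorter than the interval) are assumptions made for illustration; ClusterWareAI may enforce different constraints.

```python
import json
import posixpath


def build_check(name, command, interval, timeout, labels):
    """Return a dict with the fields used in the JSON example above."""
    # Compute nodes run Linux, so validate the command as a POSIX path.
    if not posixpath.isabs(command):
        raise ValueError(f"command must be an absolute path: {command!r}")
    # Assumed sanity rule: a check should finish before its next run starts.
    if timeout >= interval:
        raise ValueError("timeout should be shorter than interval")
    return {
        "name": name,
        "command": command,
        "interval": interval,
        "timeout": timeout,
        "labels": list(labels),
    }


if __name__ == "__main__":
    check = build_check(
        name="CustomCheck",
        command="/health-checks/custom-check.py",
        interval=240,
        timeout=10,
        labels=["gpu"],
    )
    # Redirect this output to customcheck.json for use with cw-healthctl.
    print(json.dumps(check, indent=1))
```

Redirecting the output to customcheck.json produces a file you can pass to the cw-healthctl create command shown in step 2.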

Map Check to Compute Nodes

Health checks run only on compute nodes to which they are assigned, either by label or by health check name. If you used an existing label when creating the new check, nodes that already have that label run the check automatically, provided the custom health check script is present at the specified location. To run the check elsewhere, map it to additional compute nodes or, if you used a new label, add that label to nodes or to an attribute group. See Add Health Checks to Nodes for details.
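The assignment rule can be pictured with a small model. This is only an illustrative sketch of the "label or health check name" rule described above, not ClusterWareAI code. The node dictionary layout (name, labels, checks) is a hypothetical structure invented for the example.

```python
def node_runs_check(node, check):
    """Return True if the node is assigned the check by label or by name."""
    shares_label = bool(set(node.get("labels", [])) & set(check.get("labels", [])))
    assigned_by_name = check["name"] in node.get("checks", [])
    return shares_label or assigned_by_name


if __name__ == "__main__":
    check = {"name": "CustomCheck", "labels": ["gpu"]}
    nodes = [
        {"name": "n0", "labels": ["gpu"], "checks": []},        # matches by label
        {"name": "n1", "labels": [], "checks": ["CustomCheck"]},  # matches by name
        {"name": "n2", "labels": ["cpu"], "checks": []},        # not assigned
    ]
    print([n["name"] for n in nodes if node_runs_check(n, check)])
    # prints ['n0', 'n1']
```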