Custom Health Checks
The ClusterWareAI™ software ships with a set of default health checks. You can create additional health checks to match your cluster environment. For example, you can write a check that tracks failures on specific hardware, or one that validates the location of a directory that you commonly mount on your compute nodes.
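At a high level, use the following steps to create a custom health check and add it to compute nodes:
Create a health check script.
Add the health check script and any dependencies to the compute nodes.
Create a new health check using cw-healthctl, the checks API, or the ClusterWareAI GUI.
Map the check to compute nodes.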
Create Health Check Script
A health check script must be an executable that can run on the target OS and that returns an exit code conforming to the Nagios plugin conventions: 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. See https://nagios-plugins.org/doc/guidelines.html for details.
Health check scripts are often written in a scripting language such as Bash or Python, but compiled programs also work.
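As an illustration of the exit code convention, here is a minimal Python sketch of a check that validates a mounted directory. The mount point /data, the function names, and the choice to report an empty mount as WARNING are assumptions made for this example and are not part of ClusterWareAI.

```python
#!/usr/bin/env python3
"""Sketch of a custom health check: verify that a directory is mounted."""
import os
import sys

# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical mount point; replace with the directory your nodes mount.
MOUNT_POINT = "/data"


def check_mount(path):
    """Return (exit_code, message) describing the state of a mount point."""
    try:
        if not os.path.isdir(path):
            return CRITICAL, f"CRITICAL - {path} does not exist"
        if not os.path.ismount(path):
            return CRITICAL, f"CRITICAL - {path} exists but is not mounted"
        if not os.listdir(path):
            # Assumption for this sketch: an empty mount is suspicious.
            return WARNING, f"WARNING - {path} is mounted but empty"
        return OK, f"OK - {path} is mounted"
    except OSError as exc:
        # Unexpected failures (permissions, I/O errors) map to UNKNOWN.
        return UNKNOWN, f"UNKNOWN - could not inspect {path}: {exc}"


def main():
    """Print one status line and return the matching exit code."""
    code, message = check_mount(MOUNT_POINT)
    print(message)
    return code
```

The block only defines functions so that it can be imported and tested. To use it as a standalone executable, the file would end with `sys.exit(main())` so that the process exit code carries the check result.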
Add the Health Check Script to Compute Nodes
After creating your health check script, add the script and any dependencies to the compute nodes where you want to run it. There are a few options for adding the script and dependencies to the nodes:
Use cw-modimg to modify an existing image and copy the health check script and any dependencies to a known location in the image. This option is useful if you want to use the same health check on multiple nodes and want the script and dependencies to be maintained after reboot. See the example below and Modifying Images for details.
Use cw-nodectl scp to copy the health check script to a known location on applicable compute nodes. This option is useful if you do not want to reboot the nodes to add the custom script. This option does not work for Kubernetes worker compute nodes running with the containerized ClusterWareAI node agent as they do not support scp.
Use an Ansible script to copy the health check script and dependencies to all applicable compute nodes. This option is useful if your script has library dependencies. See Using Ansible for details.
For example, to add a custom health check script and related dependencies to an existing image:
Modify the existing image with chroot:
cw-modimg -i nodeImage --chroot --overwrite --upload
Copy the custom health check script to a known location:
mkdir -p /health-checks
cp /check/custom-check.py /health-checks
Install dependencies in the image:
dnf install python3.14
Exit the chroot. The image contents are automatically re-packed and replaced.
Reboot the compute nodes that use the image to apply the changes.
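Once the script is in place on a node, it can help to run it by hand and confirm that it returns a Nagios-style exit code within a reasonable time. The following sketch does that with Python's standard subprocess module. The run_check helper and its 10-unit default timeout are hypothetical, and reporting a timeout as UNKNOWN is a choice made for this example; this section does not describe how ClusterWareAI itself treats a check that times out.

```python
import subprocess

# Nagios plugin exit codes mapped to status names.
STATUS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(command, timeout=10):
    """Run a check executable and return (status_name, output)."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except OSError as exc:
        # Missing file, missing execute permission, bad interpreter line.
        return "UNKNOWN", f"could not run {command[0]}: {exc}"
    except subprocess.TimeoutExpired:
        return "UNKNOWN", f"no result after {timeout} seconds"
    # Exit codes outside 0-3 are treated as UNKNOWN.
    status = STATUS.get(result.returncode, "UNKNOWN")
    return status, result.stdout.strip()


if __name__ == "__main__":
    # Path taken from the example above; adjust to your own location.
    status, output = run_check(["/health-checks/custom-check.py"])
    print(f"{status}: {output}")
```

The command is passed as a list so that no shell is involved, which keeps the test close to how an executable is launched directly.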
Create Health Check
Create a new health check that references the absolute path to the custom health
check script you added to the compute nodes. You can create the health check
using the checks API, the cw-healthctl tool, or the ARS Policy Page in
the ClusterWareAI GUI.
For example, using a JSON file with the cw-healthctl tool:
Create a JSON file that references the absolute path to your health check script. For example:
{
  "name": "CustomCheck",
  "command": "/health-checks/custom-check.py",
  "interval": 240,
  "timeout": 10,
  "labels": ["gpu"]
}
Within ClusterWareAI, create a health check:
cw-healthctl create --content=@customcheck.json
[Optional] Verify the health check details:
[admin@head0]$ cw-healthctl -i CustomCheck ls -l
Health Checks
  CustomCheck
    command: /health-checks/custom-check.py
    interval: 240
    labels: [gpu]
    name: CustomCheck
    timeout: 10
See Create Health Checks for additional examples.
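A JSON definition like the one in the example above can be sanity-checked before it is passed to cw-healthctl. The sketch below is one possible approach. The validate_check function is hypothetical, and the list of required fields is an assumption inferred from the example rather than a documented schema.

```python
import json
import posixpath


def validate_check(definition):
    """Return a list of problems found in a health check definition."""
    problems = []
    # Assumed required fields, inferred from the example definition.
    for key in ("name", "command", "interval", "timeout"):
        if key not in definition:
            problems.append(f"missing field: {key}")
    # The command must reference the script by absolute path.
    parts = str(definition.get("command", "")).split()
    if parts and not posixpath.isabs(parts[0]):
        problems.append("command is not an absolute path")
    for key in ("interval", "timeout"):
        value = definition.get(key)
        if value is not None and (
            not isinstance(value, (int, float)) or value <= 0
        ):
            problems.append(f"{key} must be a positive number")
    return problems


EXAMPLE = """
{
  "name": "CustomCheck",
  "command": "/health-checks/custom-check.py",
  "interval": 240,
  "timeout": 10,
  "labels": ["gpu"]
}
"""

if __name__ == "__main__":
    problems = validate_check(json.loads(EXAMPLE))
    print(problems or "no problems found")
```

Parsing the file with json.loads also catches syntax errors such as a trailing comma before the definition reaches the cluster.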
Map Check to Compute Nodes
Health checks run only on compute nodes that they are assigned to, either by label or by health check name. If you used an existing label when creating the new check, nodes that already have that label, and that have the custom health check script in the specified location, run the check automatically. You can map the check to additional compute nodes or, if you used a new label, add the label to nodes or to an attribute group. See Add Health Checks to Nodes for details.
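The assignment rule described above (a check applies to a node through a shared label or through the check name) can be pictured with a small Python sketch. This is only an illustration of the rule as stated here, not ClusterWareAI's implementation. The checks_for_node function, the data layout, and the DiskCheck entry are hypothetical, and attribute groups are not modeled.

```python
def checks_for_node(node_labels, node_checks, checks):
    """Return the names of checks assigned to a node by label or by name."""
    selected = []
    for check in checks:
        by_name = check["name"] in node_checks
        by_label = bool(set(check.get("labels", [])) & set(node_labels))
        if by_name or by_label:
            selected.append(check["name"])
    return selected


if __name__ == "__main__":
    checks = [
        {"name": "CustomCheck", "labels": ["gpu"]},
        {"name": "DiskCheck", "labels": ["storage"]},  # hypothetical check
    ]
    # A node labeled "gpu" with no checks assigned by name.
    print(checks_for_node({"gpu"}, set(), checks))
```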