Health Check Primitives#
Health Check primitives represent compute node health checks. Health checks are used to define the script and related command to run, how often it runs, and what failure threshold should trigger reporting. If you are using the Auto Remediation Service (ARS), reported health check failures can trigger automated remediation workflows.
Health checks run on configured compute nodes at scheduled intervals to detect issues and report results back to the ClusterWareAI ™ head node. The health check API is used to create, modify, and view details about health check primitives and to set default values for health check fields, such as interval, timeout, and so on.
When issuing requests, the UID field in the URL can be either the actual
UID of the primitive or the name of the primitive as given in the "name"
field.
Data Fields#
Health Check primitives have several fields:
name
Required: The name of the health check. Names must start with an
alphabet character, not a number.
description
Optional: A text description for the health check.
command
Required: The health check script command to run, including required
arguments.
labels
Optional: One or more labels used to associate the health check with
nodes. Compute nodes with matching labels in the _ars_checks reserved
attribute automatically run the check.
interval
Optional: How frequently the health check runs, in seconds. If unset, the
default value is used.
timeout
Optional: How long the health check is allowed to run before timing out,
in seconds. If unset, the default value is used.
fail_streak
Optional: The number of consecutive failures allowed before the failure
is reported. If unset, the default value is used.
fail_percentage
Optional: The percentage of recent runs that can fail before the failure
is reported. If unset, the default value is used.
Additional Endpoints#
In addition to the standard create, read, update, and delete operations described in Basic Operations, the health check API provides several additional endpoints:
GET /checks
Returns a list of all health checks.
GET /checks/default
Returns the default values for health check fields.
PATCH /checks/default
Update the default values for health check fields.
GET /checks/<node_id>
Returns the list of health checks assigned to a node. <node_id> can be a
node name, UID, or "CURRENT".
Example#
List all health checks:
curl -X GET https://head1.cluster.local/api/v1/checks \
-H "Authorization: Bearer <access_token>" -H "Content-Type: application/json"
{"success":true,"data":["35a30c9349224e6093eae7f4f3c0010a","a914cfbf1ccf4e9a9fe23921a3ebf7da"]}
Update default field values:
curl -X PATCH https://head1.cluster.local/api/v1/checks/defaults \
-H "Authorization: Bearer <access_token>" \
-H "Content-Type: application/json" \
-d '{"interval": 90}'
{"success":true}
Create a new health check:
curl -X POST https://head1.cluster.local/api/v1/checks \
-H "Content-Type: application/json" --data '{"name": "SampleCheckGPU", \
"command": "sample-check-gpu.py", "labels": ["gpu"], "timeout":90, \
"interval":120} -H "Authorization: Bearer <access_token>"
{"success":true,"data":"c9f74242745142358b325e4834808fcf"}
Get health check details by name:
curl -X GET "https://head1.cluster.local/api/v1/check/zombie" \
-H "Authorization: Bearer <access_token>" \
-H "Content-Type: application/json" | jq
{
"success": true,
"data": {
"name": "zombie",
"labels": [
"cpu"
],
"interval": 30000,
"timeout": 10,
"command": "check_zombie.py -w 10 -c 20",
"description": "Check zombie process on the node",
"last_modified": 1776693966.9676054,
"last_modified_on": "head0.cluster.local",
"last_modified_by": "admin",
"uid": "35a30c9349224e6093eae7f4f3c0010a"
}
}
Update an existing health check:
curl -X PATCH https://head1.cluster.local/api/v1/check/SampleCheckGPU \
--data '{"interval":240,"timeout":15}' \
-H "Authorization: Bearer <access_token>"
{"success":true}
List the health checks assigned to a node:
curl -X GET https://head1.cluster.local/api/v1/checks/n42 \
-H "Authorization: Bearer <access_token>" -H "Content-Type: application/json" | jq
{
"success": true,
"data": [
{
"name": "zombie",
"labels": [
"cpu"
],
"interval": 30000,
"timeout": 10,
"command": "check_zombie.py -w 10 -c 20",
"description": "Check zombie process on the node",
"last_modified": 1776693966.9676054,
"last_modified_on": "head0.cluster.local",
"last_modified_by": "admin",
"uid": "35a30c9349224e6093eae7f4f3c0010a"
}
]
}