Health Check Primitives#

Health Check primitives represent compute node health checks. Each health check defines the script command to run, how often it runs, and the failure threshold that triggers reporting. If you use the Auto Remediation Service (ARS), reported health check failures can trigger automated remediation workflows.

Health checks run on configured compute nodes at scheduled intervals to detect issues and report results back to the ClusterWareAI™ head node. The health check API is used to create, modify, and view health check primitives and to set default values for health check fields such as interval and timeout.

When issuing requests, the UID field in the URL can be either the actual UID of the primitive or the name of the primitive as given in the "name" field.

Data Fields#

Health Check primitives have several fields:

name
   Required: The name of the health check. Names must start with an
   alphabetic character, not a number.

description
   Optional: A text description for the health check.

command
   Required: The health check script command to run, including required
   arguments.

labels
   Optional: One or more labels used to associate the health check with
   nodes. Compute nodes with matching labels in the _ars_checks reserved
   attribute automatically run the check.

interval
   Optional: How frequently the health check runs, in seconds. If unset, the
   default value is used.

timeout
   Optional: How long the health check is allowed to run before timing out,
   in seconds. If unset, the default value is used.

fail_streak
   Optional: The number of consecutive failures allowed before the failure
   is reported. If unset, the default value is used.

fail_percentage
   Optional: The percentage of recent runs that can fail before the failure
   is reported. If unset, the default value is used.
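
The field rules above can be checked on the client before a request is sent. The following is a minimal Python sketch, not part of the product tooling: the helper name build_check_payload is hypothetical, and it enforces only the rules stated in the field list (name and command are required, and the name starts with a letter). The server may apply additional validation that is not modeled here.

```python
import json


def build_check_payload(name, command, **optional):
    """Build a JSON body for a health check (hypothetical helper).

    Enforces only the rules stated in the field list: name and command
    are required, and name must start with an alphabetic character.
    """
    allowed = {"description", "labels", "interval", "timeout",
               "fail_streak", "fail_percentage"}
    if not name or not name[0].isalpha():
        raise ValueError("name must start with an alphabetic character")
    if not command:
        raise ValueError("command is required")
    unknown = set(optional) - allowed
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    payload = {"name": name, "command": command}
    # Leave unset optional fields out so the configured defaults apply.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return json.dumps(payload)


body = build_check_payload("SampleCheckGPU", "sample-check-gpu.py",
                           labels=["gpu"], timeout=90, interval=120)
print(body)
```

Omitting interval, timeout, fail_streak, or fail_percentage from the body leaves those fields unset, so the default values are used.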

Additional Endpoints#

In addition to the standard create, read, update, and delete operations described in Basic Operations, the health check API provides several additional endpoints:

GET /checks
   Returns a list of all health checks.

GET /checks/default
   Returns the default values for health check fields.

PATCH /checks/default
   Updates the default values for health check fields.

GET /checks/<node_id>
   Returns the list of health checks assigned to a node. <node_id> can be a
   node name, UID, or "CURRENT".
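
As a rough illustration of how these endpoints might be called from a script, the Python sketch below builds (but does not send) requests with the standard library. The helper names make_request and node_checks_path are hypothetical, the host name is the one used in the curl examples, and the paths follow the endpoint list above. Treat it as a sketch under those assumptions, not as an official client.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://head1.cluster.local/api/v1"  # host taken from the examples


def make_request(method, path, token, body=None):
    """Return an unsent urllib Request for a health check endpoint."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Authorization": f"Bearer {token}"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data,
                                  headers=headers, method=method)


def node_checks_path(node_id="CURRENT"):
    # <node_id> may be a node name, a UID, or the literal "CURRENT".
    return "/checks/" + quote(node_id, safe="")


token = "<access_token>"  # placeholder; substitute a real token
list_req = make_request("GET", "/checks", token)
defaults_req = make_request("PATCH", "/checks/default", token, {"interval": 90})
node_req = make_request("GET", node_checks_path("n42"), token)

# Sending is left commented out because it needs a reachable head node:
# with urllib.request.urlopen(list_req) as resp:
#     print(json.load(resp))
print(defaults_req.get_method(), defaults_req.full_url)
```

Building the Request objects separately from sending them keeps the URL and body construction easy to inspect before anything reaches the head node.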

Example#

List all health checks:

curl -X GET https://head1.cluster.local/api/v1/checks \
     -H "Authorization: Bearer <access_token>" -H "Content-Type: application/json"
{"success":true,"data":["35a30c9349224e6093eae7f4f3c0010a","a914cfbf1ccf4e9a9fe23921a3ebf7da"]}

Update default field values:

curl -X PATCH https://head1.cluster.local/api/v1/checks/default \
     -H "Authorization: Bearer <access_token>" \
     -H "Content-Type: application/json" \
     -d '{"interval": 90}'
{"success":true}

Create a new health check:

curl -X POST https://head1.cluster.local/api/v1/checks \
     -H "Authorization: Bearer <access_token>" \
     -H "Content-Type: application/json" \
     --data '{"name": "SampleCheckGPU", "command": "sample-check-gpu.py",
              "labels": ["gpu"], "timeout": 90, "interval": 120}'
{"success":true,"data":"c9f74242745142358b325e4834808fcf"}

Get health check details by name:

curl -X GET "https://head1.cluster.local/api/v1/check/zombie" \
     -H "Authorization: Bearer <access_token>" \
     -H "Content-Type: application/json" | jq
{
  "success": true,
  "data": {
    "name": "zombie",
    "labels": [
      "cpu"
    ],
    "interval": 30000,
    "timeout": 10,
    "command": "check_zombie.py -w 10 -c 20",
    "description": "Check zombie process on the node",
    "last_modified": 1776693966.9676054,
    "last_modified_on": "head0.cluster.local",
    "last_modified_by": "admin",
    "uid": "35a30c9349224e6093eae7f4f3c0010a"
  }
}
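
In the examples above, listing health checks returns only UIDs, so a script that needs full details has to follow up with one request per check. The Python sketch below shows one way to combine the two calls. The function fetch_all_checks and the get_json callable are hypothetical names introduced for illustration, and the stand-in fake_get_json returns canned replies modeled on the example output instead of contacting a head node.

```python
def fetch_all_checks(get_json):
    """Map each health check UID to its detail record.

    get_json is any callable that takes an API path and returns the
    decoded JSON reply (hypothetical; supply your own HTTP wrapper).
    """
    uids = get_json("/checks")["data"]
    return {uid: get_json(f"/check/{uid}")["data"] for uid in uids}


# Stand-in for a real HTTP call, using values from the examples above.
def fake_get_json(path):
    canned = {
        "/checks": {"success": True,
                    "data": ["35a30c9349224e6093eae7f4f3c0010a"]},
        "/check/35a30c9349224e6093eae7f4f3c0010a": {
            "success": True,
            "data": {"name": "zombie", "interval": 30000, "timeout": 10},
        },
    }
    return canned[path]


checks = fetch_all_checks(fake_get_json)
print(checks["35a30c9349224e6093eae7f4f3c0010a"]["name"])  # zombie
```

Passing the HTTP call in as a parameter keeps the lookup logic testable without network access.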

Update an existing health check:

curl -X PATCH https://head1.cluster.local/api/v1/check/SampleCheckGPU \
     -H "Authorization: Bearer <access_token>" \
     -H "Content-Type: application/json" \
     --data '{"interval":240,"timeout":15}'
{"success":true}

List the health checks assigned to a node:

curl -X GET https://head1.cluster.local/api/v1/checks/n42 \
     -H "Authorization: Bearer <access_token>" -H "Content-Type: application/json" | jq
{
  "success": true,
  "data": [
    {
      "name": "zombie",
      "labels": [
        "cpu"
      ],
      "interval": 30000,
      "timeout": 10,
      "command": "check_zombie.py -w 10 -c 20",
      "description": "Check zombie process on the node",
      "last_modified": 1776693966.9676054,
      "last_modified_on": "head0.cluster.local",
      "last_modified_by": "admin",
      "uid": "35a30c9349224e6093eae7f4f3c0010a"
    }
  ]
}
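
A reply like the one above can be reduced to the fields of interest with a few lines of Python. This is a minimal sketch: summarize_checks is a hypothetical helper, the embedded reply is the example output trimmed to the fields used here, and the assumption that a failed request carries "success": false is not confirmed by the examples.

```python
import json

# Reply body copied from the example above (trimmed to the fields used here).
response_text = """
{"success": true,
 "data": [{"name": "zombie", "labels": ["cpu"], "interval": 30000,
           "timeout": 10, "command": "check_zombie.py -w 10 -c 20",
           "uid": "35a30c9349224e6093eae7f4f3c0010a"}]}
"""


def summarize_checks(text):
    """Return (name, interval, timeout) tuples from a node's check list."""
    reply = json.loads(text)
    # Assumption: an unsuccessful reply sets "success" to false.
    if not reply.get("success"):
        raise RuntimeError("request was not successful")
    return [(c["name"], c["interval"], c["timeout"]) for c in reply["data"]]


for name, interval, timeout in summarize_checks(response_text):
    print(f"{name}: every {interval}s, timeout {timeout}s")
```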