ARS Remedy Primitives

A remedy primitive represents an activity that the ClusterWareAI™ auto remediation service (ARS) performs to automatically resolve a compute node health check failure. ARS must be enabled before it applies remedies automatically after a health check failure. See Configure ARS for details.

Use the remedy API to create, modify, delete, and view remedy definitions, including the severity, impact, runnable class, follow-up actions, and associated health checks.

When issuing requests, the UID field in the URL can be the actual UID of the primitive or the name of the primitive as given in the "name" field. For example, a remedy named node_reboot can be referenced through /remedy/node_reboot or /remedy/<UID>.
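As a minimal sketch of the name-or-UID rule above, the helper below builds the URL for a single remedy from either kind of reference. The function name `remedy_url` and the `BASE_URL` constant are illustrative assumptions and not part of the ClusterWare API; the base URL is the one used in the examples on this page.

```python
from urllib.parse import quote

# Assumed base URL, taken from the examples on this page; adjust for your head node.
BASE_URL = "https://head1.cluster.local/api/v1"


def remedy_url(ref: str, base_url: str = BASE_URL) -> str:
    """Return the URL for one remedy, given either its name or its UID."""
    if not ref:
        raise ValueError("remedy reference must be a non-empty name or UID")
    # Percent-encode the reference so unexpected characters cannot alter the path.
    return f"{base_url.rstrip('/')}/remedy/{quote(ref, safe='')}"


# Both forms address a remedy through the same /remedy/<REMEDY> path.
print(remedy_url("node_reboot"))
print(remedy_url("35a30c9349224e6093eae7f4f3c0010a"))
```

Because the server accepts either form, a client does not need to look up the UID before addressing a remedy whose name it already knows.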

Data Fields

Remedy primitives have the following fields:

name
   Required: The name of the remedy. Names must start with an alphabetic
   character (a letter), not a number.

description
   Optional: A text description for the remedy.

runnable
   Required: The Python class name used to run the remedy. The value must be
   a valid Python identifier.

arguments
   Optional: A dictionary of arguments for the runnable.

after
   Optional: A list of action names to run after the remedy completes. Each
   entry must be a valid action object name.

severity
   Optional: A floating point severity score from ``0.0`` to ``1.0``. ARS
   uses severity when selecting a remedy for a failed health check.

impact
   Optional: A floating point impact score from ``0.0`` to ``1.0``. ARS uses
   impact when selecting a remedy for a failed health check.

checks
   Optional: The health checks this remedy can resolve.

confidence
   Required in each ``checks`` entry: A floating point score from ``0.0`` to
   ``1.0`` describing the likelihood that the remedy fixes the failing
   health check. Each ``checks`` entry also requires a ``reason`` string.
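The field rules above can be checked on the client before a request is sent. The following is a rough sketch only, not the validation that ARS itself performs. The helper names are hypothetical, and the shape of ``checks`` is an assumption: the sketch accepts either a list of entries or a dictionary whose values are entries, because this page does not specify which form the API uses.

```python
def _score_error(field, value):
    """Return an error string if value is not a number in [0.0, 1.0], else None."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return f"{field}: must be a number"
    if not 0.0 <= value <= 1.0:
        return f"{field}: must be between 0.0 and 1.0"
    return None


def validate_remedy(remedy):
    """Collect field errors for a remedy dict; an empty list means none were found."""
    errors = []

    # name: required, must start with a letter.
    name = remedy.get("name")
    if not isinstance(name, str) or not name[:1].isalpha():
        errors.append("name: required, must start with a letter")

    # runnable: required, must be a valid Python identifier.
    runnable = remedy.get("runnable")
    if not isinstance(runnable, str) or not runnable.isidentifier():
        errors.append("runnable: required, must be a valid Python identifier")

    # arguments: optional dictionary.
    if "arguments" in remedy and not isinstance(remedy["arguments"], dict):
        errors.append("arguments: must be a dictionary")

    # severity and impact: optional scores from 0.0 to 1.0.
    for field in ("severity", "impact"):
        if field in remedy:
            err = _score_error(field, remedy[field])
            if err:
                errors.append(err)

    # ASSUMPTION: checks is a list of entries or a dict whose values are entries.
    checks = remedy.get("checks", [])
    entries = list(checks.values()) if isinstance(checks, dict) else checks
    for i, entry in enumerate(entries):
        label = f"checks[{i}]"
        if not isinstance(entry, dict):
            errors.append(f"{label}: must be an object")
            continue
        err = _score_error(f"{label}.confidence", entry.get("confidence"))
        if err:
            errors.append(err)
        if not isinstance(entry.get("reason"), str):
            errors.append(f"{label}.reason: required string")

    return errors
```

The function returns every problem it finds instead of stopping at the first one, so a caller can report all field errors together. It does not verify that ``after`` entries name existing action objects, since that can only be confirmed by the server.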

Additional Endpoints

The remedy API provides the following endpoints:

GET /remedies
   Returns a list of all remedies.

POST /remedies
   Creates a remedy.

GET /remedy/<REMEDY>
   Returns details about a given remedy (by name or UID).

PATCH /remedy/<REMEDY>
   Updates an existing remedy (by name or UID).

DELETE /remedy/<REMEDY>
   Deletes a remedy (by name or UID).

GET /remedies/defaults
   Returns stored default settings for remedy fields.

PATCH /remedies/defaults
   Sets defaults for remedy fields.
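To illustrate how the endpoints above map onto HTTP requests, the sketch below prepares requests with the Python standard library without sending them. The `build_request` helper and `BASE_URL` constant are assumptions made for illustration; the base URL, bearer token header, and content type follow the curl examples on this page.

```python
import json
from urllib.request import Request

# Assumed base URL, taken from the curl examples on this page.
BASE_URL = "https://head1.cluster.local/api/v1"


def build_request(method, path, token, body=None):
    """Prepare (but do not send) a request to a remedy endpoint."""
    # Serialize the body as JSON only when one is supplied (POST and PATCH).
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return Request(
        url=f"{BASE_URL}{path}",
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# GET /remedies: list all remedies.
list_req = build_request("GET", "/remedies", "<access_token>")

# PATCH /remedy/<REMEDY>: update selected fields of an existing remedy.
patch_req = build_request(
    "PATCH", "/remedy/node_reboot_gpu", "<access_token>",
    body={"severity": 0.7, "impact": 0.6},
)

print(list_req.get_method(), list_req.full_url)
print(patch_req.get_method(), patch_req.full_url)
```

A prepared request can then be sent with `urllib.request.urlopen(request)`. Keeping construction separate from sending makes the method, path, and body easy to inspect before anything reaches the head node.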

Note

If you are interested in creating remedies specific to your cluster environment, contact Penguin Computing for assistance.

Example

List all remedies:

curl -X GET https://head1.cluster.local/api/v1/remedies \
     -H "Authorization: Bearer <access_token>" \
     -H "Content-Type: application/json"
{"success":true,"data":["35a30c9349224e6093eae7f4f3c0010a","a914cfbf1ccf4e9a9fe23921a3ebf7da"]}

List details about a remedy by name:

curl -X GET https://head1.cluster.local/api/v1/remedy/node_reboot \
     -H "Authorization: Bearer <access_token>" \
     -H "Content-Type: application/json" | jq
{
  "success": true,
  "data": {
    "name": "node_reboot",
    "description": "Hard reboot the node.",
    "runnable": "ClusterWareRebootRemediationPlan",
    "after": [
      "send_auto_fixed",
      "send_require_manual_fix"
    ],
    "severity": 0.7,
    "impact": 0.8,
    "uid": "35a30c9349224e6093eae7f4f3c0010a"
  }
}

Update the severity and impact for a remedy:

curl -X PATCH https://head1.cluster.local/api/v1/remedy/node_reboot_gpu \
     --data '{"severity":0.7,"impact":0.6}' \
     -H "Authorization: Bearer <access_token>" \
     -H "Content-Type: application/json"
{"success":true}
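The responses in these examples share a common shape: a ``success`` flag and, where there is something to return, a ``data`` value. The sketch below unwraps a response body on that basis. The `unwrap` function and `RemedyApiError` class are hypothetical helpers, and the treatment of failures is an assumption, since this page shows no failed response; the sketch simply treats anything other than ``"success": true`` as an error.

```python
import json


class RemedyApiError(Exception):
    """Raised when a response does not report success (hypothetical helper)."""


def unwrap(response_text):
    """Parse a response body and return its "data" value, or None if absent."""
    payload = json.loads(response_text)
    # ASSUMPTION: a failed request has "success" set to false or missing.
    if payload.get("success") is not True:
        raise RemedyApiError(f"request did not succeed: {payload!r}")
    return payload.get("data")


# The list response from the first example yields a list of remedy UIDs.
uids = unwrap(
    '{"success":true,"data":["35a30c9349224e6093eae7f4f3c0010a",'
    '"a914cfbf1ccf4e9a9fe23921a3ebf7da"]}'
)
print(uids)

# The update response carries no data, so unwrap returns None.
print(unwrap('{"success":true}'))
```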