ARS Remedy Primitives
A remedy primitive represents an activity completed by the ClusterWareAI™ auto remediation service (ARS) to automatically resolve a compute node health check failure. ARS must be enabled before remedies are applied automatically after a health check failure. See Configure ARS for details.
Use the remedy API to create, modify, delete, and view remedy definitions, including the severity, impact, runnable class, follow-up actions, and associated health checks.
When issuing requests, the UID field in the URL can be the actual UID of
the primitive or the name of the primitive as given in the "name" field.
For example, a remedy named node_reboot can be referenced through
/remedy/node_reboot or /remedy/<UID>.
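The name-or-UID addressing described above can be sketched in a few lines of Python. This is only an illustration: the helper name remedy_url is hypothetical, and the base URL is assumed from the curl examples on this page.

```python
from urllib.parse import quote

# ASSUMPTION: base URL copied from the curl examples on this page.
API_BASE = "https://head1.cluster.local/api/v1"


def remedy_url(ref: str, base: str = API_BASE) -> str:
    """Build the URL for one remedy from either its name or its UID."""
    if not ref:
        raise ValueError("a remedy name or UID is required")
    # Percent-encode the reference so it always stays a single path segment.
    return f"{base.rstrip('/')}/remedy/{quote(ref, safe='')}"


# The same helper works for a name and for a UID.
print(remedy_url("node_reboot"))
print(remedy_url("35a30c9349224e6093eae7f4f3c0010a"))
```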
Data Fields
Remedy primitives have several fields:
name
Required: The name of the remedy. Names must start with an alphabetic
character, not a digit.
description
Optional: A text description for the remedy.
runnable
Required: The Python class name used to run the remedy. The value must be
a valid Python identifier.
arguments
Optional: A dictionary of arguments for the runnable.
after
Optional: A list of action names to run after the remedy completes. Each
entry must be a valid action object name.
severity
Optional: A floating point severity score from ``0.0`` to ``1.0``. ARS
uses severity when selecting a remedy for a failed health check.
impact
Optional: A floating point impact score from ``0.0`` to ``1.0``. ARS uses
impact when selecting a remedy for a failed health check.
checks
Optional: The health checks this remedy can resolve.
confidence
Required in each ``checks`` entry: A floating point score from ``0.0`` to
``1.0`` describing the likelihood that the remedy fixes the failing
health check. Each entry also requires a ``reason`` string.
Additional Endpoints
The remedy API provides the following endpoints:
GET /remedies
Returns a list of all remedies.
POST /remedies
Creates a remedy.
GET /remedy/<REMEDY>
Returns details about a given remedy (by name or UID).
PATCH /remedy/<REMEDY>
Updates an existing remedy (by name or UID).
DELETE /remedy/<REMEDY>
Deletes a remedy (by name or UID).
GET /remedies/defaults
Returns stored default settings for remedy fields.
PATCH /remedies/defaults
Sets defaults for remedy fields.
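As an illustration of how these endpoints might be called from Python instead of curl, the sketch below builds (but does not send) a request using only the standard library. The helper build_request and the remedy name example_remedy are hypothetical, and the base URL and bearer-token header are assumed from the curl examples on this page.

```python
import json
from urllib.request import Request

# ASSUMPTION: base URL copied from the curl examples on this page.
API_BASE = "https://head1.cluster.local/api/v1"


def build_request(method, path, token, body=None):
    """Prepare a request for the remedy API without sending it."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return Request(
        url=f"{API_BASE}{path}",
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical definition; only the runnable name is taken from this page.
new_remedy = {
    "name": "example_remedy",
    "runnable": "ClusterWareRebootRemediationPlan",
    "severity": 0.5,
    "impact": 0.5,
}
create = build_request("POST", "/remedies", "<access_token>", new_remedy)
print(create.get_method(), create.full_url)
```

Passing the prepared object to urllib.request.urlopen would send it; that step is left out here so the sketch has no side effects.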
Note
If you are interested in creating remedies specific to your cluster environment, contact Penguin Computing for assistance.
Examples
List all remedies:
curl -X GET https://head1.cluster.local/api/v1/remedies \
-H "Authorization: Bearer <access_token>" \
-H "Content-Type: application/json"
{"success":true,"data":["35a30c9349224e6093eae7f4f3c0010a","a914cfbf1ccf4e9a9fe23921a3ebf7da"]}
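Responses on this page share a "success" flag and, where applicable, a "data" member. A small sketch of unwrapping that envelope in Python follows. The function name unwrap is hypothetical, and the shape of a failed response is not documented here, so the error branch simply reports the whole payload.

```python
import json


def unwrap(response_text: str):
    """Return the "data" member of a response, or raise if "success" is not true."""
    payload = json.loads(response_text)
    if payload.get("success") is not True:
        # ASSUMPTION: failures are signalled by "success" being absent or false.
        raise RuntimeError(f"request failed: {payload}")
    # Some responses (for example the PATCH reply) carry no "data" member.
    return payload.get("data")


sample = '{"success":true,"data":["35a30c9349224e6093eae7f4f3c0010a","a914cfbf1ccf4e9a9fe23921a3ebf7da"]}'
uids = unwrap(sample)
print(len(uids), "remedies")
```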
List details about a remedy by name:
curl -X GET https://head1.cluster.local/api/v1/remedy/node_reboot \
-H "Authorization: Bearer <access_token>" \
-H "Content-Type: application/json" | jq
{
"success": true,
"data": {
"name": "node_reboot",
"description": "Hard reboot the node.",
"runnable": "ClusterWareRebootRemediationPlan",
"after": [
"send_auto_fixed",
"send_require_manual_fix"
],
"severity": 0.7,
"impact": 0.8,
"uid": "35a30c9349224e6093eae7f4f3c0010a"
}
}
Update the severity and impact for a remedy:
curl -X PATCH https://head1.cluster.local/api/v1/remedy/node_reboot_gpu \
--data '{"severity":0.7,"impact":0.6}' \
-H "Authorization: Bearer <access_token>" \
-H "Content-Type: application/json"
{"success":true}
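Because PATCH takes only the fields being changed, the request body can be generated from keyword arguments. The sketch below is one way to do that in Python; patch_body is a hypothetical helper, and the range check mirrors the documented 0.0 to 1.0 limits for severity and impact rather than any server-side behavior.

```python
import json


def patch_body(**changes) -> str:
    """Serialize only the fields being changed, rejecting out-of-range scores."""
    for field in ("severity", "impact"):
        if field in changes and not 0.0 <= changes[field] <= 1.0:
            raise ValueError(f"{field} must be between 0.0 and 1.0")
    # Compact separators reproduce the spacing used in the curl --data string.
    return json.dumps(changes, separators=(",", ":"))


print(patch_body(severity=0.7, impact=0.6))
```

The printed string can be passed to curl through --data in place of a hand-written JSON literal.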