Remedies and Actions

A remedy is a planned activity that the ClusterWareAI™ software performs during auto remediation of a compute node that has ARS enabled and has failed one or more health checks. Remedies can be as simple as rebooting the compute node or can be hardware-specific, such as restarting services on NVIDIA GPUs. Each remedy has a list of health checks it can resolve. A remedy can have zero, one, or multiple related actions. An action is the step or steps taken after the remedy completes. A common action is to move the compute node to the Provisioning state.

Use the cw-remedyctl and cw-actionctl tools to manage remedies and actions.

Health Checks and Remedies

Health checks can have zero, one, or multiple remedies.

  • If a health check fails and has no associated remedy, the failure is logged. Because the auto remediation service cannot identify a remedy, the node eventually moves to the Work Queue state.

  • If a health check fails and has a list of possible remedies, the ClusterWareAI software sorts the list, prioritizing by severity, impact, and confidence that the remedy will resolve the problem.

  • If more than one health check fails, the remedies from all failing health checks are considered together, and an appropriate remedy is selected based on overall severity, impact, and confidence across all failing checks.
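
The three cases above can be pictured with a short Python sketch. The ranking formula that the ClusterWareAI software actually uses is not described here, so the ordering below (highest confidence first, then lowest combined severity and impact) is purely an assumption for illustration. The `Remedy` class and the `rank_remedies` function are hypothetical names, not part of the product. The numeric values are taken from the sample cw-remedyctl output shown on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Remedy:
    name: str
    severity: float
    impact: float
    # Confidence per health check that this remedy can resolve.
    checks: dict = field(default_factory=dict)


def rank_remedies(remedies, failed_checks):
    """Pool the remedies for all failed checks and order them.

    ASSUMPTION: this sort key is hypothetical. It prefers higher
    confidence, then lower combined severity and impact.
    """
    candidates = []
    for remedy in remedies:
        confidences = [remedy.checks[c] for c in failed_checks if c in remedy.checks]
        if confidences:
            candidates.append((max(confidences), remedy))
    candidates.sort(key=lambda pair: (-pair[0], pair[1].severity + pair[1].impact))
    return [remedy.name for _, remedy in candidates]


REMEDIES = [
    Remedy("Wait10Seconds", severity=0.0, impact=0.0, checks={"ZombieProcesses": 0.5}),
    Remedy("RestartSlurmd", severity=0.5, impact=0.6, checks={"Slurmd": 0.85}),
]

# Two failing checks: remedies from both are pooled and ordered.
print(rank_remedies(REMEDIES, ["Slurmd", "ZombieProcesses"]))
# A failing check with no associated remedy yields an empty list.
print(rank_remedies(REMEDIES, ["BlockDevices"]))
```

An empty result corresponds to the first case in the list: no remedy can be identified, so the node would eventually move to the Work Queue state.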

Use the cw-healthctl tool to view the list of remedies associated with a health check:

cw-healthctl -i <health check name> ls -l

Remedy Impact and Severity

Each remedy has an associated severity and impact. The ClusterWareAI software uses these values when deciding which remedy to apply when a given health check fails. The severity indicates how disruptive a remedy is to running workloads. The impact reflects how much production capacity the remedy consumes. Both values use a 0.0 to 1.0 float scale. For example, a remedy with a severity or impact below 0.2, such as clearing counters, carries minimal risk. Rebooting a node has a higher impact on workloads and therefore has a default impact value of 0.9. You can adjust the severity and impact values of individual remedies to reflect their significance in your cluster.

Use the cw-remedyctl tool to adjust the impact and severity for a remedy:

cw-remedyctl -i node_reboot_gpu up impact=0.9 severity=0.7
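
If you generate such updates from a script, it can help to check the 0.0 to 1.0 range before calling the tool. The following Python sketch only builds the argument list in the form shown above and does not run anything. The `build_update_args` function is a hypothetical helper, and whether cw-remedyctl itself rejects out-of-range values is not stated here, so the check is an assumption made on the caller's side.

```python
def build_update_args(remedy, **fields):
    """Return the argv list for a cw-remedyctl update without running it."""
    args = ["cw-remedyctl", "-i", remedy, "up"]
    for name, value in fields.items():
        # This sketch only handles the two fields shown in the example above.
        if name not in ("impact", "severity"):
            raise ValueError(f"unsupported field in this sketch: {name}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
        args.append(f"{name}={value}")
    return args


print(" ".join(build_update_args("node_reboot_gpu", impact=0.9, severity=0.7)))
```

The returned list could be passed to `subprocess.run` on a host where the tool is installed.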

Each health check associated with a remedy has a confidence value. The confidence value indicates how well the remedy is expected to resolve the problem. Like impact and severity, confidence values use a 0.0 to 1.0 float scale. If a health check has multiple associated remedies, ARS uses the confidence value to help determine which remedy to apply to the node with the failed check.

Default Remedies and Actions

The ClusterWareAI software provides a set of default remedies and actions. View the default remedies and actions on the ARS Policy Page of the ClusterWareAI GUI, or use the ARS explorer tool or the cw-remedyctl tool to list the available remedies with their associated health checks and actions. For example, the cw-remedyctl tool shows all details about each remedy (output abbreviated):

[admin@head1 ~]$ cw-remedyctl ls -l
remedies
  RebootNodeOs
    after
      send_auto_fixed
      send_require_manual_fix
    description: Initiate an OS-level reboot via...
    impact: 0.8
    name: RebootNodeOs
    runnable: ClusterWareRebootRemediationPlan
    severity: 0.7

  RestartSlurmd
    checks
      Slurmd
        confidence: 0.85
        reason: Restarting the Slurm service can fix some issues related...
    description: Restart the Slurm compute node daemon (slurmd) on the node.
    impact: 0.6
    name: RestartSlurmd
    runnable: CommandRemediationPlan
    severity: 0.5

  Wait10Seconds
    arguments
      command: sleep_time_s: 10
    checks
      ZombieProcesses
        confidence: 0.5
        reason: Short wait can allow transient zombie cleanup before...
    description: Sleep for 10 seconds before...
    impact: 0.0
    name: Wait10Seconds
    runnable: WaitRemediationPlan
    severity: 0.0
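
To compare remedy values in a script, the indented listing can be read into nested dictionaries. The Python sketch below assumes that the real output follows the same indentation pattern as the abbreviated sample above, which is not guaranteed. The `parse_indented` function and the `SAMPLE` text are illustrative only. If the tool offers a structured output format, that would be a more robust choice.

```python
def parse_indented(text):
    """Turn an indentation-based listing into nested dictionaries."""
    root = {}
    # Stack of (indent, container) pairs; the root sits at indent -1.
    stack = [(-1, root)]
    for raw in text.splitlines():
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        line = raw.strip()
        # Pop back to the nearest container with a smaller indent.
        while stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1]
        key, sep, value = line.partition(": ")
        if sep:
            parent[key] = value  # "key: value" leaf, kept as a string
        else:
            child = {}
            parent[line] = child  # bare name opens a nested section
            stack.append((indent, child))
    return root


SAMPLE = """\
remedies
  RestartSlurmd
    checks
      Slurmd
        confidence: 0.85
    impact: 0.6
    severity: 0.5
  Wait10Seconds
    arguments
      command: sleep_time_s: 10
    impact: 0.0
    severity: 0.0
"""

remedies = parse_indented(SAMPLE)["remedies"]
for name, details in remedies.items():
    print(name, float(details["impact"]), float(details["severity"]))
```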

It is possible to create custom remedies that reuse the runnables provided by the default remedies. For example, you could create a remedy that runs a series of commands on a node based on an existing runnable. If you are interested in creating specialized remedies specific to your cluster environment, contact Penguin Computing for assistance.

Modify Remedies

You can modify the default remedies to update fields, such as the impact, severity, or associated health checks.

For example, to update the health checks associated with the node_reboot remedy:

cw-remedyctl -i node_reboot up checks=SystemErrors,ZombieProcesses

ARS Explorer and Interactive Tool

Use the ARS explorer tool to view the list of available health checks and associated remedies. You can format the tool output as a tree or as a table. For example (output abbreviated):

[admin@head1 ~]$ sudo /opt/scyld/clusterware-ars/bin/ars_explorer print --format table

Check                 Severity  Impact  Suggested Remedies         Confidence
AmdGpuHealth          0.8       0.8
AmdGpuRasErrors       0.8       0.7
AnsibleFailure        0.5       0.4
BlockDevices          0.6       0.5
ContainerRuntime      0         0       RestartContainerRuntime    0.85
.
.
.
ZombieProcesses       0.3       0.2     Wait10Seconds              0.5
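
Rows with empty Suggested Remedies and Confidence columns are checks that have no associated remedy, so a failure of one of them would eventually send the node to the Work Queue state. The Python sketch below picks those rows out of the table text. It assumes whitespace-separated columns and at most one remedy per row, as in the abbreviated sample, and the `checks_without_remedy` function is a hypothetical helper rather than part of the tool.

```python
def checks_without_remedy(table_text):
    """Return the names of checks whose row lists no suggested remedy."""
    missing = []
    for line in table_text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and the "." elision markers.
        if len(fields) < 3 or fields[0] == "Check":
            continue
        # A row with only check, severity, and impact has no remedy.
        if len(fields) == 3:
            missing.append(fields[0])
    return missing


TABLE = """\
Check                 Severity  Impact  Suggested Remedies         Confidence
AmdGpuHealth          0.8       0.8
AmdGpuRasErrors       0.8       0.7
AnsibleFailure        0.5       0.4
BlockDevices          0.6       0.5
ContainerRuntime      0         0       RestartContainerRuntime    0.85
.
.
.
ZombieProcesses       0.3       0.2     Wait10Seconds              0.5
"""

print(checks_without_remedy(TABLE))
```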

The ARS explorer tool also has an interactive test mode that shows which remedies would be selected if a compute node had one or more health check failures. Use the test command to explore scenarios and to help validate the impact, severity, and confidence values of your remedies and their associated health checks.

For example, if the ContainerRuntime, Kubelet, and ZombieProcesses health checks all fail, three remedies are suggested:

[admin@head1 ~]$ sudo /opt/scyld/clusterware-ars/bin/ars_explorer test

Checks                               Sorted Suggestions
[ ] AmdGpuHealth                     RestartKubelet
[ ] AmdGpuRasErrors                  RestartContainerRuntime
[ ] AnsibleFailure                   Wait10Seconds
[ ] BlockDevices
[X] ContainerRuntime
[ ] CpuGovernor
[ ] DcgmDiagnostics
[ ] EthernetLinkStatus
[ ] FabricmanagerLog
[ ] FilesystemCapacity
[ ] GpuClockDrift
[ ] GpuFunctionalTest
[ ] GpuSettings
[ ] GpuXidErrors
[ ] InfiniBandLinkStatus
[ ] InfiniBandPortErrors
[ ] InternetConnectivity
[ ] IpmiSensors
[ ] IpmiSystemEventLog
[ ] KernelModules
[X] Kubelet
[ ] MemoryConfiguration
[ ] MountAvailability
[ ] NcclBandwidthTest
[ ] NtpSynchronization
[ ] NvidiaGpuHealth
[ ] NvmeSmartHealth
[ ] PciDevices
[ ] RcclBandwidthTest
[ ] Slurmd
[ ] SystemErrors
[ ] SystemServices
[ ] SystemdService
[X] ZombieProcesses

The top suggestion, RestartKubelet, is what ARS would apply. If you want the selected remedies in a different priority order, use the cw-remedyctl tool to update the impact, severity, or confidence values of the remedies, then test the scenario again.