Configure Automated Remediation Service (ARS)
The clusterware-ars package is installed by default when you run the
cw-install script. However, additional configuration is required before you
can use the Automated Remediation Service (ARS) in your ICE ClusterWare™ cluster.
Important
The configuration requires you to power off all compute nodes that you want to monitor. Consider completing the configuration during a scheduled maintenance window or configuring nodes in batches to minimize impact on cluster availability.
If your cluster is in an air-gapped environment, contact Penguin Computing before you begin ARS configuration.
Prerequisites
Install and configure Slurm and strigger on an administration node or virtual machine separate from your head node. See Configure Slurm for details.
Install and configure a Sensu backend server. See Configure Sensu for details.
ClusterWare ARS Configuration
Configure ARS on a ClusterWare head node. In a multi-head cluster, only one head node is configured with ARS, and all configuration should take place on that head node.
Use the cw-clusterctl tool to link the Sensu backend to the ClusterWare software:
cw-clusterctl --set-health-location <Sensu backend server>
cw-clusterctl --set-health-username <Sensu username>
cw-clusterctl --set-health-password <Sensu password>
Update configuration files:
Update the /opt/scyld/clusterware/conf/base.ini file and add an entry to provide your MQTT password:
mosquitto.pubpass = <MQTT password>
Update the /opt/scyld/clusterware-ars/statemachine.cfg file to set:
- cw_user to the ClusterWare admin user.
- mqtt_host to localhost.
Update the /opt/scyld/clusterware-ars/skyhook-config.ini file to set cw_user to the ClusterWare admin user.
Update the /opt/scyld/clusterware-ars/remediation_config.json file to set:
- cw_user to the ClusterWare admin user.
- sensu_url to the Sensu backend server location.
- sensu_admin_user and sensu_admin_password to the Sensu admin username and password.
- lock_directory to a path that the ClusterWare admin user has write access to. The target directory will hold small lock files to ensure a node running a remediation plan does not need to be re-triaged until the plan completes. A sketch of these settings follows this list.
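For reference, the remediation_config.json settings listed above might look similar to the following sketch. The overall layout of the file and any additional keys depend on your installation; every value shown here is a placeholder, and the port assumes Sensu's default API port (8080).

    {
      "cw_user": "<ClusterWare admin user>",
      "sensu_url": "http://<Sensu backend server>:8080",
      "sensu_admin_user": "<Sensu admin username>",
      "sensu_admin_password": "<Sensu admin password>",
      "lock_directory": "/var/lib/clusterware-ars/locks"
    }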
Compute Node Configuration
Note
Enabling automated remediation on administration nodes, such as a cluster login node, is not recommended. Some of the automated remediation solutions can take the node offline and, if set on an administration node, could interrupt cluster availability.
To complete ARS configuration, create a compute node image with appropriate packages, then add that image to a boot configuration and set reserved attributes on the nodes you want to monitor.
All compute nodes you want to monitor must have a valid power URI configuration. See Compute Node Power Control for details.
Acquire the Sensu agent base configuration:
sudo curl -L https://docs.sensu.io/sensu-go/latest/files/agent.yml -o agent.yml
Modify the agent.yml file and update the backend-url to the Sensu backend server; a sketch of the change follows.
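Assuming the backend accepts agent connections on Sensu's default agent websocket port (8081), the relevant portion of agent.yml would look roughly like this; the hostname is a placeholder for your Sensu backend server:

    ---
    backend-url:
      - "ws://sensu.example.com:8081"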
Within ClusterWare, create an image based on the DefaultImage:
cw-imgctl -i DefaultImage clone name=arsImage
On the Slurm controller node, add the new image:
slurm-cw.setup update-image arsImage
Within ClusterWare, add the Sensu agent to the node image:
cw-modimg -i arsImage --copyin "sensu_stable.repo" "/etc/yum.repos.d/sensu_stable.repo" \
    --execute "mkdir -p /etc/sensu" --copyin "agent.yml" "/etc/sensu/agent.yml"
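The cw-modimg command above expects a sensu_stable.repo file in your working directory. If you do not already have one from setting up the Sensu backend, a typical packagecloud-style definition looks roughly like the following sketch; verify the URLs against the current Sensu installation documentation before using it:

    [sensu_stable]
    name=sensu_stable
    baseurl=https://packagecloud.io/sensu/stable/el/$releasever/$basearch
    gpgkey=https://packagecloud.io/sensu/stable/gpgkey
    repo_gpgcheck=1
    gpgcheck=0
    enabled=1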
Enter a chroot to configure the image:
cw-modimg -i arsImage --chroot --overwrite --upload
Configure the image with the sensu-agent service:
firewall-offline-cmd --zone=public --add-port=2379/tcp --add-port=2380/tcp \
    --add-port=3000/tcp --add-port=6060/tcp --add-port=8080/tcp --add-port=8081/tcp \
    --add-port=3030/tcp --add-port=3030/udp --add-port=3031/tcp --add-port=8125/udp
dnf clean all
dnf install sensu-go-agent -y
systemctl enable sensu-agent
sed -i "s|^ExecStart=.*$|ExecStart=/bin/sh -c '/usr/sbin/sensu-agent start \
    -c /etc/sensu/agent.yml --name \$\$(hostname -s)'|" /usr/lib/systemd/system/sensu-agent.service
Install the clusterware-ansible package:
dnf install -y --nogpgcheck --releasever=$RELEASEVER clusterware-ansible
Add the Ansible collections:
export LC_ALL=C.UTF-8
/opt/scyld/clusterware-ansible/env/bin/ansible-galaxy collection install community.general
/opt/scyld/clusterware-ansible/env/bin/ansible-galaxy collection install sensu.sensu_go
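As an optional check that both collections landed in the bundled Ansible environment, you can list the installed collections:

/opt/scyld/clusterware-ansible/env/bin/ansible-galaxy collection list | grep -E 'community\.general|sensu\.sensu_go'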
Install Python dependencies:
dnf install -y python3-pip
python3 -m pip install requests paho-mqtt psutil
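A quick way to verify that the Python dependencies import cleanly inside the chroot, shown here as an optional sanity check:

python3 -c "import requests, paho.mqtt.client, psutil; print('ARS Python dependencies OK')"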
Exit the chroot. The image contents are automatically re-packed and replaced.
Create a new boot configuration that references the new image:
cw-add-boot-config --boot-config arsBoot --image arsImage
Create a new attribute group:
cw-attribctl create name=arsAttribs
Add the boot configuration and set other reserved attributes on the attribute group:
cw-attribctl -i arsAttribs set _boot_config=arsBoot _ars_state=new_node _ars_groups=compute,gpu \
    _ansible_pull=http://<head node>/api/v1/repo/healthiso/content/health.git:health.yaml \
    _ansible_retries=tries=3,delay=30,maxwait=300 _ansible_pull_args="--full -i inventory.ini"
Note
The default health checks have subscriptions to either a compute or a gpu ARS group. This example command adds both ARS groups to the attribute group. Review the defaults/check-bundle.yml file on your Sensu backend node to see the subscription mapping and adjust ARS group membership accordingly; a representative excerpt is sketched after this note.
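The exact contents of defaults/check-bundle.yml depend on your installation, but if the checks are defined as standard Sensu Go check resources, the group mapping is carried by each check's subscriptions list. A hypothetical entry scoped to the compute ARS group might resemble the following; the check name and command are placeholders:

    type: CheckConfig
    api_version: core/v2
    metadata:
      name: node-health
    spec:
      command: /opt/scyld/checks/node_health.sh
      interval: 60
      subscriptions:
        - compute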
Join the compute nodes you want to monitor to the attribute group:
cw-nodectl -i n[0-15] join arsAttribs
Power off the compute nodes:
cw-nodectl -i n[0-15] power off
Start the ars-state-machine, ars-skyhook, and ars-auto-remediation services:
systemctl start ars-state-machine ars-skyhook ars-auto-remediation
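As an optional step before checking the nodes, you can confirm that all three services came up using standard systemd tooling:

systemctl --no-pager status ars-state-machine ars-skyhook ars-auto-remediation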
After the services start, the compute nodes should power on, move to the Provisioning state, and start booting. You can check the node state using the --ars argument:
[admin@head1]$ cw-nodectl -i n[0-15] status --ars
n[0-1]    available
n[2-15]   provisioning
Add Compute Nodes after Initial ARS Configuration
You can add nodes to the attribute group you created at any time after the initial configuration to start health monitoring with ARS.
Join the compute nodes you want to monitor to the attribute group:
cw-nodectl -i n[16-20] join arsAttribs
Power off the compute nodes:
cw-nodectl -i n[16-20] power off
Restart the ars-state-machine service:
systemctl restart ars-state-machine
After the service restarts, the nodes should power on, move to the Provisioning state, and start booting.
[admin@head1]$ cw-nodectl -i n[16-20] status --ars
n[16-20]  provisioning