Configure Automated Remediation Service (ARS)

The clusterware-ars package is installed by default when you run the cw-install script. However, additional configuration is required before you can use the automated remediation service (ARS) in your ICE ClusterWare™ cluster.

Important

The configuration requires you to power off all compute nodes that you want to monitor. Consider completing the configuration during a scheduled maintenance window or configuring nodes in batches to minimize impact on cluster availability.

If your cluster is in an air-gapped environment, contact Penguin Computing before you begin ARS configuration.

Prerequisites

  1. Install and configure Slurm and strigger on an administration node or virtual machine separate from your head node. See Configure Slurm for details.

  2. Install and configure a Sensu backend server. See Configure Sensu for details.

ClusterWare ARS Configuration

Configure ARS on a ClusterWare head node. In a multi-head cluster, configure ARS on only one head node and perform all of the following steps on that same node.

  1. Use the cw-clusterctl tool to link the Sensu backend to the ClusterWare software:

    cw-clusterctl --set-health-location <Sensu backend server>
    cw-clusterctl --set-health-username <Sensu username>
    cw-clusterctl --set-health-password <Sensu password>
    
  2. Update configuration files.

    1. Update the /opt/scyld/clusterware/conf/base.ini file and add an entry to provide your MQTT password:

      mosquitto.pubpass = <MQTT password>
      
    2. Update the /opt/scyld/clusterware-ars/statemachine.cfg file to set:

      • cw_user to the ClusterWare admin user.

      • mqtt_host to localhost.

    3. Update the /opt/scyld/clusterware-ars/skyhook-config.ini file to set cw_user to the ClusterWare admin user.
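      For reference, a minimal illustrative excerpt of both files, assuming a simple key = value format and a placeholder admin user named admin; edit the existing files in place:

      # statemachine.cfg (excerpt)
      cw_user = admin
      mqtt_host = localhost

      # skyhook-config.ini (excerpt)
      cw_user = admin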

    4. Update the /opt/scyld/clusterware-ars/remediation_config.json file to set the following (a sketch of the edited file appears after this list):

      • cw_user to the ClusterWare admin user.

      • sensu_url to the Sensu backend server location.

      • sensu_admin_user and sensu_admin_password to the Sensu admin username and password.

      • lock_directory to a path writable by the ClusterWare admin user. This directory holds small lock files that prevent a node running a remediation plan from being re-triaged until the plan completes.
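      A minimal sketch of the edited file, assuming standard JSON syntax, placeholder values, and Sensu's default API port (8080); adjust the existing file in place and keep any other keys at their shipped defaults:

      {
        "cw_user": "admin",
        "sensu_url": "http://<Sensu backend server>:8080",
        "sensu_admin_user": "<Sensu username>",
        "sensu_admin_password": "<Sensu password>",
        "lock_directory": "/var/tmp/ars-locks"
      }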

Compute Node Configuration

Note

Enabling automated remediation on administration nodes, such as a cluster login node, is not recommended. Some automated remediation actions can take a node offline and, if enabled on an administration node, could interrupt cluster availability.

To complete ARS configuration, create a compute node image with appropriate packages, then add that image to a boot configuration and set reserved attributes on the nodes you want to monitor.

  1. Verify that all compute nodes you want to monitor have a valid power URI configuration. See Compute Node Power Control for details.

  2. Acquire the Sensu agent base configuration:

    sudo curl -L https://docs.sensu.io/sensu-go/latest/files/agent.yml -o agent.yml
    
  3. Modify the agent.yml file and update the backend-url value to point to your Sensu backend server.
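    For example, assuming the backend listens on Sensu's default agent WebSocket port (8081):

    backend-url:
      - "ws://<Sensu backend server>:8081"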

  4. Within ClusterWare, create an image based on the DefaultImage:

    cw-imgctl -i DefaultImage clone name=arsImage
    
  5. On the Slurm controller node, add the new image:

    slurm-cw.setup update-image arsImage
    
  6. Within ClusterWare, add the Sensu agent repository and base configuration to the node image. This step assumes the sensu_stable.repo file defining the Sensu package repository is in the working directory alongside agent.yml:

    cw-modimg -i arsImage --copyin "sensu_stable.repo" "/etc/yum.repos.d/sensu_stable.repo" \
      --execute "mkdir -p /etc/sensu" --copyin "agent.yml" "/etc/sensu/agent.yml"
    
  7. Enter a chroot to configure the image:

    cw-modimg -i arsImage --chroot --overwrite --upload
    
    1. Configure the image with the sensu-agent service:

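      # Open the ports used by the Sensu components (agent socket and API, backend
      # API and WebSocket, web UI, embedded etcd, and StatsD) in the image's firewall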
      firewall-offline-cmd --zone=public --add-port=2379/tcp --add-port=2380/tcp \
      --add-port=3000/tcp --add-port=6060/tcp --add-port=8080/tcp --add-port=8081/tcp \
      --add-port=3030/tcp --add-port=3030/udp --add-port=3031/tcp --add-port=8125/udp
      
      dnf clean all
      dnf install sensu-go-agent -y
      systemctl enable sensu-agent
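      # Rewrite the unit's ExecStart so each agent registers under the node's own
      # short hostname; systemd expands the escaped $$ to a literal $, so
      # $(hostname -s) is evaluated by the shell at service start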
      sed -i "s|^ExecStart=.*$|ExecStart=/bin/sh -c '/usr/sbin/sensu-agent start \
      -c /etc/sensu/agent.yml --name \$\$(hostname -s)'|" /usr/lib/systemd/system/sensu-agent.service
      
    2. Install the clusterware-ansible package:

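      # $RELEASEVER is assumed to be pre-set in the cw-modimg chroot environment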
      dnf install -y --nogpgcheck --releasever=$RELEASEVER clusterware-ansible
      
    3. Add the Ansible collections:

      export LC_ALL=C.UTF-8
      /opt/scyld/clusterware-ansible/env/bin/ansible-galaxy collection install community.general
      /opt/scyld/clusterware-ansible/env/bin/ansible-galaxy collection install sensu.sensu_go
      
    4. Install Python dependencies:

      dnf install -y python3-pip
      python3 -m pip install requests paho-mqtt psutil
      
    5. Exit the chroot. The image contents are automatically re-packed and replaced.

  8. Create a new boot configuration that references the new image:

    cw-add-boot-config --boot-config arsBoot --image arsImage
    
  9. Create a new attribute group:

    cw-attribctl create name=arsAttribs
    
  10. Add the boot configuration and set other reserved attributes on the attribute group:

    cw-attribctl -i arsAttribs set _boot_config=arsBoot _ars_state=new_node _ars_groups=compute,gpu \
      _ansible_pull=http://<head node>/api/v1/repo/healthiso/content/health.git:health.yaml \
      _ansible_retries=tries=3,delay=30,maxwait=300 _ansible_pull_args="--full -i inventory.ini"
    

    Note

    The default health checks subscribe to either the compute or the gpu ARS group. The example command above adds both ARS groups to the attribute group. Review the defaults/check-bundle.yml file on your Sensu backend node to see the subscription mapping, and adjust ARS group membership accordingly. An illustrative check definition follows.
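    For illustration only, a Sensu Go check targeting the compute ARS group would carry a matching subscription. This is a hypothetical check definition, not the contents of the shipped bundle:

    type: CheckConfig
    api_version: core/v2
    metadata:
      name: check-node-health
    spec:
      command: /usr/local/bin/check-node-health.sh
      interval: 60
      subscriptions:
        - compute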

  11. Join the compute nodes you want to monitor to the attribute group:

    cw-nodectl -i n[0-15] join arsAttribs
    
  12. Power off the compute nodes:

    cw-nodectl -i n[0-15] power off
    
  13. Start the ars-state-machine, ars-skyhook, and ars-auto-remediation services:

    systemctl start ars-state-machine ars-skyhook ars-auto-remediation
    
  14. After the services start, the compute nodes should power on, move to the Provisioning state, and start booting. You can check the node state using the --ars argument:

    [admin@head1]$ cw-nodectl -i n[0-15] status --ars
    n[0-1] available
    n[2-15] provisioning
    

Add Compute Nodes after Initial ARS Configuration

You can add nodes to the attribute group you created at any time after the initial configuration to start health monitoring with ARS.

  1. Join the compute nodes you want to monitor to the attribute group:

    cw-nodectl -i n[16-20] join arsAttribs
    
  2. Power off the compute nodes:

    cw-nodectl -i n[16-20] power off
    
  3. Restart the ars-state-machine service:

    systemctl restart ars-state-machine
    
  4. After the service restarts, the nodes should power on, move to the Provisioning state, and start booting.

    [admin@head1]$ cw-nodectl -i n[16-20] status --ars
    n[16-20] provisioning