Using Ansible

A compute node can be configured to execute an Ansible playbook at boot time or after the node is up.

Note

Ubuntu 20.04 is not supported with the clusterware-ansible package. See Supported Distributions and Features for alternatives.

In the following example, the cluster administrator creates a Git repository hosted by the ICE ClusterWare™ head nodes, adds an extremely simple Ansible playbook to that Git repository, and assigns a compute node to execute that playbook.

  1. Install the clusterware-ansible package into the image (or images) that you want to support execution of an Ansible playbook:

    cw-modimg -i DefaultImage --install clusterware-ansible --upload --overwrite
    
  2. [Optional] Amend the PATH variable to include the Git binaries that are provided as part of the clusterware package in /opt/scyld/clusterware/git/.

    export PATH=/opt/scyld/clusterware/git/bin:${PATH}
    

    Note

    This step is not strictly necessary, though the git in that subdirectory is often significantly more recent than the version normally provided by a base distribution.

  3. Add your personal public key to your ClusterWare admin account:

    cw-adminctl up keys=@/full/path/.ssh/id_rsa.pub
    

    This key is populated into the root user's (or the _remote_user's) authorized_keys file on newly booted compute nodes. It also provides simple SSH access to the Git repository. See Compute Node Remote Access for details.

  4. [Optional] Add the localhost's host keys to a personal known_hosts file to avoid an SSH warning that can interrupt scripting:

    ssh-keyscan localhost >> ~/.ssh/known_hosts
    
  5. Create a ClusterWare Git repository called "ansible". This repository defaults to public, meaning it is accessible read-only via unauthenticated HTTP access to the head nodes. It should not include unprotected sensitive passwords or keys:

    cw-clusterctl gitrepos create name=ansible
    

    Note

    Being unauthenticated means the HTTP access mechanism does not allow for git push or other write operations. Alternatively, the repository can be marked private (public=False), although it then cannot be used for a client's ansible-pull.

    Initially the repository will include a placeholder text file that can be deleted.

  6. Clone the Git repo to localhost over an SSH connection:

    git clone cwgit@localhost:ansible
    

    Tip

    You can create the clone on any machine that has the appropriate private key and can reach the SSH port of a head node.

  7. Create a simple Ansible playbook to demonstrate the functionality:

    cat >ansible/HelloWorld.yaml <<EOF
    ---
    - name: This is a hello-world example
      hosts: n*.cluster.local
      tasks:
        - name: Create a file called '/tmp/testfile.txt' with the content 'hello world'
          copy:
            content: hello world
            dest: /tmp/testfile.txt
    EOF
    
  8. Add the new playbook to the "ansible" Git repo:

    bash -c "\
      cd ansible; \
      git add HelloWorld.yaml; \
      git -c user.name=Test -c user.email=test@test.test \
             commit --message 'Adding a test playbook' HelloWorld.yaml; \
      git push; \
    "
    

    Note

    Multiple playbooks can co-exist in the Git repo.

    In a cluster with multiple head nodes, an updated Git repository is replicated to the other head nodes in the cluster, so an ansible-pull from any client against any head node sees the same playbook and the same commit history. Replication to the other head nodes can take several seconds to complete.

  9. After the playbook is available in the Git repo, configure the compute node to execute ansible-pull to download the playbook at boot time using the _ansible_pull attribute:

    cw-nodectl -i n1 set _ansible_pull=git:ansible/HelloWorld.yaml
    

    If needed, use optional attributes to augment the ansible-pull command:

    • Use the _ansible_pull_args attribute to specify any arguments to the underlying ansible-pull command (see the example at the end of this step).

    • Use the _ansible_retries attribute to specify a number of ansible-pull attempts, a delay interval between attempts, and a maximum wait time after first attempt. All are optional and can be used individually or in combination. See Configuring Retries for Ansible Scripts for details.

    Alternatively, to download the playbook from an external Git repository on the server named gitserver:

    cw-nodectl -i n1 set _ansible_pull=http://gitserver/path/to/repo/root:HelloWorld.yaml
    

    Note the colon between the repository root and the HelloWorld.yaml component in the URL above. That colon is required to mark the root of the repository within the URL: everything before the colon must point to a cloneable Git repository, and any path components after the colon are assumed to exist within that repository.

    Tip

    Either format can optionally end with "@<gitrev>", where <gitrev> is a specific commit, tag, or branch in the target git repo.
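
    For example, the optional pieces described above can be combined. The following is only a sketch: the branch name main and the ansible-pull --only-if-changed flag (which runs the playbook only when the repository has changed since the last pull) are illustrative, not requirements:

    cw-nodectl -i n1 set _ansible_pull=git:ansible/HelloWorld.yaml@main \
        _ansible_pull_args='--only-if-changed'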

  10. Reboot the node and wait for it to boot to an up status after the playbook executes:

    cw-nodectl -i n1 reboot then waitfor up
    

    During playbook execution, the node remains in the booting status, changing to up after the playbook completes, assuming the playbook is not fatal to the node. When executing a lengthy playbook, the status may time out to down (with no ill effect) before switching to up once the playbook finishes. Administrators are advised to log the Ansible progress to a known location on the booting node, such as /var/log/ansible.log.
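
    One way to arrange such logging is sketched below. It assumes the commands are run inside the image before it is uploaded (for example from a chroot shell into the unpacked image), that no /etc/ansible/ansible.cfg already exists there, and that the boot-time ansible-pull reads that standard configuration path; log_path is the standard Ansible setting that sends output to a log file:

    # Sketch: create an Ansible configuration inside the image so that
    # ansible-pull appends its output to /var/log/ansible.log.
    cat >/etc/ansible/ansible.cfg <<EOF
    [defaults]
    log_path = /var/log/ansible.log
    EOF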

  11. Verify that the HelloWorld.yaml playbook executed:

    cw-nodectl -in1 exec cat /tmp/testfile.txt ; echo
    

Updating Ansible Playbook Outside of Boot Time

The clusterware-ansible package supports another attribute, _ansible_pull_now, which uses the same syntax as _ansible_pull.

When the attribute is present and the service is enabled and started, the node downloads and executes the playbook during the node's next status update event, which occurs every 10 seconds by default. Once the node completes execution of the playbook, it directs the head node to prepend "done" to the _ansible_pull_now attribute to ensure the script does not run again. This differs from _ansible_pull, which only applies when a node is booting.

To enable and use _ansible_pull_now:

  1. Enable the cw-ansible-pull-now service inside the chroot image:

    systemctl enable cw-ansible-pull-now
    
  2. On a running compute node, start the service:

    systemctl start cw-ansible-pull-now
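
With the service enabled and started, assign the playbook using the same syntax as _ansible_pull; for example, reusing the repository and playbook created earlier:

cw-nodectl -i n1 set _ansible_pull_now=git:ansible/HelloWorld.yaml

On its next status update the node downloads and executes the playbook, after which the attribute value is prefixed with "done" so the playbook does not run again.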
    

Configuring Retries for Ansible Scripts

The _ansible_retries attribute is used with either _ansible_pull or _ansible_pull_now to set up a retry mechanism for ansible-pull commands. Its values include tries, delay, and maxwait. For example, the following command attempts to run the Ansible script up to 4 times or for up to 60 seconds, whichever limit is reached first, with a 10-second delay between attempts:

cw-nodectl -i n1 set _ansible_pull=git:ansible/HelloWorld.yaml _ansible_retries=tries=4,delay=10,maxwait=60

If no delay is specified, the attempts occur back to back. If no tries value is specified, the node retries as many times as possible for 60 seconds. If no maxwait is specified, it tries 4 times. If both tries and maxwait are specified, as in the example above, it tries up to 4 times or for up to 60 seconds, whichever limit is reached first. Retrying stops as soon as ansible-pull succeeds.
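
As a further illustration (a sketch based on the behavior described above; the node name n1 and the specific values are only examples), omitting tries lets the node keep retrying every 5 seconds until the playbook succeeds or the 120-second maxwait expires:

cw-nodectl -i n1 set _ansible_pull=git:ansible/HelloWorld.yaml _ansible_retries=delay=5,maxwait=120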

Using Node Attributes with Ansible

ClusterWare administrators can change how playbooks run by reading ClusterWare node attributes into Ansible variables. The clusterware-node package includes a library of shell functions for this purpose. In particular, attribute_value reads an attribute from the node's configuration.

Inside the playbook, you can register a variable from the output of a command, and that command can call the attribute_value function:

- name: Read the slurm_server attribute
  shell:
    executable: /bin/bash
    cmd: "source /opt/scyld/clusterware-node/functions.sh && attribute_value slurm_server"
  register: slurm_server

This snippet registers an Ansible variable called slurm_server that holds the result of reading the node attribute of the same name. Any ClusterWare or user-defined attribute can be referenced in this way. If a default value is needed, it can be given as a second argument: attribute_value attrname defaultvalue.
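
Because register stores the full command result, the attribute's value itself is available in the variable's stdout field. For example, a later task might print it (a minimal sketch; the debug task is only for illustration):

- name: Show the value read from the slurm_server attribute
  debug:
    msg: "The slurm_server attribute is {{ slurm_server.stdout }}"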

Applying Ansible Playbooks to Images

Cluster administrators commonly create and deploy a golden image containing all of the necessary libraries, tools, and applications. Because software updates arrive frequently, the golden image can become out of date soon after it is created. With this in mind, many production clusters collect the required changes into an Ansible playbook and then deploy that playbook to ClusterWare nodes at boot time using the _ansible_pull functionality, or to already-booted nodes using the _ansible_pull_now functionality.

Applying changes from an Ansible playbook adds a delay between when the node begins booting and when it is fully booted and ready to accept jobs. Eventually this delay becomes cumbersome, and the cluster administrator will want to flush the changes out of the playbook and into the image itself. The cw-modimg --deploy <PATH> command supports executing a local playbook within the image's chroot.

Using this functionality requires that the clusterware-ansible package is installed on the head node and that the community.general Ansible Galaxy collection is installed for the chroot connection type. The following pair of commands installs the package on the system and installs the Ansible collection for the root user:

sudo dnf install --assumeyes --enablerepo=cw* --enablerepo=scyld* clusterware-ansible
sudo -E /opt/scyld/clusterware-ansible/env/bin/ansible-galaxy \
  collection install community.general

The collection needs to be available to the root user because the ansible-playbook command is executed using sudo to allow full write permissions to all files within the chroot.

The cw-modimg command assumes that any path that ends with .yaml is an Ansible playbook and uses the configured software to execute that playbook within the chroot.

cw-modimg -iDefaultImage --deploy HelloWorld.yaml \
  --progress none --upload --overwrite --discard-on-error

The new --discard-on-error argument prevents the tool from asking for user confirmation before uploading. It assumes that the user wants to keep the result of a successful run, but stop if an error was encountered. The following is an example of the expected output from the previous command:

[admin@cwhead ~]$ cw-modimg -iDefaultImage --deploy HelloWorld.yaml \
     --progress none --upload --overwrite --discard-on-error
Treating HelloWorld.yaml as an ansible playbook
Downloading and unpacking image DefaultImage
Executing step: Ansible ['/opt/scyld/clusterware-ansible/bin/ansible-playbook', 'HelloWorld.yaml']
  DefaultImage : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
  step completed in 0:00:06.2
Executing step: Upload
Repacking DefaultImage
   fixing SELinux file labels...
     done.
Checksumming...
Cleaning up.
Checksumming image DefaultImage
Replacing remote image.
  step completed in 0:09:33.7

The cw-modimg --deploy <PATH> command also recognizes when the provided path is an _ansible_pull-compatible URL. If the clusterware-ansible package is installed within the image, the cw-modimg command triggers execution within the unpacked image via the same mechanism as setting the _ansible_pull reserved attribute. An unpacked image will not include files that are normally generated during the boot process (such as attributes.ini, heads.cache, or other files in /opt/scyld/clusterware-node/etc), so playbooks run this way must not rely on such files.
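
For example (a sketch reusing the ClusterWare-hosted repository and playbook created earlier, and assuming clusterware-ansible is installed in DefaultImage as shown in the first step above), the same playbook can be deployed into the image directly from that repository:

cw-modimg -iDefaultImage --deploy git:ansible/HelloWorld.yaml \
  --progress none --upload --overwrite --discard-on-error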