Configure Multi-Tenant Cluster within ICE ClusterWare#

Note

Multi-tenant clusters are available only if you have purchased multi-tenant support. Contact Penguin Computing to learn more.

After running the multi-tenancy setup script, some additional ICE ClusterWare™ configuration is required before creating your first customer tenancy.

  1. Finalize Ethernet switch configuration in the ClusterWare software. See Configure Multi-Tenant Switches below for details.

  2. Add the InfiniBand network to the ClusterWare software. See Add and Configure Multi-Tenant InfiniBand Network below for details.

  3. Identify compute node switch connections using the _neighbors reserved attribute. See Identify Compute Node and Switch Connectivity below for details.

  4. Prepare compute nodes to be added to customer tenancies. See Add Compute Nodes to Quarantine and Available Tenancies below for details.

  5. Create a test tenancy to evaluate configuration. See Create Test Tenancy below for details.

Configure Multi-Tenant Switches#

  1. Add the MTSpineSwitches attribute group to the spine naming pool:

    cw-clusterctl pools -i spine update group=MTSpineSwitches
    
  2. Verify the spine switch authentication credentials are visible in the MTSpineSwitches attribute group:

    cw-switchctl -i %MTSpineSwitches ls -l
    

    The command should show values for the _remote_user and _remote_pass reserved attributes for all switches in the MTSpineSwitches attribute group. If either attribute is missing for a given switch, that switch cannot access the services needed for multi-tenancy.

  3. Add the MTLeafSwitches attribute group to the leaf naming pool:

    cw-clusterctl pools -i leaf update group=MTLeafSwitches
    
  4. Verify the leaf switch authentication credentials are visible in the MTLeafSwitches attribute group:

    cw-switchctl -i %MTLeafSwitches ls -l
    
  5. Install the sshpass package on the super-cluster head nodes so that the _remote_user and _remote_pass credentials can be used to log in to the switches.

  6. Install the clusterware-node package on each switch using the same pre-installer script you would use on a compute node. See Support for Diskful Nodes for details. While technically optional for multi-tenant clusters, this step is highly recommended.

Add and Configure Multi-Tenant InfiniBand Network#

Configure your InfiniBand network so ClusterWare can send tenancy configuration updates to your subnet manager software. Currently, OpenSM is the supported subnet manager; it should have been configured as part of the initial multi-tenancy installation. See Configure OpenSM with ICE ClusterWare for details.

To create the InfiniBand network within ClusterWare, run the following:

cw-clusterctl ibnets create name=<network name> host=<name>@<subnet manager host>

Where:

  • <network name> is the name of the InfiniBand network within the ClusterWare software. Names must start with an alphabetic character, not a number.

  • <name> is the username for the subnet manager software user account.

  • <subnet manager host> is the IP address or hostname for the system where the subnet manager software is installed.

For example:

cw-clusterctl ibnets create name="IB Network" host=root@172.18.244.55

When compute nodes are added to a tenancy (for example, when a tenancy is created), ClusterWare automatically updates the InfiniBand partition configuration so the nodes are isolated from other tenancies. See Tenancy Network Isolation for details.
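
To confirm the network was created, you can list the configured InfiniBand networks. This is a minimal sketch that assumes the ibnets subcommand supports the same ls -l listing form used elsewhere in this guide:

cw-clusterctl ibnets ls -l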

Identify Compute Node and Switch Connectivity#

Run the following script on all compute nodes that will be added to tenancies to identify the switch ports each compute node is connected to:

cw-nodectl -i <node names> script lldp_neighbors

See Identify Nearby Network Interfaces for more information and an example.
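
For example, using a hypothetical node named n001:

cw-nodectl -i n001 script lldp_neighbors

The connections identified here correspond to the _neighbors reserved attribute referenced earlier.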

Important

If you make changes to your networking, re-run the script to update the recorded network connections. Failing to do so could compromise network isolation across tenancies.

Add Compute Nodes to Quarantine and Available Tenancies#

When you ran the setup-for-mt script, the ClusterWare software created two tenancies: Available and Quarantine. The Quarantine tenancy is used to complete health checks and remediation on compute nodes before they are added to a customer tenancy. The Available tenancy is used for healthy compute nodes that can be assigned to a customer tenancy.

The Quarantine and Available tenancies do not contain any compute nodes when created.

  1. Assign all compute nodes that you want to add to tenancies to the Quarantine tenancy:

    cw-clusterctl tenancies -i Quarantine assign <node list>
    

    After the compute nodes are added to the Quarantine tenancy, they are powered on and a series of health checks is run.

  2. Monitor node health status within the Quarantine tenancy by watching the _aim_current_state reserved attribute.

  3. When a compute node has a status of Available, remove it from the Quarantine tenancy and add it to the Available tenancy (a complete example follows this procedure):

    cw-clusterctl tenancies -i Quarantine unassign <healthy node list>
    cw-clusterctl tenancies -i Available assign <healthy node list>
    

See Available and Quarantine Tenancies for additional information about these tenancies.
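
A minimal end-to-end sketch of this workflow, using a hypothetical node named n001 and assuming the tenancy long listing (ls -l) includes the _aim_current_state attribute:

# Place the node in Quarantine so health checks run
cw-clusterctl tenancies -i Quarantine assign n001

# Watch the node until its status reaches Available
cw-clusterctl tenancies -i Quarantine ls -l

# Move the healthy node into the Available tenancy
cw-clusterctl tenancies -i Quarantine unassign n001
cw-clusterctl tenancies -i Available assign n001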

Create Test Tenancy#

While optional, creating a test tenancy is recommended to verify that node isolation and other configuration are set up properly.

  1. Create a test tenant. See Create a Tenant for details.

  2. Create a test tenancy. See Create a Tenancy for details.

  3. Validate that the tenancy is working properly as the superadministrator.

    1. View the list of compute nodes within the tenancy:

      cw-clusterctl tenancies -i <tenancy name> ls -l
      

      The number of nodes should match the number you specified in the Terraform file used to create the tenancy.

    2. Remove a node from the tenancy:

      cw-clusterctl tenancies -i <tenancy name> unassign <node name>
      cw-clusterctl tenancies -i Quarantine assign <node name>
      
    3. View the list of compute nodes again:

      cw-clusterctl tenancies -i <tenancy name> ls -l
      

      The removed node should no longer be in the list of nodes.

  4. Validate that the tenancy is working properly as the tenancy administrator.

    1. Log in to the tenancy head node as the Managed Tenant user.

    2. View the list of compute nodes within the tenancy:

      cw-nodectl list
      

      The list of nodes should match the list you saw from the super-cluster head node.

    3. Reboot a compute node:

      cw-nodectl -i <node name> reboot then waitfor up
      
    4. Use Slurm to run a simple job on a compute node, such as printing the hostname.
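
      For example, a minimal check (assuming the standard Slurm client commands are available on the tenancy head node) is to print the hostname of one allocated node:

      srun -N 1 hostname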

  5. Delete the tenancy. See Delete a Tenancy and Tenant for details.

  6. Verify the compute nodes were returned to the Quarantine tenancy:

    cw-clusterctl tenancies -i Quarantine ls -l
    
  7. Delete the tenant. See Delete a Tenancy and Tenant for details.

  8. Move the healthy compute nodes to the Available tenancy:

    cw-clusterctl tenancies -i Quarantine unassign <healthy node list>
    cw-clusterctl tenancies -i Available assign <healthy node list>