Known Issues And Workarounds#

The following are significant known issues in the latest ICE ClusterWare™ release, along with suggested workarounds.

  • Each compute node should have a unique UUID. In rare cases, different nodes can end up with the same UUID; this can occur when a compute node is down while you change the IP address on another node. Duplicate node UUIDs are unlikely to impact cluster functionality, but can cause problems if you switch to IPv6 addresses given out via DHCPv6. You can review node UUIDs by running cw-nodectl --fields hardware.product_uuid ls -l.

  • If a booting compute node downloads the .efi binary from the ClusterWare head node and then fails with an "Access Denied" or "Permission Denied" error, check if Secure Boot is enabled on the node. The ClusterWare software is not currently compatible with Secure Boot. See Securing the Cluster for ClusterWare security features and options.

  • If an image is missing the localhost entries in the /etc/hosts file, you can re-add the entries by:

    1. Running cw-modimg --chroot into the image.

    2. Re-adding the localhost entries to the /etc/hosts file.

    3. Setting the /etc/hosts file permissions to 644.

    4. Exiting and saving the image.
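The edits made inside the chroot can be sketched as follows. This is illustrative only: /tmp/demo-hosts stands in for the image's /etc/hosts (inside a cw-modimg --chroot session you would edit /etc/hosts directly), and the entry wording follows the stock RHEL localhost lines.

```shell
# Sketch of restoring localhost entries. /tmp/demo-hosts stands in for the
# image's /etc/hosts; inside a "cw-modimg --chroot" session you would edit
# /etc/hosts itself.
HOSTS=/tmp/demo-hosts
cat > "$HOSTS" <<'EOF'
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
EOF
chmod 644 "$HOSTS"                # step 3: restore standard permissions
stat -c '%a' "$HOSTS"             # prints 644
```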

  • If a message similar to No free leases appears in the ClusterWare DHCP log after replacing a node, identify the root cause by reviewing the following:

    1. Confirm that the MAC address is assigned to the node you expect by running cw-nodectl -i <MAC address> ls, and that the MAC address DHCP reports matches that node.

    2. Confirm that the DHCP request is arriving on the expected interface. The network interface on the head node may have been configured with a different network than the IP that was supposed to be given out. If so, correct the wiring or the IP address on that interface.

    3. Make sure ClusterWare knows about the network that contains the IP you are giving out. Run cw-clusterctl nets ls -l and confirm that the network containing the expected IP is shown. You can correct the network definition using the cw-clusterctl nets command.

    4. If the previous checks all pass, examine the generated dhcpd.conf and dhcpd.leases files in /opt/scyld/clusterware-iscdhcpd/conf. The dhcpd.conf file should contain a subnet definition matching your network range, and the dhcpd.leases file should contain an entry pairing the correct MAC address (hardware ethernet) with the correct IP address (fixed-address). If a MAC address appears multiple times in the file, review the final entry, as later entries overwrite earlier ones. If either file is wrong, re-check your settings per the previous steps, then restart the clusterware service with systemctl restart clusterware. After a few seconds the dhcpd.conf and dhcpd.leases files should regenerate. If they are still incorrect, contact Penguin Computing for assistance.
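Because later entries overwrite earlier ones, the entry to review is the last one for a given MAC. A sketch of isolating it with grep, where /tmp/demo-dhcpd.leases and its single-line entry layout are simplified stand-ins for the generated file:

```shell
# Sketch: find the final entry for a MAC in dhcpd.leases, since later
# entries overwrite earlier ones. /tmp/demo-dhcpd.leases and its layout
# are illustrative stand-ins for
# /opt/scyld/clusterware-iscdhcpd/conf/dhcpd.leases.
LEASES=/tmp/demo-dhcpd.leases
cat > "$LEASES" <<'EOF'
host n0 { hardware ethernet 52:54:00:aa:bb:cc; fixed-address 10.10.0.10; }
host n0 { hardware ethernet 52:54:00:aa:bb:cc; fixed-address 10.10.0.42; }
EOF
MAC=52:54:00:aa:bb:cc
grep "hardware ethernet $MAC" "$LEASES" | tail -n 1
```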

  • OpenMPI versions 4.0 and 4.1 for RHEL/CentOS 8 require ucx version 1.9 or greater, which is available from CentOS 8 Stream and RHEL 8.4.

  • Slurm version 23.02.7 cannot be directly updated to Slurm version 25.05.4 while preserving the existing Slurm accounting database.

  • If two administrators use cw-modimg concurrently to modify two different images, one of them may see a message similar to:

    WARNING: Local cache contains inconsistencies.
    Use --clean-local to delete temporary files, untracked files,
    and remove missing files from the local manifest.
    

    If this warning appears, run cw-modimg --clean-local.

    However, only execute --clean-local after all cw-modimg image manipulations have completed.

  • Ensure that /etc/sudoers does not contain the line Defaults requiretty; that setting causes the ClusterWare DHCP service to misbehave.
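A quick check for that line can be sketched as below. /tmp/demo-sudoers is a stand-in for /etc/sudoers, which requires root to read and should only be edited with visudo.

```shell
# Sketch: detect an active "Defaults requiretty" line. /tmp/demo-sudoers
# stands in for /etc/sudoers (read the real file as root, edit via visudo).
SUDOERS=/tmp/demo-sudoers
printf 'Defaults    requiretty\n' > "$SUDOERS"
if grep -Eq '^[[:space:]]*Defaults[[:space:]]+requiretty' "$SUDOERS"; then
    echo "requiretty is set; comment it out with visudo"
fi
```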

  • The NetworkManager-config-server package includes a NetworkManager.conf config file with the "no-auto-default" setting enabled. That setting is incompatible with ClusterWare compute node images and causes nodes to lose network connectivity after their boot-time DHCP lease expires. Either disable the setting or remove the NetworkManager-config-server package from compute node images.
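One way to disable the setting without removing the package is a configuration drop-in that overrides it. The file name below is illustrative, and this assumes the usual NetworkManager behavior of later conf.d files overriding keys set by earlier ones:

```ini
# /etc/NetworkManager/conf.d/99-clusterware.conf (file name is illustrative)
# Overrides the "no-auto-default=*" shipped by NetworkManager-config-server
# so interfaces keep auto-created default connections under DHCP.
[main]
no-auto-default=
```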

  • The cw-clusterctl repos create command has a urls= argument that specifies where the new repo's contents can be found. The most common use is urls=http://<URL>. The alternative urls=file://<pathname> does not currently work. Instead, you must first manually create an http-accessible repo from that pathname. See Creating Local Repositories without Internet.

  • When moving a head node from one etcd-based cluster to another using the managedb join command, reboot the joining head once the join is complete.

  • If a new head node is failing to join an existing etcd-based cluster, check /var/log/clusterware/etcd.log and look for repeated lines of the form:

    <DATE> <SERVER> etcd: added member <HEX> [<URL>:52380] to cluster <HEX>
    

    If the log file contains several of these lines per join attempt, try running managedb recover on an existing head node and then rejoining the head nodes one at a time. Rejoining heads that were previously in the cluster may require the --purge argument, e.g. managedb join --purge <IP>.
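Counting those lines can be sketched with grep. The sample log contents below are illustrative, and /tmp/demo-etcd.log stands in for /var/log/clusterware/etcd.log:

```shell
# Sketch: count repeated "added member" lines, which indicate repeated join
# attempts. /tmp/demo-etcd.log and its contents stand in for
# /var/log/clusterware/etcd.log.
LOG=/tmp/demo-etcd.log
cat > "$LOG" <<'EOF'
Jan 01 00:00:01 head1 etcd: added member 8e9e05c52164694d [https://10.10.0.2:52380] to cluster cdf818194e3a8c32
Jan 01 00:00:09 head1 etcd: added member 9f1a16d63275705e [https://10.10.0.2:52380] to cluster cdf818194e3a8c32
EOF
grep -c 'added member' "$LOG"     # prints 2
```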

  • cw-install performs its early check for a newer clusterware-installer RPM by parsing the appropriate clusterware repo file (typically /etc/yum.repos.d/clusterware.repo) to find the first baseurl= line. If there are multiple such lines (for example, when specifying multiple ClusterWare repos), the cluster administrator should order the repos so that the repo containing the newest RPMs appears first in the file.
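The effect of repo ordering can be previewed with grep. The repo file and URLs below are illustrative stand-ins, assuming the standard yum baseurl= key:

```shell
# Sketch: show which baseurl= line appears first, i.e. the one cw-install
# would use. /tmp/demo-clusterware.repo and the URLs are illustrative
# stand-ins for /etc/yum.repos.d/clusterware.repo.
REPO=/tmp/demo-clusterware.repo
cat > "$REPO" <<'EOF'
[clusterware-new]
baseurl=https://repo.example.com/clusterware/13
[clusterware-old]
baseurl=https://repo.example.com/clusterware/12
EOF
grep -m 1 '^baseurl=' "$REPO"     # prints the clusterware/13 line
```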

  • A compute node using a version of clusterware-node older than 11.2.2 and booting from a head node that has upgraded to 11.7.0 or newer may not successfully send its status to the head node. Upgrade the clusterware-node package inside the image to resolve this problem.

  • Joining a ClusterWare 11 head node to a ClusterWare 12 head node will perform the join, but will not update the joining head node to ClusterWare 12. We recommend updating the ClusterWare 11 node to 12 prior to performing the join. See Update ClusterWare 11 to Later Major Versions for guidance about performing this update.

  • Updating from 12.0.1 or earlier to 12.1.0 requires reconfiguring the Influx/Telegraf monitoring stack. Update the necessary config files by running /opt/scyld/clusterware/bin/influx_grafana_setup --tele-env, then systemctl restart telegraf. All data persists through the upgrade.

  • When using Ignition to create partitions, the partition specified with _disk_root is formatted with the ext4 file system even if the Ignition file specifies another file system, such as XFS.

  • When booting images based on more recent Linux distributions, such as Ubuntu 24.04 (Noble Numbat), the compute node may fail to boot with an "Image too large to unpack into /sysroot" message displayed on the system terminal. This is caused by a newer version of the umount command. Upgrade to ClusterWare version 13.0 or later, or contact Penguin Computing for a patch that can be applied to earlier versions.

  • After updating from ClusterWare 11 to ClusterWare 12, head nodes cannot read the status of some compute nodes. This issue only affects locally installed systems (systems that are not PXE booting), or PXE-booted systems where the clusterware-node package was upgraded in place rather than upgrading the package inside the image and rebooting the node when idle. To resolve this issue manually, update the /etc/clusterware/runtime.sh file on the affected compute nodes by adding curl with a trailing space to the beginning of the CW_CURL_ARGS variable value.
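The runtime.sh edit can be sketched with sed. /tmp/demo-runtime.sh and the "--retry 3" variable contents are illustrative stand-ins for the real file on an affected node:

```shell
# Sketch: prepend "curl " (note the trailing space) to the CW_CURL_ARGS
# value. /tmp/demo-runtime.sh and the "--retry 3" contents stand in for
# /etc/clusterware/runtime.sh on an affected compute node.
RUNTIME=/tmp/demo-runtime.sh
printf 'CW_CURL_ARGS="--retry 3"\n' > "$RUNTIME"
sed -i 's/^CW_CURL_ARGS="/CW_CURL_ARGS="curl /' "$RUNTIME"
cat "$RUNTIME"                    # CW_CURL_ARGS="curl --retry 3"
```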