Known Issues And Workarounds#

The following are significant known issues in the latest ICE ClusterWare™ release, together with suggested workarounds.

  • Scyld OpenMPI versions 4.0 and 4.1 for RHEL/CentOS 8 require ucx version 1.9 or greater, which is available from CentOS 8 Stream and RHEL 8.4.

  • When using a TORQUE or Slurm job scheduler (see Job Schedulers), if a node reboots whose image was not created using /opt/scyld/clusterware-tools/bin/sched-helper, then the cluster administrator must manually restart the job scheduler.

    For example, if needed for a single node n0: NODE=n0 torque-scyld-node or NODE=n0 slurm-scyld-node. Or to restart on all nodes: torque-scyld.setup cluster-restart or slurm-scyld.setup cluster-restart.

    Ideally, compute node images are updated using torque-scyld.setup update-image or slurm-scyld.setup update-image, which installs the TORQUE/Slurm config file in the image and enables the appropriate service at node startup.

  • If two administrators use scyld-modimg to modify two different images concurrently, then one of them may see a message of the form:

    WARNING: Local cache contains inconsistencies.
    Use --clean-local to delete temporary files, untracked files,
    and remove missing files from the local manifest.
    

    If this message appears, run scyld-modimg --clean-local.

    However, only execute --clean-local after all scyld-modimg image manipulations have completed.
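
    One way to avoid triggering the inconsistency in the first place is to serialize image edits through a shared lock. The sketch below is only illustrative: the lock file path is an assumption, and an echo stands in for a real scyld-modimg invocation.

```shell
# Serialize image modifications with an exclusive flock(1) lock so that
# two administrators cannot run scyld-modimg at the same time.
# NOTE: the lock file path is an assumption, and the echo below is a
# placeholder for a real scyld-modimg command.
lock=/tmp/scyld-modimg.lock
flock "$lock" sh -c 'echo "exclusive image edit in progress"'
```

    Any second invocation using the same lock file blocks until the first completes, so the local cache is only ever touched by one scyld-modimg run at a time.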

  • Ensure that /etc/sudoers does not contain the line Defaults requiretty; otherwise, DHCP misbehaves.
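
    A quick check for the problematic line is sketched below; the grep pattern is an assumption, and the check runs against a sample file rather than the real /etc/sudoers.

```shell
# Check a sudoers-style file for an active "Defaults requiretty" line.
# A sample file stands in for /etc/sudoers here.
sudoers_file=sample_sudoers
cat > "$sudoers_file" <<'EOF'
Defaults !visiblepw
Defaults requiretty
EOF

if grep -Eq '^[[:space:]]*Defaults[[:space:]]+requiretty' "$sudoers_file"; then
    echo "requiretty is enabled: remove or comment out that line"
fi
```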

  • The NetworkManager-config-server package includes a NetworkManager.conf config file with an enabled "no-auto-default" setting. That setting is incompatible with ClusterWare compute node images and will cause nodes to lose network connectivity after their boot-time DHCP lease expires. Either disable the setting or remove the NetworkManager-config-server package from compute node images.
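
    Disabling the setting can be done by commenting it out, as in the sketch below; the exact path of the config file shipped by NetworkManager-config-server varies by distribution, so a sample file stands in for the real one here.

```shell
# Comment out an enabled "no-auto-default" setting so DHCP-managed
# interfaces keep renewing their leases.  A sample file stands in for
# the NetworkManager.conf shipped by NetworkManager-config-server.
nm_conf=sample_NetworkManager.conf
cat > "$nm_conf" <<'EOF'
[main]
no-auto-default=*
EOF

sed -i 's/^no-auto-default=/#no-auto-default=/' "$nm_conf"
grep '^#no-auto-default=' "$nm_conf"
```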

  • The scyld-clusterctl repos create command has a urls= argument that specifies where the new repo's contents can be found. The most common use is urls=http://<URL>. The alternative urls=file://<pathname> does not currently work. Instead, you must first manually create an http-accessible repo from that pathname. See Creating Local Repositories without Internet.

  • When moving a head node from one etcd-based cluster to another using the managedb join command, reboot the joining head node once the join is complete.

  • If a new head node is failing to join an existing etcd-based cluster, check /var/log/clusterware/etcd.log and look for repeated lines of the form:

    <DATE> <SERVER> etcd: added member <HEX> [<URL>:52380] to cluster <HEX>
    

    If the log file contains multiple of these lines per join attempt, then try running managedb recover on an existing head node and joining all head nodes back into the cluster one at a time. Re-joining heads that were previously in the cluster may require a --purge argument, e.g., managedb join --purge <IP>.
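
    Counting the repeated lines can be sketched as follows; a sample log stands in for /var/log/clusterware/etcd.log, and the timestamps, hex IDs, and URLs in it are placeholders.

```shell
# Count "added member" lines, the symptom described above.  A sample
# log stands in for /var/log/clusterware/etcd.log; the timestamps,
# hex IDs, and URLs below are placeholders.
etcd_log=sample_etcd.log
cat > "$etcd_log" <<'EOF'
Jan 01 head1 etcd: added member 5f4a2b [https://10.54.0.2:52380] to cluster 9c1d3e
Jan 01 head1 etcd: added member 7e9c11 [https://10.54.0.2:52380] to cluster 9c1d3e
EOF

count=$(grep -c 'etcd: added member' "$etcd_log")
echo "added-member lines: $count"
```

    A count greater than one per join attempt is the signal to fall back to managedb recover as described above.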

  • scyld-install performs its early check for a newer clusterware-installer RPM by parsing the appropriate clusterware repo file (typically /etc/yum.repos.d/clusterware.repo) to find the first baseurl= line. If the file contains multiple such lines, i.e., when multiple ClusterWare repos are specified, then the cluster administrator should order the repos so that the repo containing the newest RPMs appears first in the file.
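
    The "first line wins" behavior can be illustrated with a short sketch; the repo names and URLs are placeholders, and the parse shown (taking the first baseurl= line, the standard dnf key spelling) is only an approximation of what scyld-install does.

```shell
# Illustrate that only the FIRST baseurl= line is consulted.
# The repo names and URLs below are placeholders.
repo_file=sample_clusterware.repo
cat > "$repo_file" <<'EOF'
[clusterware]
baseurl=https://example.com/clusterware/12/
[clusterware-old]
baseurl=https://example.com/clusterware/11/
EOF

first_url=$(grep '^baseurl=' "$repo_file" | head -n 1 | cut -d= -f2-)
echo "installer will check: $first_url"
```

    If the two stanzas were swapped, the older repo would be consulted instead, which is why the repo with the newest RPMs must come first.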

  • A compute node using a version of clusterware-node older than 11.2.2 and booting from a head node that has upgraded to 11.7.0 or newer may not successfully send its status to the head node. Please upgrade the clusterware-node package inside the image to resolve this problem.

  • Joining a ClusterWare 11 head node to a ClusterWare 12 head node will perform the join, but will not update the joining head node to ClusterWare 12. We recommend updating the ClusterWare 11 node to 12 prior to performing the join. See Updating ClusterWare 11 to ClusterWare 12 for guidance about performing this update.

  • Updating from 12.0.1 or earlier to 12.1.0 requires reconfiguring the Influx/Telegraf monitoring stack. Run /opt/scyld/clusterware/bin/influx_grafana_setup --tele-env to update the necessary config files, then systemctl restart telegraf. All data persists through the upgrade.

  • When using ignition to create partitions, the partition specified with _disk_root will be formatted with the ext4 file system even if the ignition file specifies another file system, such as XFS.