Known Issues And Workarounds
The following are known issues of significance with the latest ICE ClusterWare™ version and suggested workarounds.
Scyld OpenMPI versions 4.0 and 4.1 for RHEL/CentOS 8 require ucx version 1.9 or greater, which is available from CentOS 8 Stream and RHEL 8.4.
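One way to confirm the UCX requirement above on a node or inside an image is to compare the installed package version against 1.9. The following is a minimal sketch, assuming the package is named ucx in the RPM database and that GNU sort (with -V) is available. The helper name version_ge is illustrative only and is not part of ClusterWare.

```shell
#!/bin/sh
# version_ge A B: succeed if version A is greater than or equal to version B.
# Relies on GNU "sort -V" (version sort); this is an assumption about the host.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: the installed package is named "ucx".
if command -v rpm >/dev/null 2>&1 &&
   installed=$(rpm -q --queryformat '%{VERSION}' ucx 2>/dev/null); then
    if version_ge "$installed" 1.9; then
        echo "ucx $installed satisfies the >= 1.9 requirement"
    else
        echo "ucx $installed is older than 1.9"
    fi
else
    echo "ucx does not appear to be installed (or rpm is unavailable)"
fi
```

The comparison is kept in a small function so it can be reused for other version checks.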
When using a TORQUE or Slurm job scheduler (see Job Schedulers), if a node reboots and its image was not created using /opt/scyld/clusterware-tools/bin/sched-helper, then the cluster administrator must manually restart the job scheduler. For example, for a single node n0, run NODE=n0 torque-scyld-node or NODE=n0 slurm-scyld-node. To restart on all nodes, run torque-scyld.setup cluster-restart or slurm-scyld.setup cluster-restart. Ideally, compute node images are updated using torque-scyld.setup update-image or slurm-scyld.setup update-image, which installs the TORQUE/Slurm config file in the image and enables the appropriate service at node startup.
If administrators are using scyld-modimg to concurrently modify two different images, then one administrator will see a message of the form: WARNING: Local cache contains inconsistencies. Use --clean-local to delete temporary files, untracked files, and remove missing files from the local manifest. If this occurs, use scyld-modimg --clean-local. However, only execute --clean-local after all scyld-modimg image manipulations have completed.
Ensure that /etc/sudoers does not contain the line Defaults requiretty; otherwise, DHCP misbehaves.
The NetworkManager-config-server package includes a NetworkManager.conf config file with an enabled "no-auto-default" setting. That setting is incompatible with ClusterWare compute node images and will cause nodes to lose network connectivity after their boot-time DHCP lease expires. Either disable that setting or remove the NetworkManager-config-server package from compute node images.
The scyld-clusterctl repos create command has a urls= argument that specifies where the new repo's contents can be found. The most common use is urls=http://<URL>. The alternative urls=file://<pathname> does not currently work. Instead, you must first manually create an http-accessible repo from that pathname. See Creating Local Repositories without Internet.
When moving a head node from one etcd-based cluster to another using the managedb join command, reboot the joining head once the join is complete.
If a new head node is failing to join an existing etcd-based cluster, check /var/log/clusterware/etcd.log and look for repeated lines of the form: <DATE> <SERVER> etcd: added member <HEX> [<URL>:52380] to cluster <HEX>. If the log file contains multiple such lines per join attempt, then try running managedb recover on an existing head node and joining all head nodes back into the cluster one at a time. Re-joining heads that were previously in the cluster may require a --purge argument, i.e., managedb join --purge <IP>
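The etcd log check described above for a failing join can be scripted. This is a minimal sketch, assuming the log lines contain the literal text "etcd: added member" exactly as in the form shown. The function name count_added_member is hypothetical and is not a ClusterWare tool.

```shell
#!/bin/sh
# count_added_member FILE: print the number of lines in FILE that match
# the "added member" form. grep -c prints 0 when nothing matches.
count_added_member() {
    grep -c 'etcd: added member' "$1"
}

log=/var/log/clusterware/etcd.log
if [ -r "$log" ]; then
    echo "added-member lines in $log: $(count_added_member "$log")"
fi
```

A count noticeably higher than the number of join attempts suggests the repeated-line situation described above. The script only reports the count and does not run managedb itself.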
scyld-install performs its early check to determine if a newer clusterware-installer RPM is available by parsing the appropriate clusterware repo file (typically /etc/yum.repos.d/clusterware.repo) to find the first base_url= line. If there are multiple such lines, i.e., specifying multiple ClusterWare repos, then the cluster administrator should order the repos so that the repo containing the newest RPMs is the first repo in the file.
A compute node using a version of clusterware-node older than 11.2.2 and booting from a head node that has been upgraded to 11.7.0 or newer may not successfully send its status to the head node. Upgrade the clusterware-node package inside the image to resolve this problem.
Joining a ClusterWare 11 head node to a ClusterWare 12 head node will perform the join, but will not update the joining head node to ClusterWare 12. We recommend updating the ClusterWare 11 node to 12 prior to performing the join. See Updating ClusterWare 11 to ClusterWare 12 for guidance about performing this update.
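For the scyld-install repo-ordering issue described above, it can help to see which URL appears first in the repo file. The following is a minimal sketch and does not reproduce the actual parsing logic inside scyld-install. The function name first_repo_url is hypothetical, and the pattern accepts both the base_url= spelling used above and the baseurl= spelling commonly found in yum repo files, which is an assumption about the file's contents.

```shell
#!/bin/sh
# first_repo_url FILE: print the value of the first base_url=/baseurl= line.
# Assumption: either spelling of the key may appear in the repo file.
first_repo_url() {
    grep -m 1 -E '^[[:space:]]*base_?url[[:space:]]*=' "$1" |
        sed -E 's/^[^=]*=[[:space:]]*//'
}

repo=/etc/yum.repos.d/clusterware.repo
if [ -r "$repo" ]; then
    echo "first repo URL in $repo: $(first_repo_url "$repo")"
fi
```

If the printed URL is not the repo that holds the newest RPMs, reordering the repo sections in the file as described above is the suggested remedy.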
Updating from 12.0.1 or earlier to 12.1.0 requires reconfiguring the Influx/Telegraf monitoring stack. To update the necessary config files, run /opt/scyld/clusterware/bin/influx_grafana_setup --tele-env, followed by systemctl restart telegraf. All data will persist through the upgrade.
When using ignition to create partitions, the partition specified with _disk_root will be formatted with the ext4 file system even if the ignition file specifies another file system such as XFS.