Managing Node Failures
In a large cluster the failure of individual compute nodes should be anticipated and planned for. Since many compute nodes are diskless, recovery should be relatively simple, consisting of rebooting the node once any hardware faults have been addressed. Disked nodes may require additional steps depending on the importance of the data on disk. Refer to your operating system documentation for details.
A compute node failure can unexpectedly terminate a long-running computation involving that node. We strongly encourage authors of such programs to use techniques such as application checkpointing so that computations can be resumed with minimal loss.
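As a minimal illustration of the checkpointing idea, the sketch below saves its progress to a file after every step and resumes from that file if it is restarted. The checkpoint file name and the per-step granularity are arbitrary choices for this example; a real application would checkpoint to shared storage at an interval appropriate to the cost of each step.

```shell
# Minimal checkpoint/resume sketch: sum the integers 1..10, saving
# progress after every step so an interrupted run can resume where it
# left off. The checkpoint path is illustrative only.
CKPT=./sum.ckpt

# Resume from the checkpoint if one exists, otherwise start fresh.
if [ -f "$CKPT" ]; then
    read -r i total < "$CKPT"
else
    i=0; total=0
fi

while [ "$i" -lt 10 ]; do
    i=$((i + 1))
    total=$((total + i))
    # Write the checkpoint atomically so a crash never leaves a torn file.
    printf '%s %s\n' "$i" "$total" > "$CKPT.tmp"
    mv "$CKPT.tmp" "$CKPT"
done

echo "$total"        # 55
rm -f "$CKPT"        # computation finished; discard the checkpoint
```

Killing the loop partway through and rerunning the script picks up from the last completed step rather than restarting from zero, which is the behavior a checkpointed cluster job needs after a node failure.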
Tip
If a booting compute node downloads the .efi binary from the
ICE ClusterWare™ head node and then fails with an "Access Denied" or "Permission
Denied" error, check whether Secure Boot is enabled on the node. The ClusterWare
software is not currently compatible with Secure Boot. See
Securing the Cluster for ClusterWare security features and options.
Replacing Failed Nodes
Because nodes are identified by their MAC addresses, replacing a node in the database is relatively simple. If the node (n23 in the following example) was repaired but still uses the same network interface, then no changes are necessary. However, if the network card itself failed and was replaced, the node's MAC address can be updated with a single command:
cw-nodectl -i n23 update mac=44:22:33:44:55:66
If the entire node was replaced, you can use the _no_boot attribute to
temporarily remove the node from ClusterWare. You can also update the
description of the node with details such as the RMA number or anticipated
replacement timeline. When the new node arrives, run the command above to
modify the MAC address to match the replacement and then remove the _no_boot
attribute to rejoin the node to the ClusterWare cluster.
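As an illustrative sketch, the temporary-removal workflow above might look like the following. The `_no_boot=1` value, the clearing form, and the `description=` parameter are assumptions extrapolated from the `update mac=...` syntax shown earlier, not confirmed commands; verify the exact attribute syntax against your ClusterWare release before use.

```shell
# Hypothetical sketch of the temporary-removal workflow described above.
# Attribute/value syntax is assumed from the "update mac=..." form shown
# earlier; verify against your ClusterWare documentation before use.

# Take the failed node out of the boot rotation and record why.
cw-nodectl -i n23 update _no_boot=1 \
    description="awaiting RMA; replacement expected"

# ...after the replacement hardware arrives:
cw-nodectl -i n23 update mac=44:22:33:44:55:66   # point n23 at the new NIC
cw-nodectl -i n23 update _no_boot=               # clear the attribute to rejoin
```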
If the entire node was replaced, then you may prefer to clear the node status and any history associated with that node instead of just updating the MAC address. To do this, delete and recreate the failed node:
cw-nodectl -i n23 delete
cw-nodectl create index=23 mac=44:22:33:44:55:66
Note
Deleting a node from the ClusterWare cluster removes all of the configuration and settings for the node.
If a message similar to No free leases appears in the ClusterWare DHCP log
after replacing a node, identify the root cause by reviewing the following:
1. Confirm the MAC address is assigned to the node you expect by running
cw-nodectl -i <MAC address> ls. You can also verify that the MAC address seen by DHCP matches your expectations.
2. Confirm the DHCP request is arriving on the expected interface. The network interface on the head node may have been configured with a different network than the IP that was supposed to be handed out. If so, correct the wiring or the IP on the affected interface.
3. Make sure ClusterWare is aware of the network the IP belongs to. Run
cw-clusterctl nets ls -l to confirm that the network containing the IP you expect the node to get is shown. You can correct the network definition using the cw-clusterctl nets command.
4. If the previous checks all pass, examine the generated
dhcpd.conf and dhcpd.leases files located in /opt/scyld/clusterware-iscdhcpd/conf. The dhcpd.conf file should contain a subnet definition that matches your network range. The dhcpd.leases file should contain an entry pairing the correct MAC (the hardware ethernet statement) with the correct IP address (the fixed-address statement). If the MAC address appears multiple times in the file, review the final entry, because later entries override earlier ones. If either file is wrong, re-check your settings per the earlier steps, then restart the clusterware service with systemctl restart clusterware. Wait a few seconds and the dhcpd.conf and dhcpd.leases files should regenerate. If they are still incorrect, contact Penguin Computing for assistance.
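The checks above can be gathered into a short diagnostic sequence. This is an illustrative sketch built only from the commands and paths named in the steps; the MAC address is the example value used earlier, and the grep patterns are placeholders to adjust for your site.

```shell
# Illustrative DHCP troubleshooting sequence for a replaced node.
# MAC value and file paths come from the steps above; adjust to your site.
MAC=44:22:33:44:55:66
CONF_DIR=/opt/scyld/clusterware-iscdhcpd/conf

# 1. Is the MAC assigned to the node you expect?
cw-nodectl -i "$MAC" ls

# 2/3. Is the network containing the node's IP known to ClusterWare?
cw-clusterctl nets ls -l

# 4. Do the generated DHCP files agree? Later entries override earlier
#    ones, so the last match for the MAC is the one that matters.
grep -n "subnet" "$CONF_DIR/dhcpd.conf"
grep -n -A 2 "hardware ethernet $MAC" "$CONF_DIR/dhcpd.leases"

# If either file looks wrong, restart the service and let them regenerate.
systemctl restart clusterware
sleep 5
```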