Managing Node Failures
In a large cluster the failure of individual compute nodes should be anticipated and planned for. Since many compute nodes are diskless, recovery should be relatively simple, consisting of rebooting the node once any hardware faults have been addressed. Disked nodes may require additional steps depending on the importance of the data on disk. Refer to your operating system documentation for details.
A compute node failure can unexpectedly terminate a long-running computation involving that node. We strongly encourage authors of such programs to use techniques such as application checkpointing so that computations can be resumed with minimal loss.
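As a minimal illustration of the checkpointing idea, the sketch below saves its progress to a file after every step and resumes from that file if it is restarted. The checkpoint file name and the per-step granularity are arbitrary choices for this example; a real application would checkpoint to shared storage at an interval appropriate to the cost of each step.

```shell
# Minimal checkpoint/resume sketch: sum the integers 1..10, saving
# progress after every step so an interrupted run can resume where it
# left off. The checkpoint path is illustrative only.
CKPT=./sum.ckpt

# Resume from the checkpoint if one exists, otherwise start fresh.
if [ -f "$CKPT" ]; then
    read -r i total < "$CKPT"
else
    i=0; total=0
fi

while [ "$i" -lt 10 ]; do
    i=$((i + 1))
    total=$((total + i))
    # Write the checkpoint atomically so a crash never leaves a torn file.
    printf '%s %s\n' "$i" "$total" > "$CKPT.tmp"
    mv "$CKPT.tmp" "$CKPT"
done

echo "$total"        # 55
rm -f "$CKPT"        # computation finished; discard the checkpoint
```

Killing the loop partway through and rerunning the script picks up from the last completed step rather than restarting from zero, which is the behavior a checkpointed cluster job needs after a node failure.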
Tip
If a booting compute node downloads the .efi binary from the
ICE ClusterWare™ head node and then fails with an "Access Denied" or "Permission
Denied" error, check whether Secure Boot is enabled on the node. The ClusterWare
software is not currently compatible with Secure Boot. See
Securing the Cluster for ClusterWare security features and options.
Replacing Failed Nodes
Because nodes are identified by their MAC addresses, replacing a node in the database is relatively simple. If the node (n23 in the following example) was repaired but still uses the same network interface, then no changes are necessary. However, if the network card itself failed and was replaced, the node's MAC address can be updated with a single command:
cw-nodectl -i n23 update mac=44:22:33:44:55:66
If the entire node was replaced, you can use the _no_boot attribute to
temporarily remove the node from ClusterWare. You can also update the
description of the node with details such as the RMA number or anticipated
replacement timeline. When the new node arrives, run the command above to
modify the MAC address to match the replacement and then remove the _no_boot
attribute to rejoin the node to the ClusterWare cluster.
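As an illustrative sketch, the temporary-removal workflow above might look like the following. The `_no_boot=1` value, the clearing form, and the `description=` parameter are assumptions extrapolated from the `update mac=...` syntax shown earlier, not confirmed commands; verify the exact attribute syntax against your ClusterWare release before use.

```shell
# Hypothetical sketch of the temporary-removal workflow described above.
# Attribute/value syntax is assumed from the "update mac=..." form shown
# earlier; verify against your ClusterWare documentation before use.

# Take the failed node out of the boot rotation and record why.
cw-nodectl -i n23 update _no_boot=1 \
    description="awaiting RMA; replacement expected"

# ...after the replacement hardware arrives:
cw-nodectl -i n23 update mac=44:22:33:44:55:66   # point n23 at the new NIC
cw-nodectl -i n23 update _no_boot=               # clear the attribute to rejoin
```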
If the entire node was replaced, then you may prefer to clear the node status and any history associated with that node instead of just updating the MAC address. To do this, delete and recreate the failed node:
cw-nodectl -i n23 delete
cw-nodectl create index=23 mac=44:22:33:44:55:66
Note
Deleting a node from the ClusterWare cluster removes all of the configuration and settings for the node.
If a message similar to No free leases appears in the ClusterWare DHCP log
after replacing a node, identify the root cause by reviewing the following:
1. Confirm the MAC address is assigned to the node you expect by running
cw-nodectl -i <MAC address> ls. You can also verify that the MAC address seen by DHCP matches your expectations.
2. Confirm the DHCP request is arriving on the expected interface. The network interface on the head node may have been configured with a different network than the IP that was supposed to be handed out. If so, correct the wiring or the IP on the affected interface.
3. Make sure ClusterWare is aware of the network the IP belongs to. Run
cw-clusterctl nets ls -l to confirm that the network containing the IP you expect the node to get is shown. You can correct the network definition using the cw-clusterctl nets command.
4. If the previous checks all pass, examine the generated
dhcpd.conf and dhcpd.leases files located in /opt/scyld/clusterware-iscdhcpd/conf. The dhcpd.conf file should contain a subnet definition that matches your network range. The dhcpd.leases file should contain an entry pairing the correct MAC (the hardware ethernet statement) with the correct IP address (the fixed-address statement). If the MAC address appears multiple times in the file, review the final entry, because later entries override earlier ones. If either file is wrong, re-check your settings per the earlier steps, then restart the clusterware service with systemctl restart clusterware. Wait a few seconds and the dhcpd.conf and dhcpd.leases files should regenerate. If they are still incorrect, contact Penguin Computing for assistance.
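The checks above can be gathered into a short diagnostic sequence. This is an illustrative sketch built only from the commands and paths named in the steps; the MAC address is the example value used earlier, and the grep patterns are placeholders to adjust for your site.

```shell
# Illustrative DHCP troubleshooting sequence for a replaced node.
# MAC value and file paths come from the steps above; adjust to your site.
MAC=44:22:33:44:55:66
CONF_DIR=/opt/scyld/clusterware-iscdhcpd/conf

# 1. Is the MAC assigned to the node you expect?
cw-nodectl -i "$MAC" ls

# 2/3. Is the network containing the node's IP known to ClusterWare?
cw-clusterctl nets ls -l

# 4. Do the generated DHCP files agree? Later entries override earlier
#    ones, so the last match for the MAC is the one that matters.
grep -n "subnet" "$CONF_DIR/dhcpd.conf"
grep -n -A 2 "hardware ethernet $MAC" "$CONF_DIR/dhcpd.leases"

# If either file looks wrong, restart the service and let them regenerate.
systemctl restart clusterware
sleep 5
```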