Managing Node Failures
In a large cluster the failure of individual compute nodes should be anticipated and planned for. Since many compute nodes are diskless, recovery should be relatively simple, consisting of rebooting the node once any hardware faults have been addressed. Disked nodes may require additional steps depending on the importance of the data on disk. Refer to your operating system documentation for details.
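For example, a diskless node can usually be rebooted directly from the head node, assuming power control (IPMI/BMC) is configured for that node and your ClusterWare release supports these subcommands (see scyld-nodectl --help):

scyld-nodectl -i n23 reboot
scyld-nodectl -i n23 status

The status command can then confirm that the node has come back up.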
A compute node failure can unexpectedly terminate a long-running computation involving that node. We strongly encourage authors of such programs to use techniques such as application checkpointing to ensure that computations can be resumed with minimal loss.
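For example, a program that does not implement its own checkpointing can sometimes be wrapped with a third-party checkpoint/restart tool such as DMTCP. The commands below are illustrative only and are not part of ClusterWare; the application name and checkpoint interval are placeholders:

dmtcp_launch --interval 3600 ./my_simulation

After a failure, the job can typically be resumed from the most recent checkpoint using the restart script that DMTCP writes to the working directory:

./dmtcp_restart_script.sh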
Replacing Failed Nodes
Since nodes are identified by their MAC addresses, replacing a node in the database is relatively simple. If the node (n23 in the following example) was repaired and the same network interface is still in use, then no changes are necessary. However, if the network card itself failed and was replaced, then the node's MAC address can be updated with a single command:
scyld-nodectl -i n23 update mac=44:22:33:44:55:66
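You can verify the change by listing the node's configuration; the exact fields shown depend on your ClusterWare release:

scyld-nodectl -i n23 ls -l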
If the entire node was replaced, you can use the _no_boot attribute to temporarily remove the node from ICE ClusterWare™. You can also update the node's description with details such as the RMA number or the anticipated replacement timeline. When the replacement node arrives, run the command above to update the MAC address to match the new hardware, then remove the _no_boot attribute so that the node rejoins the ClusterWare cluster.
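A sketch of that workflow, assuming the set-attribs and clear-attribs subcommands and a description field are available in your ClusterWare release (check scyld-nodectl --help for the exact syntax; the description text and MAC address are illustrative):

scyld-nodectl -i n23 set-attribs _no_boot=1
scyld-nodectl -i n23 update description="RMA pending, replacement expected next week"

Once the replacement hardware is installed:

scyld-nodectl -i n23 update mac=44:22:33:44:55:66
scyld-nodectl -i n23 clear-attribs _no_boot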
If the entire node was replaced, you may instead prefer to clear the node's status and any history associated with it rather than just updating the MAC address. To do this, delete and recreate the failed node:
scyld-nodectl -i n23 delete
scyld-nodectl create index=23 mac=44:22:33:44:55:66
Note
Deleting a node from the ClusterWare cluster removes all of the configuration and settings for the node.
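If you want to keep a record of those settings, one option is to capture the node's current configuration to a file before deleting it (the output format depends on your release):

scyld-nodectl -i n23 ls -l > n23-before-delete.txt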