Failing PXE Network Boot

If a compute node fails to join the cluster when booted via PXE network boot, there are several places to look, as discussed below.

Rule out physical problems. Check for disconnected Ethernet cables, malfunctioning network equipment, etc.

Confirm the node's MAC is in the database. Search for the node by MAC address to confirm it is registered with the ICE ClusterWare™ system:

scyld-nodectl -i 00:11:22:33:44:55 ls -l

Check the system logs. Specifically, look for the node's MAC address in the api_error_log and head_*.log files. These files contain AUDIT statements whenever a compute node boots, for example:

Booting node (MAC=08:00:27:f0:44:35) as iscsi using boot config b7412619fe28424ebe1f7c5f3474009d.

Booting node (MAC=52:54:00:a6:f3:3c) as rwram using boot config f72edc4388964cd9919346dfeb21cd2c.

If there are no "Booting node" log statements, the failure is most likely happening at the DHCP stage, and the head nodes' isc-dhcpd.log files may contain useful information.
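The log search described above can be scripted. The following is a minimal sketch, not part of the ClusterWare tooling: the helper name find_boot_lines is hypothetical, and the default log directory /var/log/clusterware is an assumption that should be adjusted (via LOG_DIR) to wherever your head node keeps these files.

```shell
# Hypothetical helper: search head node logs for "Booting node" AUDIT lines
# that mention a given MAC address.
# Assumption: logs live under /var/log/clusterware; override with LOG_DIR.
find_boot_lines() {
    local mac="$1"
    local log_dir="${LOG_DIR:-/var/log/clusterware}"
    # -i: match the MAC in either case; -H: prefix each match with its file name;
    # -F: treat the pattern as a fixed string, not a regular expression.
    grep -iHF "Booting node (MAC=${mac})" \
        "${log_dir}"/api_error_log "${log_dir}"/head_*.log 2>/dev/null
}
```

For example, `find_boot_lines 08:00:27:f0:44:35` prints any matching boot lines together with the file they came from. No output suggests the node never got past the DHCP stage.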

As a last resort, use the Linux tcpdump utility to check whether the head node is seeing the compute node's DHCP requests or whether another server is answering them. The following example shows a correct dialog between compute node 0 (10.10.100.100) and the head node.

[root@cluster ~]# tcpdump -i eth1 -c 10
Listening on eth1, link-type EN10MB (Ethernet),
        capture size 96 bytes
18:22:07.901571 IP master.bootpc > 255.255.255.255.bootps:
        BOOTP/DHCP, Request from .0, length: 548
18:22:07.902579 IP .-1.bootps > 255.255.255.255.bootpc:
        BOOTP/DHCP, Reply, length: 430
18:22:09.974536 IP master.bootpc > 255.255.255.255.bootps:
        BOOTP/DHCP, Request from .0, length: 548
18:22:09.974882 IP .-1.bootps > 255.255.255.255.bootpc:
        BOOTP/DHCP, Reply, length: 430
18:22:09.977268 arp who-has .-1 tell 10.10.100.100
18:22:09.977285 arp reply .-1 is-at 00:0c:29:3b:4e:50
18:22:09.977565 IP 10.10.100.100.2070 > .-1.tftp:  32 RRQ
        "bootimg::loader" octet tsize 0
18:22:09.978299 IP .-1.32772 > 10.10.100.100.2070:
        UDP, length 14
10 packets captured
32 packets received by filter
0 packets dropped by kernel
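On a busy network it can help to restrict the capture to DHCP/BOOTP traffic. The sketch below builds such a tcpdump command; the helper name dhcp_capture_cmd is hypothetical, and the interface name is whatever faces your compute nodes (eth1 in the example above). It only prints the command so it can be reviewed before being run as root. Note that this filter excludes the ARP and TFTP packets seen in the full capture.

```shell
# Hypothetical helper: print a tcpdump command that captures only DHCP/BOOTP
# traffic (UDP ports 67 and 68), optionally limited to one node's MAC address.
dhcp_capture_cmd() {
    local iface="$1" mac="${2:-}"
    local filter="udp and (port 67 or port 68)"
    if [ -n "$mac" ]; then
        filter="ether host ${mac} and ${filter}"
    fi
    # -n: skip name resolution; -e: print link-level headers, so the MAC
    # address of whichever server replies is visible.
    printf "tcpdump -i %s -n -e '%s'\n" "$iface" "$filter"
}
```

For example, `dhcp_capture_cmd eth1 00:11:22:33:44:55` prints a command that shows only that node's DHCP exchange. A reply from an unexpected source MAC address points to another DHCP server answering.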

Verify that ClusterWare services are running. Check the status of the ClusterWare services with the commands:

systemctl status clusterware
systemctl status clusterware-dhcpd
systemctl status clusterware-dnsmasq

Restart clusterware services from the command line using:

sudo systemctl restart clusterware
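The three status checks can also be combined into one pass, which is convenient after a restart. This is a minimal sketch using the service names listed above and the standard `systemctl is-active` subcommand; the helper name check_cw_services is hypothetical.

```shell
# Hypothetical helper: report whether each ClusterWare service named above
# is active. Returns non-zero if any service is not active.
check_cw_services() {
    local svc rc=0
    for svc in clusterware clusterware-dhcpd clusterware-dnsmasq; do
        if systemctl is-active --quiet "$svc"; then
            echo "$svc: active"
        else
            echo "$svc: NOT active"
            rc=1
        fi
    done
    return "$rc"
}
```

Running `check_cw_services` prints one line per service. For any service reported as not active, `systemctl status` on that service shows the details.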

Check the switch configuration. If the compute nodes fail to boot immediately on power-up but successfully boot later, the problem may lie with the configuration of a managed switch.

Some Ethernet switches delay forwarding packets for approximately one minute after a link is established while they verify that no network loop has been created (the "spanning tree" check). This delay is longer than the PXE boot timeout on some servers.

To avoid this delay, disable the spanning tree check on the switch. The setting is typically named "fast link enable".