Command-Line Monitoring of Nodes#
The ICE ClusterWare™ platform provides two primary methods to monitor cluster
performance and health: the command line scyld-nodectl status
tool
and a more extensive graphical user interface (see Grafana Telemetry Dashboard).
More basic node status can be obtained through the scyld-nodectl
command. For example, a cluster administrator can view the status of
all nodes in the cluster:
# Terse status:
[admin@virthead]$ scyld-nodectl status
n[0] up
n[1] down
n[2] new
# Verbose status:
[admin@virthead]$ scyld-nodectl status --long
Nodes
n0
ip: 10.10.24.100
last_modified: 2019-04-16 05:02:26 UTC (0:00:02 ago)
state: up
uptime: 143729.68
n1
down_reason: boot timeout
ip: 10.10.42.102
last_modified: 2019-04-15 09:03:20 UTC (19:59:08 ago)
last_uptime: 59.61
state: down
n2: {}
From this sample output we can see that n0 is up and has recently (2 seconds earlier) sent status information back to the head node. This status information is sent by each compute node to its parent head node once every 10 seconds, although this period can be overridden with the _status_secs node attribute. The IP address shown here is the IP reported by the compute node and should match the IP provided in the node database object unless the database has been changed and the node has not yet been rebooted.
Compute node n1 is currently down because of a "boot timeout".
This means that the node attempted to boot, and the node's initial
"up" status message to the head node was not received.
This could happen
due to a boot failure such as a missing network driver, a networking
failure preventing the node from communicating with the head node, or
if the cw-status-updater
service provided by the
clusterware-node package is not running on the compute node.
Other possible values for down_reason include "node stopped
sending status" or "clean shutdown".
There is no status information about n2 because it was added to the
system and has never been booted. Additional node status can be viewed
with scyld-nodectl status -L
(an abbreviation of --long-long
)
that includes the most recent full
hostname, kernel command line, loaded modules, loadavg, free RAM,
kernel release, and SELinux status. As with other scyld-*ctl
commands, the output can also be provided as JSON to simplify parsing
and scripting.
For large clusters the --long
(or -l
) display can be unwieldy,
so the status command defaults to a summary.
Each row of output corresponds to a different node status and lists
the nodes in a format that can then be passed to the --ids
argument of scyld-nodectl
. Passing an additional --refresh
argument will cause the tool to start an ncurses application that will
display the summary in the terminal and periodically refresh the
display:
scyld-nodectl status --refresh
This mode can be useful when adding new nodes to the system by booting them one at a time as described in Node Creation with Unknown MAC address(es).