Command-Line Monitoring of Nodes
You can use the cw-nodectl status tool to monitor ICE ClusterWare™ cluster performance
and health. For example, you can view the status of all nodes in the cluster in
two ways.
Terse status with basic information:
[admin@virthead]$ cw-nodectl status
n[0] up
n[1] down
n[2] new
Verbose status with detailed information:
[admin@virthead]$ cw-nodectl status --long
Nodes
  n0
    ip: 10.10.24.100
    last_modified: 2019-04-16 05:02:26 UTC (0:00:02 ago)
    state: up
    uptime: 143729.68
  n1
    down_reason: boot timeout
    ip: 10.10.42.102
    last_modified: 2019-04-15 09:03:20 UTC (19:59:08 ago)
    last_uptime: 59.61
    state: down
  n2: {}
From this sample verbose output, we can see:
Compute node n0 is up and has recently (2 seconds earlier) sent status information back to the head node. This status information is sent by each compute node to its parent head node once every 10 seconds, although this period can be overridden with the _status_secs node attribute. The IP address shown here is the IP reported by the compute node and should match the IP provided in the node database primitive unless the database has been changed and the node has not rebooted.
Compute node n1 is currently down because of a "boot timeout". This means that the node attempted to boot but its initial "up" status message was never received by the head node. Possible causes include a boot failure such as a missing network driver, a networking failure that prevents the node from communicating with the head node, or the cw-status-updater service (provided by the clusterware-node package) not running on the compute node. Other possible values for down_reason include "node stopped sending status" and "clean shutdown".
Compute node n2 has no status information because it was added to the system and has never booted.
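The age shown in parentheses after last_modified can be compared against the reporting period to spot nodes that have gone quiet. The following Python sketch illustrates the idea under stated assumptions: the function names are hypothetical, the "N day(s), " prefix for ages over 24 hours is a guess about the output format, and the threshold of three missed periods is arbitrary rather than the rule the head node itself applies.

```python
import re

# Matches the trailing "(H:MM:SS ago)" part of a last_modified value.
# The optional "N day(s), " prefix is an assumption about how ages over
# 24 hours are printed; the sample output only shows H:MM:SS.
_AGO = re.compile(r"\((?:(\d+) days?, )?(\d+):(\d{2}):(\d{2}) ago\)")


def age_seconds(last_modified):
    """Return the age in seconds encoded in a last_modified value."""
    match = _AGO.search(last_modified)
    if match is None:
        raise ValueError(f"no '(... ago)' suffix in {last_modified!r}")
    days, hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def is_stale(last_modified, period=10, missed=3):
    """True if more than `missed` reporting periods have elapsed.

    `period` defaults to the documented 10-second interval; `missed`
    is an arbitrary illustrative threshold.
    """
    return age_seconds(last_modified) > period * missed
```

With the sample output above, the value for n0 is not stale while the value for n1 is. If _status_secs has been overridden for a node, pass that value as period.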
Additional node status can be viewed with cw-nodectl status -L (an
abbreviation of --long-long), which includes the most recent full hostname,
kernel command line, loaded modules, loadavg, free RAM, kernel release, and
SELinux status. As with other cw-*ctl commands, the output can also be
provided as JSON to simplify parsing and scripting by adding the --json
argument.
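A script can consume the JSON form without parsing the human-readable layout. The following Python sketch is a generic helper, not part of the ClusterWare tooling: it runs whatever command it is given and decodes standard output as JSON. The name run_json is hypothetical, and the structure of the decoded object is not described here, so inspect it before relying on particular keys.

```python
import json
import subprocess
import sys


def run_json(argv):
    """Run a command and decode its standard output as JSON.

    Raises subprocess.CalledProcessError if the command exits non-zero
    and json.JSONDecodeError if the output is not valid JSON.
    """
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pretty-print the decoded output of the command given as arguments.
    print(json.dumps(run_json(sys.argv[1:]), indent=2))
```

Saved under a name of your choosing, the script could be invoked with the status command and the --json argument as its arguments, assuming the argument is accepted in the position you place it.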
You can query a field in the status information using s[fieldname] or
status[fieldname]. For example, the following lists all nodes with a
ram_free value greater than 0 and includes the value of ram_free:
cw-nodectl -s 's[ram_free] > 0' ls -l
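When such a query is issued from a script, building the argument list directly avoids shell quoting of the brackets and the > character. The following Python sketch reproduces the command above for an arbitrary field; the function name is hypothetical, and only the greater-than comparison is used because it is the only operator shown here.

```python
def status_query_argv(field, value, tool="cw-nodectl"):
    """Build the argv for listing nodes whose status field exceeds value."""
    # Same selector syntax as the example above: s[fieldname] > value
    selector = f"s[{field}] > {value}"
    return [tool, "-s", selector, "ls", "-l"]
```

The returned list can be passed as is to subprocess.run, which does not involve a shell and therefore needs no extra quoting.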
For large clusters, the --long (or -l) display can be unwieldy,
so the status command defaults to a summary.
Each row of output corresponds to a different node status and lists
the nodes in a format that can then be passed to the --ids
argument of cw-nodectl. Passing an additional --refresh
argument causes the tool to start an ncurses application that
displays the summary in the terminal and periodically refreshes the
display:
cw-nodectl status --refresh
The --refresh mode can be useful when adding new nodes to the system by
booting them one at a time as described in Node Creation with Unknown MAC address(es).
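Because each row of the default summary pairs a node list with a state, the output is easy to split in a script. The following Python sketch assumes that every non-empty row ends with a single-word state, as in the terse sample at the top of this page, and that each state appears on only one row; the function name is hypothetical.

```python
def parse_summary(text):
    """Map each state in the summary output to the node list on its row."""
    states = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last run of whitespace: node list first, state last.
        nodes, state = line.rsplit(None, 1)
        states[state] = nodes
    return states
```

For the terse sample shown earlier, the entry for "down" is "n[1]", which is the form that can be handed to the --ids argument.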
Note
The ClusterWare platform also provides visual methods to monitor cluster performance and health, including the Nodes Page of the ClusterWare GUI and a Grafana dashboard.