Command-Line Monitoring of Nodes

You can use the cw-nodectl status tool to monitor ICE ClusterWare™ cluster performance and health. For example, you can view the status of all nodes in the cluster in two ways.

Terse status with basic information:

[admin@virthead]$ cw-nodectl status
n[0] up
n[1] down
n[2] new

Verbose status with detailed information:

[admin@virthead]$ cw-nodectl status --long
Nodes
  n0
    ip: 10.10.24.100
    last_modified: 2019-04-16 05:02:26 UTC (0:00:02 ago)
    state: up
    uptime: 143729.68

  n1
    down_reason: boot timeout
    ip: 10.10.42.102
    last_modified: 2019-04-15 09:03:20 UTC (19:59:08 ago)
    last_uptime: 59.61
    state: down

  n2: {}

From this sample verbose output, we can see:

  • Compute node n0 is up and has recently (2 seconds earlier) sent status information back to the head node. This status information is sent by each compute node to its parent head node once every 10 seconds, although this period can be overridden with the _status_secs node attribute. The IP address shown here is the IP reported by the compute node and should match the IP provided in the node database primitive unless the database has been changed and the node has not rebooted.

  • Compute node n1 is currently down because of a "boot timeout". This means that the node attempted to boot, but the head node never received the node's initial "up" status message. Possible causes include a boot failure such as a missing network driver, a networking failure that prevents the node from communicating with the head node, or the cw-status-updater service (provided by the clusterware-node package) not running on the compute node. Other possible values for down_reason include "node stopped sending status" and "clean shutdown".

  • Compute node n2 has no status information because it was added to the system and has never booted.
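The freshness reasoning in the first bullet (a status report every 10 seconds, with the age shown in parentheses) can be turned into a simple staleness check. The following Python sketch assumes the last_modified text keeps the shape shown in the sample output. The optional day prefix, the helper names age_seconds and is_stale, and the threshold of three missed periods are illustrative assumptions, not part of the ClusterWare tooling.

```python
import re

# Default reporting period stated in the text; _status_secs can override it.
STATUS_PERIOD_SECS = 10

# Matches the parenthesized age, e.g. "(0:00:02 ago)".
# The optional "N day(s), " prefix is an assumption; the sample shows only H:MM:SS.
AGE_RE = re.compile(r"\((?:(\d+) days?, )?(\d+):(\d{2}):(\d{2})(?:\.\d+)? ago\)")


def age_seconds(last_modified: str) -> int:
    """Return the age in whole seconds from a last_modified value."""
    match = AGE_RE.search(last_modified)
    if match is None:
        raise ValueError(f"no age found in {last_modified!r}")
    days, hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def is_stale(last_modified: str, missed_periods: int = 3,
             period_secs: int = STATUS_PERIOD_SECS) -> bool:
    """True if the node has missed more than `missed_periods` reports (hypothetical threshold)."""
    return age_seconds(last_modified) > missed_periods * period_secs


if __name__ == "__main__":
    for value in ("2019-04-16 05:02:26 UTC (0:00:02 ago)",
                  "2019-04-15 09:03:20 UTC (19:59:08 ago)"):
        print(value, "->", "stale" if is_stale(value) else "fresh")
```

With the sample values, n0 (2 seconds old) would count as fresh and n1 (almost 20 hours old) as stale, which matches the up and down states shown above.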

Additional node status can be viewed with cw-nodectl status -L (an abbreviation of --long-long), which includes the most recent full hostname, kernel command line, loaded modules, loadavg, free RAM, kernel release, and SELinux status. As with other cw-*ctl commands, the output can also be provided as JSON to simplify parsing and scripting by adding the --json argument.
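As an illustration of the scripting that JSON output makes possible, the Python sketch below groups nodes by state and collects the reasons for down nodes. The JSON layout is an assumption: it simply mirrors the verbose sample as a mapping from node name to status fields, and the real --json output may be structured differently. The SAMPLE text and the helper names nodes_by_state and down_reasons are hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical JSON mirroring the verbose sample above.
# ASSUMPTION: the actual --json layout may differ; adjust the lookups to match.
SAMPLE = """
{
  "n0": {"ip": "10.10.24.100", "state": "up", "uptime": 143729.68},
  "n1": {"ip": "10.10.42.102", "state": "down",
         "down_reason": "boot timeout", "last_uptime": 59.61},
  "n2": {}
}
"""


def nodes_by_state(json_text: str) -> dict:
    """Group node names by their reported state."""
    status = json.loads(json_text)
    groups = defaultdict(list)
    for name, fields in sorted(status.items()):
        # A node that has never booted has no fields; label it "new"
        # to match the terse summary.
        groups[fields.get("state", "new")].append(name)
    return dict(groups)


def down_reasons(json_text: str) -> dict:
    """Map each down node to its down_reason, or "unknown" if absent."""
    status = json.loads(json_text)
    return {name: fields.get("down_reason", "unknown")
            for name, fields in status.items()
            if fields.get("state") == "down"}


if __name__ == "__main__":
    print(nodes_by_state(SAMPLE))
    print(down_reasons(SAMPLE))
```

In practice the JSON text would come from running cw-nodectl with --json rather than from an embedded string. Keeping the parsing in functions that accept text makes them easy to test without a cluster.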

You can query a field in the status information using s[fieldname] or status[fieldname]. For example, the following lists all nodes with a ram_free value greater than 0 and includes the value of ram_free:

cw-nodectl -s 's[ram_free] > 0' ls -l

For large clusters, the --long (or -l) display can be unwieldy, so the status command defaults to a summary. Each row of output corresponds to a different node status and lists the nodes in a format that can then be passed to the --ids argument of cw-nodectl. Passing an additional --refresh argument causes the tool to start an ncurses application that displays the summary in the terminal and periodically refreshes the display:

cw-nodectl status --refresh

The --refresh mode can be useful when adding new nodes to the system by booting them one at a time as described in Node Creation with Unknown MAC address(es).
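Because each summary row pairs a node list with a status, a script can split the rows apart and reuse the node list. The following Python sketch assumes each row is a node expression followed by a space and the status, as in the terse sample. The names parse_summary and long_status_command are hypothetical, and placing --ids before the status subcommand is an assumption about argument order that should be checked against the cw-nodectl help output.

```python
def parse_summary(text: str) -> dict:
    """Map each status to the node expression from rows such as 'n[0] up'."""
    by_state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # ASSUMPTION: the node expression contains no spaces, so the first
        # space separates it from the status.
        ids, _, state = line.partition(" ")
        by_state[state.strip()] = ids
    return by_state


def long_status_command(ids: str) -> list:
    """Build an argument list for a verbose status of the given nodes.

    ASSUMPTION: --ids is accepted before the subcommand; verify locally.
    """
    return ["cw-nodectl", "--ids", ids, "status", "--long"]


if __name__ == "__main__":
    summary = "n[0] up\nn[1] down\nn[2] new\n"
    states = parse_summary(summary)
    if "down" in states:
        # Print the command instead of running it so the sketch has no side effects.
        print(" ".join(long_status_command(states["down"])))
```

Passing the arguments as a list (for example to subprocess.run) avoids shell quoting problems with the brackets in the node expression.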

Note

The ClusterWare platform also provides visual methods to monitor cluster performance and health, including the Nodes Page of the ClusterWare GUI and a Grafana dashboard.