Troubleshooting Head Nodes#
Head Node Filesystem Is 100% Full#
If a head node filesystem containing ICE ClusterWare™ data (typically the root filesystem) is 100% full, then the administrator cannot execute scyld-* commands and ClusterWare cluster operations will fail.
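To confirm which filesystem is actually full, check disk usage on the head node. The path below assumes the default ClusterWare installation location; adjust it if your ClusterWare data lives elsewhere:
# Report usage for all mounted filesystems
df -h
# Report the filesystem that holds the ClusterWare data
df -h /opt/scyld/clusterware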
Remove Unnecessary Objects from the ClusterWare Database#
Remove any unnecessary objects in the database that may be lingering after an earlier aborted operation:
sudo systemctl stop clusterware
sudo rm /opt/scyld/clusterware/storage/*.old.00
sudo systemctl start clusterware
If that does not release enough space to allow the scyld-* commands to execute, then delete the entire local cache of database objects:
sudo systemctl stop clusterware
sudo rm -fr /opt/scyld/clusterware/workspace/*
sudo systemctl start clusterware
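To verify how much space was released, re-check the filesystem and then retry a lightweight command; scyld-nodectl status is used here only as a convenient example, and any scyld-* command will do:
df -h /opt/scyld/clusterware
scyld-nodectl status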
Investigate InfluxDB Retention of Telegraf Data#
If you continue to see influxdb messages in /var/log/messages complaining of "no space left on device", or if the size of the /var/lib/influxdb/ directory is excessively large, then InfluxDB may be retaining too much Telegraf time series data, which is stored in shards. Examine with:
sudo systemctl restart influxdb
# View the summation of all the Telegraf shards
sudo du -sh /var/lib/influxdb/data/telegraf/autogen/
# View the space consumed by each Telegraf shard
sudo du -sh /var/lib/influxdb/data/telegraf/autogen/*
If the autogen directory or any particular autogen subdirectory shard consumes a suspiciously large amount of storage, then examine the retention policy with the influx tool:
sudo influx
and now within the interactive tool you can execute influx commands:
> show retention policies on telegraf
The current ClusterWare defaults are a duration of 168h0m0s (save seven shards of Telegraf data) and a shardGroupDuration of 24h0m0s (each shard spanning one 24-hour day). You can reduce the current retention policy, if that makes sense for your cluster, with a simple command. For example, reduce the above seven-shard duration to five days, thereby reducing the number of saved shards by two:
> alter retention policy "autogen" on "telegraf" duration 5d
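To confirm the change took effect, re-display the retention policies; the telegraf duration should now be reported as five days (typically shown as 120h0m0s):
> show retention policies on telegraf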
You can also delete individual unneeded shards. View the shards and their timestamps:
> show shards
and selectively delete any unneeded shard using its id number, which is found in the first column of the show shards output:
> drop shard <shard-id>
When finished, exit the influx tool with:
> exit
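After reducing the retention duration or dropping shards, you can re-check how much space InfluxDB is consuming to confirm that storage was actually reclaimed:
sudo du -sh /var/lib/influxdb/data/telegraf/autogen/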
See https://docs.influxdata.com/influxdb/v1.8/ for more documentation.
Remove Unnecessary Images and Repos#
If scyld-* commands can now execute, then view information for all images and repos, including their sizes:
scyld-imgctl ls -l
scyld-clusterctl repos ls -l
Consider selectively deleting unneeded images with:
scyld-imgctl -i <imageName> rm
and consider selectively deleting unneeded repos with:
scyld-clusterctl repos -i <repoName> rm
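For example, to remove a hypothetical image named oldcompute and a hypothetical repo named oldrepo (substitute the actual names reported by the ls -l commands above):
scyld-imgctl -i oldcompute rm
scyld-clusterctl repos -i oldrepo rm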
Move Large Directories#
If scyld-* commands still cannot execute, and if your cluster really does need all its existing images, boot configs, telegraf history, and other non-ClusterWare filesystem data, then consider moving extraordinarily large directories (e.g., /opt/scyld/clusterware/workspace/, as specified in /opt/scyld/clusterware/conf/base.ini) to another filesystem or even to another server, and/or adding storage space to the appropriate filesystem(s).
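One possible approach, sketched below, is to stop the ClusterWare service, move the workspace directory to a larger filesystem, and leave a symlink behind. The /data mount point is only an assumption for illustration; alternatively, edit /opt/scyld/clusterware/conf/base.ini so that the workspace path points at the new location:
sudo systemctl stop clusterware
# /data is assumed to be a larger filesystem mounted on this head node
sudo mv /opt/scyld/clusterware/workspace /data/clusterware-workspace
sudo ln -s /data/clusterware-workspace /opt/scyld/clusterware/workspace
sudo systemctl start clusterware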
Head Nodes Disagree About Compute Node State#
If two linked head nodes disagree about the status of the compute
nodes, this is usually due to clock skew between the head nodes. The
appropriate fix is to ensure that all head nodes are using the same
NTP / Chrony servers. The shared database includes the last time each
compute node provided a status update. If that time is too far in
the past, then a compute node is assumed to have stopped communicating
and is marked as "down". This mark is not recorded in the
database, but is instead applied as the data is returned to the
calling process, such as scyld-nodectl status.
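To check for clock skew, compare the system time and time-synchronization status on each head node. The commands below assume Chrony is in use; substitute the equivalent ntpd or timedatectl queries if your head nodes run a different time service:
# Run on every head node and compare the results
date
chronyc tracking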
Head Node Failure#
To avoid issues like an Out-Of-Memory condition or similarly preventable failure, head nodes should generally not participate in the computations executing on the compute cluster. As a head node plays an important management role, its failure, although rare, has the potential to impact significantly more of the cluster than the failure of individual compute nodes. One common strategy for reducing the impact of a head node failure is to employ multiple head nodes in the cluster. See Managing Multiple Head Nodes for details.
etcd Database Exceeds Size Limit#
The etcd database has a hard limit of 2GB.
If exceeded, then all scyld-* commands fail and
/var/log/clusterware/api_error_log will commonly grow in size as each
node's incoming status message cannot be serviced. The ClusterWare
api_error_log may also contain the following text:
etcdserver: mvcc: database space exceeded
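You can also confirm that etcd has raised its out-of-space alarm by querying it directly, using the same etcdctl path shown in the examples below; a NOSPACE alarm indicates the size limit has been reached:
sudo /opt/scyld/clusterware-etcd/bin/etcdctl alarm list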
Normally a background thread on the head node triggers the discarding of database history (called compaction) and, if deemed necessary, database defragmentation (called defrag). In the rare event that this thread stops executing, the etcd database grows until its size limit is reached.
This problem can be solved with manual intervention by an administrator. Determine whether the etcd database really does exceed its limit. For example:
[admin@head1]$ sudo du -hs /opt/scyld/clusterware-etcd/
2.1G /opt/scyld/clusterware-etcd
shows a size larger than 2GB, so you can proceed with the manual intervention.
First determine the current database revision. For example:
[admin@head1]$ sudo /opt/scyld/clusterware-etcd/bin/etcdctl get --write-out=json does_not_exist
{"header":{"cluster_id":9938544529041691203,"member_id":10295069852257988966,"revision":4752785,"raft_term":7}}
Subtract two or three thousand from the revision value 4752785 and compact to that new value:
[admin@head1]$ sudo /opt/scyld/clusterware-etcd/bin/etcdctl compaction 4750000
compacted revision 4750000
and trigger a defragmentation to reclaim space:
[admin@head1]$ sudo /opt/scyld/clusterware-etcd/bin/etcdctl defrag
Finished defragmenting etcd member[http://localhost:52379]
Then clear the alarm and reload the clusterware service:
[admin@head1]$ sudo /opt/scyld/clusterware-etcd/bin/etcdctl alarm disarm
[admin@head1]$ sudo systemctl reload clusterware
This restarts the head node thread that executes in the background and checks the etcd database size. Everything should now function normally.
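As a final check, confirm that the database is back under its limit and that scyld-* commands succeed again; scyld-nodectl status is one convenient example:
sudo du -hs /opt/scyld/clusterware-etcd/
scyld-nodectl status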