Monitoring Scheduler Info#
Important
This software is a TECHNOLOGY PREVIEW that is being rolled out with limited features and limited support. Customers are encouraged to contact Penguin with suggestions for improvements, or ways that the tool could be adapted to work better in their environments.
The ICE ClusterWare™ platform provides a separate daemon process that reads data from one or more inputs and pushes that data into one or more endpoints. The supported inputs are the Slurm batch scheduler and ClusterWare itself. The supported outputs are ClusterWare, InfluxDB, or an archive file. By selecting appropriate inputs and outputs, one could read data from the ClusterWare software and write it into InfluxDB, or read from Slurm and write into the ClusterWare software.
The scheduler data will show up in the ClusterWare node attributes and can then be viewed with:
$ scyld-nodectl ls -l
Nodes
n0
attributes
_boot_config: DefaultBoot
_sched_state: idle
_sched_extra: Node is idle
_sched_full: { ... full JSON blob }
Loosely speaking, _sched_state is a one-word summary of the state
of the node (allocated, idle, down); _sched_extra is a one-line summary,
potentially giving basic info on why the node might be in that state
(e.g. not responding to pings might lead to a "down" indicator); and
_sched_full is a JSON dump of all the information the scheduler provided
for that node.
With the data in the nodes' attributes, admins can then use those attributes to select groups of nodes and target them with an action. For example, to list all nodes where Slurm is idle:
$ scyld-nodectl -s "attributes[_sched_state]==idle" ls
Nodes
n0
n1
n2
One could similarly reboot all "down" nodes, or remotely execute a command to restart slurmd on any problematic nodes.
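For instance, the same selector syntax could drive those actions. A sketch, assuming the `reboot` and `exec` subcommands are available in your `scyld-nodectl` installation:

```shell
# Reboot every node whose scheduler state is "down"
$ scyld-nodectl -s "attributes[_sched_state]==down" reboot

# Or remotely restart slurmd on those same nodes
$ scyld-nodectl -s "attributes[_sched_state]==down" exec systemctl restart slurmd
```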
Note
At present, only the Slurm scheduler is supported, though other batch schedulers will likely be supported in the future.
sched_watcher Deployment#
sched_watcher runs as a daemon process on a machine that has
network access to both the batch system controller, e.g. slurmctld,
and to the ClusterWare head node.
While one could run sched_watcher directly on a head
node, it is a better practice to run it on a ClusterWare
management node to fully isolate any network or CPU load that
it might generate.
For the sched_watcher server, the command-line tools for the
batch scheduler must be installed, and it will be helpful to have the
ClusterWare tools as well:
yum install clusterware-tools slurm-scyld-node
Copy the files from a head node:
scp -r /opt/scyld/clusterware/src/sched_watcher/* \
mywatcher:/path/to/dest
On the sched_watcher server, prepare the systemd service:
cp sched_watcher.service /etc/systemd/system/.
Modify the sched_watcher.conf file as needed (see below).
Create an authentication token using the scyld-adminctl tool:
scyld-adminctl token --lifetime 10d --outfile /tmp/cw_bearer_token
The default config file assumes /tmp/cw_bearer_token, but any
filename and path could be used. It is also possible to generate this
token elsewhere and scp it to the sched_watcher server.
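For example, reusing the hypothetical mywatcher hostname from the earlier scp step, the token could be generated on a head node and copied over:

```shell
# On a head node: generate the token, then copy it to the sched_watcher server
$ scyld-adminctl token --lifetime 10d --outfile /tmp/cw_bearer_token
$ scp /tmp/cw_bearer_token mywatcher:/tmp/cw_bearer_token
```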
Enable and start the service:
systemctl enable sched_watcher
systemctl start sched_watcher
Verify Data#
Once the sched_watcher tool is running, it should quickly
push data to the ClusterWare platform and InfluxDB. On a ClusterWare head
or management node, try:
scyld-nodectl ls -l
and verify that the _sched_state and other attributes are
now populated.
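To narrow the listing to just the scheduler attributes, the output can be filtered with standard tools, e.g.:

```shell
$ scyld-nodectl ls -l | grep _sched_
```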
Similarly, one can look in the monitoring GUI and the same data should be visible there.
Note
By default, the update cycle is every 30 seconds.
Config Settings#
The config file contains a main [sched_watcher] section, plus one
section for each input or output module: currently [clusterware],
[slurm], [influxdb], and [archive].
For sched_watcher, the following options are supported:
token_file_path = /tmp/cw_bearer_token
    Path to the authentication token file.

token_duration = 1h
    Since the auth-token will have some lifespan embedded within it, sched_watcher will periodically re-read the file, assuming that it will be refreshed prior to expiration. token_duration sets the time-frame for re-reading the file.

polling_interval = 30
    Sets the time between sending of updates to the ClusterWare software. A longer interval can potentially reduce the load on the system, but the data will be more out-of-date.

sched_type = slurm
    Sets the "type" of batch scheduler to retrieve data from. At present, only slurm is supported.

debug_level = 1
    Enables debugging output.

input = slurm
    A comma-separated list of input modules that will be used. At present, slurm and clusterware are supported.

output = clusterware, influxdb
    A comma-separated list of output modules that should be used. It can include one or more of: clusterware, influxdb, or archive. If admins do not wish to use InfluxDB/Telegraf monitoring, influxdb can be removed from this list.
For the ClusterWare platform, only one option is currently supported:
base_url = http://parent-head-node/api/v1/
    Sets the base URL for the ClusterWare platform. Best practice would be to run sched_watcher on a ClusterWare management node, so parent-head-node will be kept up-to-date and will always point at a valid head node.
For Slurm, only one option is currently supported:
base_path = /opt/scyld/slurm/bin
    Sets the base path to all of the Slurm command-line tools. This is where sched_watcher will look for the sinfo and squeue tools.
For InfluxDB, the options are:
base_url = udp://parent-head-node:8094
    Sets the base URL for the InfluxDB service. Best practice would be to run sched_watcher on a ClusterWare management node, so parent-head-node will be kept up-to-date and will always point at a valid head node.
include_sched = true
    For reduced data size, sched_watcher can enable or disable pushing the _sched_state information.

include_extra = true
    By default, sched_watcher will only push the _sched_state information. Setting this to true will also push the _sched_extra (one-line) summary into InfluxDB. At this time, there is no support for sending the full JSON data into InfluxDB.

include_cw_data = false
    Indicates whether the data from the ClusterWare platform should be written to the InfluxDB endpoint. For example, one might want to archive the ClusterWare data, but not send it to InfluxDB.

drop_cw_fields = *
    A simple filter system to allow some ClusterWare fields to be dropped, keeping all the others. The * is a wildcard that matches any number of any character.

keep_cw_fields = a.*
    A simple filter system that will keep certain ClusterWare fields even if they were otherwise selected by the drop_cw_fields filter. The * is a wildcard that matches any number of any character.
For the archive file output, the options are:
output_file = /tmp/cw_archive
    The full path to the archive file.

rotate_interval = 1d
    How often the archive file should be rotated (can use h for hours, d for days).

zip_prog = /usr/bin/gzip
    If given, the rotated (old) archive files will be compressed with the given tool to reduce storage requirements.

drop_cw_fields = *
    A simple filter system to allow some ClusterWare fields to be dropped, keeping all the others. The * is a wildcard that matches any number of any character.

keep_cw_fields = a.*
    A simple filter system that will keep certain ClusterWare fields even if they were otherwise selected by the drop_cw_fields filter. The * is a wildcard that matches any number of any character.
Notes#
- The code currently runs as root so that it can read the config file in /opt/scyld/clusterware and also read the admin-created token file.
- The sched_watcher tool cannot refresh the auth-token that it has been given, so some other (out-of-band) process must refresh that bearer token before it expires. For example, one could run a weekly cron job that executes the scyld-adminctl token command.
- Restart or reload sched_watcher after any changes to the config file.
    To reload the config and auth-token files: systemctl reload sched_watcher
    To completely restart the service: systemctl restart sched_watcher
- The archived data is in a straightforward key=value format. Each line includes time=<Unix timestamp> and host=<hostname>, followed by the data fields for that node.
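The weekly cron job suggested above could be sketched as a cron.d entry; the schedule, paths, and the mywatcher hostname are illustrative:

```
# /etc/cron.d/cw_bearer_token -- refresh the token every Sunday at midnight,
# then copy it to the sched_watcher server (hypothetical host "mywatcher")
0 0 * * 0  root  scyld-adminctl token --lifetime 10d --outfile /tmp/cw_bearer_token && scp /tmp/cw_bearer_token mywatcher:/tmp/cw_bearer_token
```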
- The raw data is "flattened" into a single-level set of dotted keys, with the key components abbreviated: clusterware becomes c, attributes becomes a, status becomes s, and hardware becomes h. For example, clusterware.attributes.foo would become c.a.foo=value.
- Simple filtering is available with some output modules via keep_cw_fields and drop_cw_fields:
    Both can be comma-separated lists of fields that should be included or excluded, and both can include a trailing wildcard (*).
    Fields matching keep_cw_fields are retained in the output even if there is a matching drop_cw_fields key.
    drop_cw_fields=* and keep_cw_fields=a.* drops all fields except the a.* fields (attributes).
    drop_cw_fields= (empty) and keep_cw_fields=* retains all fields.
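The drop/keep semantics can be illustrated with a small shell sketch over a sample archive line; the field names are hypothetical, and it assumes the time and host keys always pass through:

```shell
# A sample archive line in the flattened key=value format
line='time=1700000000 host=n0 c.a.foo=bar c.s.state=up c.h.cpus=16'

# Emulate drop_cw_fields=* with keep_cw_fields=a.*: drop everything
# except the time/host keys and the c.a.* (attributes) fields.
kept=""
for field in $line; do
  case $field in
    time=*|host=*|c.a.*) kept="$kept${kept:+ }$field" ;;
  esac
done
echo "$kept"    # time=1700000000 host=n0 c.a.foo=bar
```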
Example Config
[sched_watcher]
token_file_path = /tmp/cw_bearer_token
token_duration = 1h
polling_interval = 30
sched_type = slurm
debug_level = 1
# available inputs = clusterware, slurm
input = slurm
# available outputs = archive, clusterware, influxdb
output = clusterware
[clusterware]
base_url = http://parent-head-node/api/v1/
[slurm]
base_path = /opt/scyld/slurm/bin
[influxdb]
base_url = udp://parent-head-node:8094
include_sched = true
include_sched_extra = false
include_cw_data = false
# drop most fields ...
drop_cw_fields = *
# but keep these ...
keep_cw_fields = a.*
[archive]
output_file = /tmp/cw_archive
rotate_interval = 1d
zip_prog = /usr/bin/gzip
# drop_cw_fields = *
keep_cw_fields = *