Move Nodes to and from Work Queue

If a compute node fails a health check and the remedy selected by the Auto Remediation Service (ARS) does not solve the issue, the compute node moves to the Work Queue for a cluster administrator to take action. After solving the problem, the cluster administrator should move the node back to the Provisioning state to be re-tested and re-deployed to the available pool of nodes.

The mqttpub script manually moves nodes between states, including from the Work Queue to Provisioning. Move a node manually when it requires human intervention or planned maintenance. The script must be run as root and requires three arguments:

- Node name
- Transition name. Possible values include:
  - human_intervention: transition from Work Queue to Provisioning
  - manual_drain: transition from Available to Work Queue
- Log message
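
The argument rules above can be checked before the script is invoked. The following is a minimal sketch, not part of ClusterWare: the function name move_node and the MQTTPUB override variable are hypothetical, and the check accepts only the two transitions listed here, although other transitions may exist.

```shell
#!/bin/sh
# Hypothetical wrapper around mqttpub; not shipped with ClusterWare.
# MQTTPUB may be overridden (for example, with "echo") for a dry run.
MQTTPUB="${MQTTPUB:-/opt/scyld/clusterware-ars/bin/mqttpub}"

move_node() {
    # Require exactly three arguments: node name, transition name, log message.
    if [ "$#" -ne 3 ]; then
        echo "usage: move_node <node name> <transition> \"<message>\"" >&2
        return 2
    fi
    # Accept only the transitions documented in this section (an assumption).
    case "$2" in
        human_intervention|manual_drain) ;;
        *)
            echo "unknown transition: $2" >&2
            return 2
            ;;
    esac
    # Pass the arguments through unchanged; mqttpub itself must be run by root.
    "$MQTTPUB" "$1" "$2" "$3"
}
```

Setting MQTTPUB=echo before calling move_node prints the arguments that would be passed instead of running the script, which may be useful for checking a command before running it as root.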

To move a remediated node from Work Queue to Provisioning, use the human_intervention transition:

/opt/scyld/clusterware-ars/bin/mqttpub <node name> human_intervention "<message>"

For example, node n1 was fixed after a manual reboot and is ready to be moved to the Provisioning state:

/opt/scyld/clusterware-ars/bin/mqttpub n1 human_intervention "Node rebooted"

In some cases, a node requires manual maintenance even though no error was detected, for example, when the node is scheduled for a hardware upgrade. You can manually move a node from Available to Work Queue using the manual_drain transition:

/opt/scyld/clusterware-ars/bin/mqttpub <node name> manual_drain "<message>"

With the manual_drain transition, the node moves from Available to Draining to Drained, and then directly to Work Queue, bypassing the Auto Remediation state.
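
When several nodes are scheduled for the same maintenance window, the manual_drain command can be repeated for each node. The following is a minimal sketch under stated assumptions: the function name drain_nodes, the MQTTPUB override variable, and the node names in the usage comment are all hypothetical, and the real script must still be run by root.

```shell
#!/bin/sh
# Hypothetical helper; MQTTPUB can be set to "echo" to preview the commands.
MQTTPUB="${MQTTPUB:-/opt/scyld/clusterware-ars/bin/mqttpub}"

drain_nodes() {
    # First argument is the log message; remaining arguments are node names.
    message="$1"
    shift
    for node in "$@"; do
        # Stop at the first failure so the problem can be inspected.
        "$MQTTPUB" "$node" manual_drain "$message" || return 1
    done
}

# Example (node names are placeholders):
# drain_nodes "Scheduled hardware upgrade" n1 n2 n3
```

Stopping at the first failed call is a design choice for this sketch; continuing through the list and reporting failures at the end would be an equally reasonable alternative.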