The ParaStation Healthchecker may be integrated into a resource management system to automatically check nodes before and after running a job. Failed nodes may be set off-line so that they are no longer used for further jobs.
This appendix describes the configuration for Torque. Other batch queuing systems may use a similar approach.
Figure E.1 shows a prologue script suitable for running the ParaStation Healthchecker on each node prior to the actual job start. In case of a problem, the local node may be automatically set off-line. See Appendix D for how to automatically set a node off-line.
If any prologue script fails with an exit code of 2, the job is terminated and immediately re-scheduled. The job is therefore typically re-run instantaneously on a slightly different set of nodes. The failed node is set off-line and an appropriate note is appended to this node's resource list.
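The exit-status handling of the sample prologue in Figure E.1 can be isolated into a small function, which makes the rule easier to read and to test. This is a minimal sketch, not part of the ParaStation distribution: the function name prologue_status is hypothetical, and the mapping simply mirrors the sample script, where a Healthchecker exit status of 0 or 1 lets the job start and any other status requests a requeue.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with ParaStation): map the exit status
# of the Healthchecker to the exit status of the Torque prologue.
# Mirrors the sample prologue: 0 and 1 let the job start, any other
# status yields 2, which makes Torque requeue the job.
prologue_status() {
    local pshc_exit="$1"
    if [ "$pshc_exit" -eq 0 ] || [ "$pshc_exit" -eq 1 ]; then
        echo 0
    else
        echo 2
    fi
}

# Possible use at the end of a prologue script:
#   exit "$(prologue_status "$PSHC_EXIT")"
```

Keeping the mapping in one place means that a site which prefers a stricter policy only has to change this function.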
Removing a node and restarting the job is transparent to the users. The administrators should periodically check the list of off-lined nodes for corresponding messages. Alternatively, the action script may be extended to notify the administrator via email.
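As an illustration of the periodic check, the following sketch mails the list of off-line nodes together with their notes to an administrator. It is an assumption-laden example rather than part of the ParaStation or Torque distribution: the function name report_offline_nodes and the variables ADMIN_MAIL, PBSNODES and MAILER are hypothetical, a working mail command is assumed, and the installed Torque version is assumed to support pbsnodes -l -n for listing down or off-line nodes with their notes.

```shell
#!/bin/bash
# Sketch only: mail the off-line nodes and their notes to an administrator.
# ADMIN_MAIL is a hypothetical address; PBSNODES and MAILER may be
# overridden, for example for testing.
ADMIN_MAIL="${ADMIN_MAIL:-root@localhost}"
PBSNODES="${PBSNODES:-pbsnodes}"
MAILER="${MAILER:-mail}"

report_offline_nodes() {
    local nodes
    # Assumption: "pbsnodes -l -n" lists down/off-line nodes with notes.
    nodes="$("$PBSNODES" -l -n 2>/dev/null)"
    # Send nothing if no node is listed.
    if [ -z "$nodes" ]; then
        return 0
    fi
    printf '%s\n' "$nodes" \
        | "$MAILER" -s "Off-line nodes reported on $(hostname)" "$ADMIN_MAIL"
}

# Possible use from a cron job on the Torque server:
#   report_offline_nodes
```

The function stays silent when no node is listed, so it can run frequently without flooding the mailbox.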
#!/bin/bash
# ParaStation Healthcheck
#
# Copyright (C) 2009 ParTec Cluster Competence Center GmbH, Munich
#
# Prologue script arguments:
export PBS_JOBID=$1
export PBS_USER=$2
export PBS_GROUP=$3
export PBS_JOBNAME=$4
export PBS_LIMITS=$5
export PBS_QUEUE=$6
export PBS_ACCOUNT=$7

# start the ParaStation Healthchecker
#
# set timeout:    [-t 240]
# log to syslog:  [-l]
#
OLD_PATH=$PATH
export PATH=/sbin:/bin:/usr/sbin:/usr/bin:/opt/parastation/bin
/opt/parastation/bin/pshealthcheck.ng -t 240 -l prologue \
    &> /tmp/pshealthcheck_prologue.out
PSHC_EXIT="$?"
PATH=$OLD_PATH

# on any error force an exit status of "2" to requeue the job in
# torque
if [ "$PSHC_EXIT" -eq "0" ] || [ "$PSHC_EXIT" -eq "1" ]; then
    exit 0;
fi
exit 2;
Figure E.1. Sample prologue file
This script has to be copied to the file /var/spool/torque/mom_priv/prologue.
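Copying the prologue into place could look like the following sketch. The function name install_prologue is hypothetical. The mode 0500 reflects the common Torque requirement that the prologue be owned by root, readable and executable by root, and not writable by other users; this should be verified against the documentation of the installed Torque version. The function is meant to be run as root on each compute node, so that the installed file is owned by root.

```shell
#!/bin/bash
# Sketch only: copy a prologue script into the Torque MOM directory.
# Run as root so that the installed file is owned by root.
install_prologue() {
    local src="$1"
    # Default destination as named in this appendix.
    local dest="${2:-/var/spool/torque/mom_priv/prologue}"
    # Mode 0500: read and execute for the owner only (assumed requirement).
    install -m 0500 "$src" "$dest"
}

# Possible use:
#   install_prologue ./prologue
```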
For more information about prologue scripts, refer to the Torque documentation.
The above example runs the Healthchecker with the test set prologue. It re-defines the test set's timeout to 240 seconds, which is suitable for very large systems only. In addition, each run is recorded in the system's logfile using the -l option of pshealthcheck.
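Since a timeout of 240 seconds is meant for very large systems, a site may prefer to make the value adjustable instead of hard-coding it. The following is a minimal sketch under stated assumptions: the variables PSHC_BIN and PSHC_TIMEOUT and the function run_prologue_check are hypothetical names, and only the options -t and -l and the test set name prologue are taken from the sample above.

```shell
#!/bin/bash
# Sketch only: run the prologue test set with an adjustable timeout.
# PSHC_BIN and PSHC_TIMEOUT are hypothetical variables; the defaults
# repeat the values used in the sample prologue.
PSHC_BIN="${PSHC_BIN:-/opt/parastation/bin/pshealthcheck.ng}"
PSHC_TIMEOUT="${PSHC_TIMEOUT:-240}"

run_prologue_check() {
    # -t sets the timeout in seconds, -l logs the run to syslog.
    "$PSHC_BIN" -t "$PSHC_TIMEOUT" -l prologue
}

# Possible use inside a prologue script:
#   run_prologue_check &> /tmp/pshealthcheck_prologue.out
#   PSHC_EXIT="$?"
```

A smaller cluster could then set PSHC_TIMEOUT to a lower value before the prologue runs, without editing the call itself.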