The ParaStation Healthchecker may be integrated into a resource management system to automatically check nodes before and after running a job. Failed nodes may be set off-line so that they are no longer used for further jobs.
This appendix describes the configuration for
Torque. Other batch queuing systems may use a
Figure E.1 shows a prologue script suitable for running the ParaStation Healthchecker on each node prior to the actual job start. In case of a problem, the local node may be automatically set off-line. See Appendix D for how to automatically set a node off-line.
If any prologue script fails with an exit code of
2, the job is terminated and immediately requeued.
Therefore, the job is typically re-run instantaneously on a
slightly different set of nodes.
The failed node is set off-line and an appropriate note is
appended to this node's resource list.
Removing a node and restarting the job is transparent to the users. Administrators should periodically check the list of off-lined nodes for corresponding messages. Alternatively, the action script may be extended to notify the administrator via email.
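The email notification mentioned above could be sketched roughly as follows. This is a minimal illustration, not part of ParaStation or Torque: the function names, the ADMIN_MAIL variable and its default recipient are all hypothetical, and the sketch assumes a standard mail(1) client is installed on the node.

```shell
#!/bin/bash
# Hypothetical helper functions; none of these names come from ParaStation
# or Torque.
ADMIN_MAIL="${ADMIN_MAIL:-root@localhost}"   # assumed recipient address

# Build the message body for a node that has been set off-line.
format_offline_notice() {
    local node="$1" jobid="$2" status="$3"
    printf 'Node %s was set off-line by the health check.\n' "$node"
    printf 'Job: %s\n' "$jobid"
    printf 'Healthchecker exit status: %s\n' "$status"
}

# Mail the notice to the administrator if a mail(1) client is available;
# otherwise do nothing, so a missing mail client never breaks the action script.
notify_admin() {
    local node="$1" jobid="$2" status="$3"
    if command -v mail >/dev/null 2>&1; then
        format_offline_notice "$node" "$jobid" "$status" \
            | mail -s "Node $node set off-line" "$ADMIN_MAIL"
    fi
}
```

The action script would then call something like notify_admin "$(hostname)" "$PBS_JOBID" "$PSHC_EXIT" after setting the node off-line. Keeping the message formatting separate from the sending step makes the text easy to check without actually sending mail.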
#!/bin/bash
# ParaStation Healthcheck
#
# Copyright (C) 2009 ParTec Cluster Competence Center GmbH, Munich
#
# Prologue script arguments:
export PBS_JOBID=$1
export PBS_USER=$2
export PBS_GROUP=$3
export PBS_JOBNAME=$4
export PBS_LIMITS=$5
export PBS_QUEUE=$6
export PBS_ACCOUNT=$7

# start the ParaStation Healthchecker
#
# set timeout:   [-t 240]
# log to syslog: [-l]
#
OLD_PATH=$PATH
export PATH=/sbin:/bin:/usr/sbin:/usr/bin:/opt/parastation/bin
/opt/parastation/bin/pshealthcheck.ng -t 240 -l prologue \
    &> /tmp/pshealthcheck_prologue.out
PSHC_EXIT="$?"
PATH=$OLD_PATH

# on any error force an exit status of "2" to requeue the job in
# torque
if [ "$PSHC_EXIT" -eq "0" ] || [ "$PSHC_EXIT" -eq "1" ]; then
    exit 0;
fi

exit 2;
Figure E.1. Sample prologue file
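The exit-status handling at the end of the script in Figure E.1 can be isolated into a small function, which makes the mapping easier to read and to test. The sketch below mirrors the logic of the figure: a Healthchecker status of 0 or 1 lets the job start, and any other status is turned into 2. The function name map_pshc_exit is an assumption introduced here for illustration.

```shell
#!/bin/bash
# Map a Healthchecker exit status to the status the prologue should return.
# Mirrors Figure E.1: 0 and 1 are treated as passing, everything else
# becomes 2 so that the job is requeued.
# map_pshc_exit is a hypothetical name, not part of ParaStation.
map_pshc_exit() {
    local status="$1"
    case "$status" in
        0|1) echo 0 ;;
        *)   echo 2 ;;
    esac
}
```

In a prologue this could be used as exit "$(map_pshc_exit "$PSHC_EXIT")" in place of the final if statement.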
This script has to be copied to the file
For more information about prologue scripts refer to the
The above example runs the Healthchecker with the test set
It re-defines the test set's timeout to 240 seconds, which is
suitable for very large systems only.
In addition, each run is recorded in the system's logfile using
the -l option of pshealthcheck.
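Since a 240 second timeout is intended for very large systems, sites may prefer to make the value configurable rather than hard-coding it. The following is a minimal sketch under that assumption. The function name pshc_command and the variable PSHC_TIMEOUT are hypothetical; only the -t and -l options and the prologue argument are taken from Figure E.1.

```shell
#!/bin/bash
# Assumed variable name; defaults to the timeout used in Figure E.1.
PSHC_TIMEOUT="${PSHC_TIMEOUT:-240}"

# Print the Healthchecker command line for a given timeout in seconds.
# Returns 1 without printing anything if the timeout is empty or contains
# non-digit characters.
pshc_command() {
    local timeout="$1"
    case "$timeout" in
        ''|*[!0-9]*) return 1 ;;
    esac
    printf '%s\n' "/opt/parastation/bin/pshealthcheck.ng -t $timeout -l prologue"
}
```

Calling pshc_command "$PSHC_TIMEOUT" prints the same command line as in the figure when PSHC_TIMEOUT is unset. A smaller value can be supplied through the environment, although the appropriate timeout for a given cluster size has to be determined locally.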