The ParaStation Healthchecker can be integrated into a resource management system to check nodes automatically before and after a job runs. Failed nodes can be set off-line so that they are no longer used for further jobs.
This appendix describes the configuration for
Torque. Other batch queuing systems may use a
similar approach.
Figure E.1 shows a prologue script suitable for running the ParaStation Healthchecker on each node before the actual job starts. If a problem is detected, the local node can be set off-line automatically. See Appendix D for how to set a node off-line automatically.
If any prologue script fails with an exit code of
2, the job is terminated and immediately
re-scheduled.
Therefore, the job is typically re-run instantaneously on a
slightly different set of nodes.
The failed node is set off-line and an appropriate note is
appended to this node's resource list.
Removing a node and restarting the job is transparent to users. Administrators should periodically check the list of off-lined nodes for corresponding messages. Alternatively, the action script may be extended to notify the administrator via email.
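One possible shape for such an email notification is sketched below. It is not part of the ParaStation distribution: the function names compose_offline_notice and notify_admin, the message wording, and the assumption that a mail command accepting "-s subject recipient" is installed are all hypothetical and need to be adapted to the local site.

```shell
#!/bin/bash
# Hypothetical helpers for an action script; not shipped with ParaStation.

# Build the notification text for a node that was set off-line.
#   $1 = node name, $2 = job id, $3 = healthcheck exit status
compose_offline_notice() {
    local node=$1 jobid=$2 status=$3
    printf 'Node %s was set off-line by the prologue healthcheck.\n' "$node"
    printf 'Job: %s\n' "$jobid"
    printf 'Healthcheck exit status: %s\n' "$status"
}

# Send the notice. The mail program is a parameter (default: mail, an
# assumption) so that sites with a different mailer can substitute it.
#   $1 = recipient, $2 = node, $3 = job id, $4 = status, $5 = mailer (optional)
notify_admin() {
    local rcpt=$1 node=$2 jobid=$3 status=$4 mailer=${5:-mail}
    compose_offline_notice "$node" "$jobid" "$status" |
        "$mailer" -s "Node $node set off-line" "$rcpt"
}
```

Inside an action script, a call might look like: notify_admin root "$(hostname)" "$PBS_JOBID" "$PSHC_EXIT". Keeping the message text separate from the sending step makes it possible to check the text without actually sending mail.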
#!/bin/bash
# ParaStation Healthcheck
#
# Copyright (C) 2009 ParTec Cluster Competence Center GmbH, Munich
#
# Prologue script arguments:
export PBS_JOBID=$1
export PBS_USER=$2
export PBS_GROUP=$3
export PBS_JOBNAME=$4
export PBS_LIMITS=$5
export PBS_QUEUE=$6
export PBS_ACCOUNT=$7
# start the ParaStation Healthchecker
#
# set timeout: [-t 240]
# log to syslog: [-l]
#
OLD_PATH=$PATH
export PATH=/sbin:/bin:/usr/sbin:/usr/bin:/opt/parastation/bin
/opt/parastation/bin/pshealthcheck.ng -t 240 -l prologue \
&> /tmp/pshealthcheck_prologue.out
PSHC_EXIT="$?"
PATH=$OLD_PATH
# on any error force an exit status of "2" to requeue the job in
# torque
if [ "$PSHC_EXIT" -eq 0 ] || [ "$PSHC_EXIT" -eq 1 ]; then
    exit 0
fi
exit 2
Figure E.1. Sample prologue file
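The exit-status handling at the end of Figure E.1 can be isolated into a small function, which makes the decision easy to check without running the Healthchecker. The following is a minimal sketch; the function name prologue_status is hypothetical. It mirrors the figure in treating the statuses 0 and 1 as passing and mapping every other status to 2. What status 1 signifies is not stated in this appendix.

```shell
#!/bin/bash
# Map a healthcheck exit status to a prologue exit status, following the
# logic of Figure E.1: 0 and 1 let the job start, anything else yields 2,
# which causes Torque to requeue the job.
#   $1 = exit status of the healthcheck run
prologue_status() {
    case "$1" in
        0|1) echo 0 ;;
        *)   echo 2 ;;
    esac
}
```

In the prologue itself, the final lines could then be reduced to: exit "$(prologue_status "$PSHC_EXIT")".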
This script has to be copied to the file
/var/spool/torque/mom_priv/prologue.
For more information about prologue scripts refer to the
Torque documentation.
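Copying the script can be sketched as follows. The function name install_prologue and the file mode 0500 are assumptions; Torque is generally documented to expect a prologue that is owned by root and not writable by other users, but the exact requirements should be verified against the Torque documentation for the installed version.

```shell
#!/bin/bash
# Hypothetical helper: install a prologue script into a Torque mom_priv
# directory. When run as root, the installed file is owned by root.
#   $1 = source script
#   $2 = mom_priv directory (optional, defaults to the path given above)
install_prologue() {
    local src=$1 dir=${2:-/var/spool/torque/mom_priv}
    # Mode 0500 (assumed): readable and executable by the owner only.
    install -m 0500 "$src" "$dir/prologue"
}
```

Passing the target directory as a second argument allows a trial run against a scratch directory before the real mom_priv directory is touched.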
The above example runs the Healthchecker with the test set
prologue.
It re-defines the test set's timeout to 240 seconds, which is
suitable for very large systems only.
In addition, each run is recorded in the system log (syslog) using
the -l option of pshealthcheck.