Figure D.3 shows a sample action script that sets a node offline in a resource management system, here Torque, if an error occurred.
The script, named pbs_set_offline.sh, should be copied to the directory
/etc/parastation/healthcheck/testsets/all/actions.
A symlink in the directory
/etc/parastation/healthcheck/testsets/testset/actions,
where testset is the name of the test set, should point to this script.
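The two installation steps above can be sketched as a small shell function. This is only a sketch under stated assumptions: the function name install_action and its optional root argument are hypothetical (root exists so the steps can be tried in a scratch directory first), and marking the script executable is an assumption rather than a documented requirement.

```shell
#!/bin/bash
# Hypothetical helper, not part of ParaStation:
# copy an action script into the shared "all" directory and
# link it into the actions directory of one test set.
# Usage: install_action <script> <testset> [root]
install_action() {
    local script="$1" testset="$2" root="${3:-}"
    local base="$root/etc/parastation/healthcheck/testsets"
    local name
    name=$(basename "$script")

    # create the directories if missing (harmless if they exist)
    mkdir -p "$base/all/actions" "$base/$testset/actions" || return 1
    cp "$script" "$base/all/actions/$name" || return 1
    # assumption: action scripts have to be executable
    chmod +x "$base/all/actions/$name" || return 1
    # relative link from <testset>/actions to all/actions
    ln -sfn "../../all/actions/$name" "$base/$testset/actions/$name"
}
```

For example, install_action ./pbs_set_offline.sh prologue would enable the script for a test set named prologue.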
After a failed healthcheck run with test set testset, the local node
is set offline with an appropriate comment, such as:
pshealthcheck - 2010-09-01 01:10:23 - 1/31 - ethernet_eth0 - prologue
Figure D.1. Example action script output
Use the command pbsnodes -ln to check for nodes that were set offline automatically:
# pbsnodes -ln
jf49c02 offline pshealthcheck - 2010-09-01 \
01:10:23 - 1/31 - ethernet_eth0 - prologue
...
Figure D.2. Sample pbsnodes output
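To pick out only the nodes that were set offline by the healthchecker, the output can be filtered on the note. The following is a minimal sketch, assuming the output layout of Figure D.2 (node name, state, then the note, separated by whitespace); the function name list_healthcheck_offline is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the names of all nodes whose note
# starts with "pshealthcheck".
# Reads "pbsnodes -ln" style output on stdin:
#   <node> <state> <note ...>
list_healthcheck_offline() {
    awk '$2 ~ /offline/ && $3 == "pshealthcheck" { print $1 }'
}

# Usage on a Torque server:
#   pbsnodes -ln | list_healthcheck_offline
```

Nodes that were set offline by hand, with a different note or none at all, are not listed.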
#!/bin/bash
#
# ParaStation Healthchecker
#
# Copyright (C) 1999-2004 ParTec AG, Karlsruhe
# Copyright (C) 2005-2010 ParTec Cluster Competence Center GmbH,
# Munich
#
# Variables exported by calling pshealthcheck:
# VERBOSE, LOGGING, TESTSET,
# TS_COUNT_OK, TS_COUNT_WARN, TS_COUNT_ERR,
# TS_LIST_OK, TS_LIST_WARN, TS_LIST_ERR
MAX_PBS_NOTE="55"
HOSTNAME=$(hostname)
# drop the blank after each comma to keep the note short
FAILED_SCRIPTS="${TS_LIST_ERR//, /,}"
# calculate statistics
((TS_COUNT_TOTAL=TS_COUNT_OK+TS_COUNT_WARN+TS_COUNT_ERR))
# set the node offline
# (an empty $msg means that the node has no note yet)
msg=$(pbsnodes -ln "$HOSTNAME" 2>/dev/null | awk '{print $3}')
[ "${VERBOSE:-0}" -ge 2 ] && echo "Setting node '$HOSTNAME' in pbs offline ..."
if [ -z "$msg" ]; then
    # shorten the list of failed scripts so that it fits into the note
    if [ "${#FAILED_SCRIPTS}" -gt "$MAX_PBS_NOTE" ]; then
        FAILED_SCRIPTS="${FAILED_SCRIPTS:0:$MAX_PBS_NOTE-3}"
        FAILED_SCRIPTS="$FAILED_SCRIPTS..."
    fi
    new_msg="pshealthcheck - $(date '+%Y-%m-%d %H:%M:%S')"
    new_msg="${new_msg} - (${TS_COUNT_ERR}/${TS_COUNT_TOTAL})"
    new_msg="${new_msg} - $FAILED_SCRIPTS - ts $TESTSET"
    pbsnodes -o -N "${new_msg}" "$HOSTNAME" || {
        echo "ERROR: setting node offline failed!";
        exit 2;
    }
else
    # keep the existing note and only set the node offline
    pbsnodes -o "$HOSTNAME" || {
        echo "ERROR: setting node offline failed!";
        exit 2;
    }
fi
Figure D.3. Example action script
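The way the script keeps the list of failed tests within the note length can be tried in isolation. The sketch below mirrors that step: it removes the blank after each comma and, if the result is longer than the limit, cuts it and appends three dots. The function name shorten_list is hypothetical, and sed and cut are used instead of the bash expansions of the script so that the sketch also runs in a plain POSIX shell.

```shell
#!/bin/bash
# Hypothetical helper mirroring the truncation step of the action script.
# Usage: shorten_list <comma-separated list> [max length, default 55]
shorten_list() {
    local list max
    # "a, b, c" -> "a,b,c"
    list=$(printf '%s' "$1" | sed 's/, /,/g')
    max="${2:-55}"
    if [ "${#list}" -gt "$max" ]; then
        # keep max-3 characters and mark the cut with "..."
        list="$(printf '%s' "$list" | cut -c1-"$((max - 3))")..."
    fi
    printf '%s\n' "$list"
}
```

With a limit of 8, for instance, the ten-character list abcdefghij becomes abcde..., which is exactly 8 characters long.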