Figure D.3 shows a sample action script that sets a node offline in a resource management system, namely Torque, if an error occurred.
The script, called pbs_set_offline.sh, should be copied to the directory /etc/parastation/healthcheck/testsets/all/actions.
A symlink in the directory /etc/parastation/healthcheck/testsets/testset/actions should point to this script.
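The two installation steps above could be scripted roughly as follows. This is only a sketch: the function name install_action and its optional base-directory argument are illustrative and not part of ParaStation, and the relative link target is one possible choice.

```shell
#!/bin/bash
# Hypothetical helper (not part of ParaStation): installs an action script
# into the shared "all" actions directory and links it into one test set.
# Usage: install_action <script> <testset> [base-dir]
install_action() {
    local script="$1"
    local testset="$2"
    local base="${3:-/etc/parastation/healthcheck/testsets}"
    local name
    name=$(basename "$script")

    # Copy the script to the shared actions directory and make it executable.
    mkdir -p "$base/all/actions" "$base/$testset/actions" || return 1
    cp "$script" "$base/all/actions/$name" || return 1
    chmod 755 "$base/all/actions/$name" || return 1

    # Link it into the test set's actions directory (relative link target).
    ln -sfn "../../all/actions/$name" "$base/$testset/actions/$name"
}
```

For a test set named prologue, for example, the call would be install_action ./pbs_set_offline.sh prologue, run with sufficient privileges to write below /etc/parastation.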
After a failed healthcheck run with test set testset, the local node would be set offline with an appropriate comment, such as:
pshealthcheck - 2010-09-01 01:10:23 - 1/31 - ethernet_eth0 - prologue
Figure D.1. Example action script output
Use the command pbsnodes -ln to check for automatically off-lined nodes:
# pbsnodes -ln
jf49c02 offline pshealthcheck - 2010-09-01 \
  01:10:23 - 1/31 - ethernet_eth0 - prologue
...
Figure D.2. Sample pbsnodes output
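On a larger cluster it may help to list only the nodes that were set offline by the healthchecker rather than by an administrator. The following sketch filters pbsnodes -ln output on the note prefix; the function name list_healthcheck_offline is hypothetical, and the filter assumes the three-column layout (node, state, note) shown in Figure D.2.

```shell
#!/bin/bash
# Hypothetical filter (not part of ParaStation or Torque): reads
# "pbsnodes -ln" output on stdin and prints the names of offline nodes
# whose note starts with the word "pshealthcheck".
list_healthcheck_offline() {
    awk '$2 ~ /offline/ && $3 == "pshealthcheck" { print $1 }'
}
```

Typical use would be pbsnodes -ln | list_healthcheck_offline.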
#!/bin/bash
#
# ParaStation Healthchecker
#
# Copyright (C) 1999-2004 ParTec AG, Karlsruhe
# Copyright (C) 2005-2010 ParTec Cluster Competence Center GmbH,
# Munich
#
# Variables exported by calling pshealthcheck:
# VERBOSE, LOGGING, TESTSET,
# TS_COUNT_OK, TS_COUNT_WARN, TS_COUNT_ERR,
# TS_LIST_OK, TS_LIST_WARN, TS_LIST_ERR

MAX_PBS_NOTE="55"
HOSTNAME=`hostname`
FAILED_SCRIPTS="${TS_LIST_ERR//, /,}"

# calculate statistics
((TS_COUNT_TOTAL=TS_COUNT_OK+TS_COUNT_WARN+TS_COUNT_ERR))

# set the node offline
msg=$(pbsnodes -ln "$HOSTNAME" | awk '{print $3}' 2>/dev/null)

[ "$VERBOSE" -ge 2 ] && echo "Setting node '$HOSTNAME' in pbs offline ..."

if [ -z "$msg" ]; then
    if [ "${#FAILED_SCRIPTS}" -gt "$MAX_PBS_NOTE" ]; then
        FAILED_SCRIPTS="${TS_LIST_ERR:0:$MAX_PBS_NOTE-3}"
        FAILED_SCRIPTS="$FAILED_SCRIPTS..."
    fi
    new_msg="pshealthcheck - `date +%Y-%m-%d\ %H:%M:%S`"
    new_msg="${new_msg} - (${TS_COUNT_ERR}/${TS_COUNT_TOTAL})"
    new_msg="${new_msg} - $FAILED_SCRIPTS - ts $TESTSET"
    pbsnodes -o -N "${new_msg}" "$HOSTNAME" || {
        echo "ERROR: setting node offline failed!";
        exit 2;
    }
else
    pbsnodes -o "$HOSTNAME" || {
        echo "ERROR: setting node offline failed!";
        exit 2;
    }
fi
Figure D.3. Example test script
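The note-building part of the script in Figure D.3 can be tried out without a running Torque server. The following is a minimal sketch under stated assumptions: the function build_note and its arguments are hypothetical, the timestamp is passed in rather than taken from date, and, unlike Figure D.3, the truncation is applied to the comma-compacted list rather than to the raw TS_LIST_ERR value.

```shell
#!/bin/bash
# Hypothetical helper (not part of ParaStation): builds the node note in
# the same layout as the action script, without calling pbsnodes or date.
# Usage: build_note <timestamp> <err-count> <total> <failed-list> <testset> [max-len]
build_note() {
    local stamp="$1"
    local err="$2"
    local total="$3"
    local failed="${4//, /,}"   # compact ", " separators to ","
    local testset="$5"
    local max="${6:-55}"        # same default as MAX_PBS_NOTE

    # Shorten an overlong list and mark the cut with "...".
    if [ "${#failed}" -gt "$max" ]; then
        failed="${failed:0:max-3}..."
    fi

    printf '%s\n' "pshealthcheck - $stamp - ($err/$total) - $failed - ts $testset"
}
```

For example, build_note "2010-09-01 01:10:23" 1 31 "ethernet_eth0" prologue prints the note for a single failed test in test set prologue.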