The ParaStation Healthchecker can be integrated into a resource management system to check nodes automatically before and after a job runs. Nodes that fail the check can be set off-line so that they are no longer used for further jobs.
    This appendix describes the configuration for
    Torque. Other batch queuing systems may use a
    similar approach.
  
Figure E.1 shows a prologue script suitable for running the ParaStation Healthchecker on each node before the actual job starts. If a problem is detected, the local node can be set off-line automatically; see Appendix D for how to off-line a node automatically.
      If any prologue script fails with an exit code of
      2, the job is terminated and immediately
      re-scheduled. 
      Therefore, the job is typically instantaneously re-run on a
      slightly different set of nodes.
      The failed node is set off-line and an appropriate note is
      appended to this node's resource list.
    
Removing a node and restarting the job is transparent to users. Administrators should periodically check the list of off-lined nodes for corresponding messages. Alternatively, the action script may be extended to notify the administrator via email.
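The periodic check of off-lined nodes could be automated along the following lines. This is a minimal sketch, not part of ParaStation: the function names and the ADMIN_MAIL variable are hypothetical, and it assumes that `pbsnodes -ln` prints one node per line in the form "name state note" and that a `mail` command is available on the server node.

```shell
#!/bin/bash
# Sketch: report off-lined nodes and their notes to the administrator.
# Assumption: `pbsnodes -ln` prints lines of the form "<node> <state> <note...>".

ADMIN_MAIL="${ADMIN_MAIL:-root}"   # hypothetical recipient, adjust as needed

# Read `pbsnodes -ln`-style lines on stdin and keep only those whose
# state field contains "offline" (e.g. "offline" or "down,offline").
filter_offline() {
    awk '$2 ~ /offline/ { print }'
}

# Mail the list of off-lined nodes; send nothing if the list is empty.
report_offline() {
    local report
    report="$(pbsnodes -ln 2>/dev/null | filter_offline)"
    if [ -n "$report" ]; then
        printf '%s\n' "$report" \
            | mail -s "Off-lined nodes on $(hostname)" "$ADMIN_MAIL"
    fi
}
```

Calling `report_offline` from a cron job on the Torque server would be one way to use it; only nodes that are currently marked off-line are reported.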
  #!/bin/bash
  #               ParaStation Healthcheck
  #
  # Copyright (C) 2009 ParTec Cluster Competence Center GmbH, Munich
  #
  # Prologue script arguments:
  export PBS_JOBID=$1
  export PBS_USER=$2
  export PBS_GROUP=$3
  export PBS_JOBNAME=$4
  export PBS_LIMITS=$5
  export PBS_QUEUE=$6
  export PBS_ACCOUNT=$7
  # start the ParaStation Healthchecker
  #
  # set timeout: [-t 240]
  # log to syslog: [-l]
  #
  OLD_PATH=$PATH
  export PATH=/sbin:/bin:/usr/sbin:/usr/bin:/opt/parastation/bin
  /opt/parastation/bin/pshealthcheck.ng -t 240 -l prologue \
    &> /tmp/pshealthcheck_prologue.out
  PSHC_EXIT="$?"
  PATH=$OLD_PATH
  # exit codes 0 and 1 of the Healthchecker let the job start;
  # any other code forces an exit status of "2" to requeue the
  # job in torque
  if [ "$PSHC_EXIT" -eq 0 ] || [ "$PSHC_EXIT" -eq 1 ]; then
      exit 0
  fi
  exit 2
      Figure E.1. Sample prologue file
      This script has to be copied to the file
      /var/spool/torque/mom_priv/prologue.
      For more information about prologue scripts refer to the
      Torque documentation.
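
Copying the script into place might look like the sketch below. The function name install_prologue is hypothetical. The sketch assumes that Torque expects the prologue to be owned by root, executable, and not writable by other users; the exact permission requirements should be checked against the Torque documentation.

```shell
#!/bin/bash
# Sketch: install a prologue script for pbs_mom.
# Usage: install_prologue SRC [DEST]
# DEST defaults to the path named in this appendix.
install_prologue() {
    local src="$1"
    local dest="${2:-/var/spool/torque/mom_priv/prologue}"

    # Refuse to install a script that does not even parse.
    bash -n "$src" || return 1

    # Copy with read/execute permission for the owner only.
    install -m 0500 "$src" "$dest" || return 1

    # Ownership can only be changed when running as root.
    if [ "$(id -u)" -eq 0 ]; then
        chown root "$dest" || return 1
    fi
}
```

For example, `install_prologue ./prologue` run as root on each compute node would place the file at the default location.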
    
      The above example runs the Healthchecker with the test set
      prologue. 
      It re-defines the test set's timeout to 240 seconds, which is
      suitable for very large systems only.
      In addition, each run is recorded in the system log (syslog)
      using the -l option of pshealthcheck.