Using the Healthchecker within a job's epilogue

As with the prologue, the Healthchecker can be run from an epilogue script after a job terminates. To do so, copy the following script to the file /var/spool/torque/mom_priv/epilogue:

  #!/bin/bash
  #
  #               ParaStation Healthcheck
  # Copyright (C) 2009 ParTec Cluster Competence Center GmbH, Munich

  # Epilogue script arguments:
  export PBS_JOBID=$1
  export PBS_USER=$2
  export PBS_GROUP=$3
  export PBS_JOBNAME=$4
  export PBS_SESSION_ID=$5
  export PBS_LIMITS=$6
  export PBS_RESSOURCES=$7
  export PBS_QUEUE=$8
  export PBS_ACCOUNT=$9

  # start the ParaStation Healthchecker
  # set timeout: [-t 240]
  # log to syslog: [-l]
  export PATH=/sbin:/bin:/usr/sbin:/usr/bin:/opt/parastation/bin
  /opt/parastation/bin/ -t 240 -l epilogue \
    &> /tmp/pshealthcheck_epilogue.out

  # always exit with 0 so that Moab does not set the node down
  exit 0

Figure E.2. Sample epilogue file
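
Copying the script into place can be sketched as follows. This is a minimal sketch, not part of the ParaStation distribution: the helper name install_epilogue and the source file name are hypothetical, and the mode 0500 reflects the common Torque requirement that prologue and epilogue scripts be executable by root and not writable by anyone else. Check the documentation of your Torque version for the exact ownership and permission rules.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with ParaStation or Torque):
# copy a saved epilogue script into the mom_priv directory.
#   $1  path of the saved script
#   $2  target directory (defaults to the path named in the text above)
install_epilogue() {
  local src=$1
  local dest_dir=${2:-/var/spool/torque/mom_priv}
  # Mode 0500: readable and executable by the owner only, writable by nobody.
  # ASSUMPTION: run as root, so that the installed file is owned by root.
  install -m 0500 "$src" "$dest_dir/epilogue"
}
```

Run as root, a call such as `install_epilogue ./epilogue.sh` (with a hypothetical source file name) would place the script at /var/spool/torque/mom_priv/epilogue.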

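The example above runs the Healthchecker with the TEST SET epilogue. The -t 240 option overrides the TEST SET's timeout, setting it to 240 seconds; as with the prologue, a value this large is suitable only for very large systems. The -l option records each run in the system's logfile via syslog, and the Healthchecker's output is written to /tmp/pshealthcheck_epilogue.out. Because the script always exits with 0, a failing test does not cause the node to be set down in Moab.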
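
Before relying on the batch system to call the epilogue, it may be useful to run the script by hand with placeholder values for the nine positional arguments. The following is a sketch only: the helper name dry_run_epilogue and all argument values are made up for illustration, and running it against the real epilogue will start an actual Healthchecker run on that node.

```shell
#!/bin/bash
# Hypothetical helper: call an epilogue script with placeholder values
# for the nine positional arguments exported in the sample script.
#   $1  path of the script to run (defaults to the installed epilogue)
dry_run_epilogue() {
  local script=${1:-/var/spool/torque/mom_priv/epilogue}
  # All values below are invented sample data.
  local args=(
    "1234.master"        # $1 -> PBS_JOBID
    "alice"              # $2 -> PBS_USER
    "users"              # $3 -> PBS_GROUP
    "testjob"            # $4 -> PBS_JOBNAME
    "4711"               # $5 -> PBS_SESSION_ID
    "nodes=1"            # $6 -> PBS_LIMITS
    "walltime=00:00:10"  # $7 -> PBS_RESSOURCES
    "batch"              # $8 -> PBS_QUEUE
    ""                   # $9 -> PBS_ACCOUNT (empty in this example)
  )
  "$script" "${args[@]}"
  echo "epilogue exit status: $?"
}
```

With the sample epilogue the reported exit status should always be 0, since the script ends with exit 0; the actual test results are found in /tmp/pshealthcheck_epilogue.out and in the system's logfile.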