Using the Healthchecker within a job's epilogue

As with the prologue script, the Healthchecker may also be run after a job terminates, by means of an epilogue script. To do so, copy the following epilogue script to the file /var/spool/torque/mom_priv/epilogue:

  #!/bin/bash
  #               ParaStation Healthcheck
  #
  # Copyright (C) 2009 ParTec Cluster Competence Center GmbH, Munich
  #

  # Epilogue script arguments as passed by TORQUE
  # ($6: requested resource limits, $7: resources used by the job):
  export PBS_JOBID=$1
  export PBS_USER=$2
  export PBS_GROUP=$3
  export PBS_JOBNAME=$4
  export PBS_SESSION_ID=$5
  export PBS_LIMITS=$6
  export PBS_RESSOURCES=$7
  export PBS_QUEUE=$8
  export PBS_ACCOUNT=$9

  # start the ParaStation Healthchecker
  #
  # set timeout: [-t 240]
  # log to syslog: [-l]
  OLD_PATH=$PATH
  export PATH=/sbin:/bin:/usr/sbin:/usr/bin:/opt/parastation/bin
  /opt/parastation/bin/pshealthcheck.ng -t 240 -l epilogue \
    &> /tmp/pshealthcheck_epilogue.out
  # remember the exit status of the Healthchecker
  PSHC_EXIT="$?"
  PATH=$OLD_PATH

  # always exit with 0 to prevent the node from being set down in Moab
  exit 0
      

Figure E.2. Sample epilogue file


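The example above runs the Healthchecker with the TEST SET epilogue. It also overrides the TEST SET's timeout with a value of 240 seconds, which again is suitable for very large systems only. As before, each run is recorded in the system log. The output of the Healthchecker is written to /tmp/pshealthcheck_epilogue.out, and the script always exits with 0, so a failed check does not cause the node to be set down in Moab.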
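
Because the epilogue always exits with 0, the result of the check itself is not passed on to the caller. The following is a minimal sketch of one way to keep that result available inside the script. The helper run_and_record and the variables out and status are hypothetical names chosen for illustration, and a small stand-in command takes the place of the Healthchecker so that the sketch can be run on any machine.

```shell
#!/bin/bash
# Hypothetical helper (not part of ParaStation or TORQUE): run a command,
# send its stdout and stderr to a file, and print its exit status.
run_and_record() {
    local outfile=$1
    shift
    "$@" > "$outfile" 2>&1
    echo "$?"
}

# Temporary file standing in for /tmp/pshealthcheck_epilogue.out
out=$(mktemp)

# Stand-in for the Healthchecker call, so the sketch runs anywhere.
status=$(run_and_record "$out" sh -c 'echo "check failed"; exit 2')

echo "healthcheck exit status: $status"
```

In an actual epilogue, the stand-in would be replaced by the Healthchecker command line shown in Figure E.2, and the script would still end with exit 0. The recorded status could then, for example, be written to the system log with logger.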