Appendix E. Including the Healthchecker in a resource management system

The ParaStation Healthchecker may be included within a resource management system to automatically check nodes before and after running a job. Failed nodes may be set off-line so that they are no longer used for further jobs.

This appendix describes the configuration for Torque. Other batch queuing systems may use a similar approach.

Using the Healthchecker within a job's prologue

Figure E.1 shows a prologue script suitable for running the ParaStation Healthchecker on each node prior to the actual job start. In case of a problem, the local node may be automatically set off-line. See Appendix D for how to automatically off-line a node.

If any prologue script fails with an exit code of 2, the job is terminated and immediately re-scheduled. Therefore, the job is typically re-run instantaneously on a slightly different set of nodes. The failed node is set off-line and an appropriate note is appended to this node's resource list.

Removing a node and restarting the job is transparent to the users. The administrators should periodically check the list of off-lined nodes for corresponding messages. Alternatively, the action script may be extended to notify the administrator via email.
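The email notification mentioned above could look like the following sketch. It is not part of ParaStation or Torque: the function name notify_admin and the variables ADMIN_MAIL and MAILER are hypothetical, and a mailx-compatible mail command is assumed to be installed on the node.

```shell
#!/bin/bash
# Hypothetical settings; neither variable is defined by ParaStation or Torque.
ADMIN_MAIL="${ADMIN_MAIL:-root@localhost}"
MAILER="${MAILER:-mail}"   # assumes a mailx-compatible "mail" command

# notify_admin NODE REASON
# Sends a short message naming the off-lined node and the reason.
notify_admin() {
    local node="$1" reason="$2"
    printf 'Node %s was set off-line by the Healthchecker.\nReason: %s\n' \
        "$node" "$reason" \
        | "$MAILER" -s "Healthcheck failed on $node" "$ADMIN_MAIL"
}
```

An action script could call it as notify_admin "$(hostname)" "prologue test set failed". How the action script learns the failure reason depends on the setup described in Appendix D.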

  #!/bin/bash
  #               ParaStation Healthcheck
  #
  # Copyright (C) 2009 ParTec Cluster Competence Center GmbH, Munich
  #

  # Prologue script arguments:
  export PBS_JOBID=$1
  export PBS_USER=$2
  export PBS_GROUP=$3
  export PBS_JOBNAME=$4
  export PBS_LIMITS=$5
  export PBS_QUEUE=$6
  export PBS_ACCOUNT=$7

  # start the ParaStation Healthchecker
  #
  # set timeout: [-t 240]
  # log to syslog: [-l]
  #
  OLD_PATH=$PATH
  export PATH=/sbin:/bin:/usr/sbin:/usr/bin:/opt/parastation/bin
  /opt/parastation/bin/pshealthcheck.ng -t 240 -l prologue \
    &> /tmp/pshealthcheck_prologue.out
  PSHC_EXIT="$?"
  PATH=$OLD_PATH

  # Exit status 0 or 1 from the Healthchecker lets the job start.
  # Any other status forces an exit status of 2, which makes Torque
  # requeue the job.
  if [ "$PSHC_EXIT" -eq 0 ] || [ "$PSHC_EXIT" -eq 1 ]; then
      exit 0
  fi
  exit 2
      

Figure E.1. Sample prologue file
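
The exit-status handling at the end of Figure E.1 can be isolated into a small function, which makes the decision easy to test without running the Healthchecker. This is only a sketch of the logic shown in the figure; the function name prologue_status is hypothetical.

```shell
#!/bin/bash
# prologue_status PSHC_EXIT
# Prints the exit status the prologue should return, following the
# logic of Figure E.1: 0 and 1 let the job start, anything else
# (including an empty or non-numeric value) requests a requeue.
prologue_status() {
    case "$1" in
        0|1) echo 0 ;;
        *)   echo 2 ;;
    esac
}
```

In a prologue, the last lines would then reduce to exit "$(prologue_status "$PSHC_EXIT")".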


This script must be copied to the file /var/spool/torque/mom_priv/prologue. For more information about prologue scripts, refer to the Torque documentation.
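
Copying the script can be done with a small helper such as the sketch below. The function name install_prologue is hypothetical. The restrictive mode 0500 and the root ownership reflect the assumption that Torque expects prologue scripts to be owned by root and not writable by other users; check the Torque documentation for the exact requirements of the installed version.

```shell
#!/bin/bash
# install_prologue SRC [DEST]
# Copies SRC to the Torque prologue location (or to DEST, if given).
install_prologue() {
    local src="$1"
    local dest="${2:-/var/spool/torque/mom_priv/prologue}"
    # 0500: read and execute for the owner only, no write access for anyone.
    install -m 0500 "$src" "$dest" || return 1
    # Changing ownership needs root; skip it otherwise (e.g. in a trial run).
    if [ "$(id -u)" -eq 0 ]; then
        chown root:root "$dest"
    fi
}
```

Run as root on each compute node, install_prologue ./prologue places the file at the path named above.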

The above example runs the Healthchecker with the test set prologue. Option -t overrides the test set's timeout with 240 seconds, a value suitable only for very large systems. In addition, option -l of pshealthcheck records each run in the system log (syslog).