In addition to the pure communication tasks that form the primary requirement of parallel applications running on a cluster, further assistance from the runtime environment is needed in order to start up and control these applications.
The main requirements demanded from a cluster management framework are:
The management system should automatically detect usable nodes. The ability to administer these nodes from a central point of administration has to be provided.
If some nodes are temporarily unavailable due to hardware failure, maintenance or any other reason, the management system must not fail. This situation has to be handled gracefully, and normal operation with the remaining nodes has to be provided.
The management system has to provide a transparent mechanism to start parallel tasks. Load balancing has to be considered within this process. Furthermore, requirements for special resources provided only by some nodes have to be taken into account. The actual behavior should be widely configurable.
If some or all processes forming a parallel task fail unexpectedly, the management system has to reset the leftover processes and clean up the affected nodes in order to keep the cluster in a usable state.
ParaStation MPI fulfills all these requirements by providing a framework for cluster-wide job and resource management. The backbone of the ParaStation MPI management facility is formed by the ParaStation MPI daemons psid(8) running on every node of the cluster, constantly communicating with each other. Thus the management system is not located on a single node but is distributed throughout the whole cluster. This design prevents the creation of single points of failure.
The main tasks of the ParaStation MPI daemons are:
gathering and distribution of local information such as the local load values, the state of the locally controlled processes or the condition of the local communication interfaces,
status control of the cluster by receiving this information from all other daemons,
initialization and status control of the local communication layer,
startup of local processes on request of local or remote processes,
control of the locally started processes,
transmission of input and output of the locally controlled processes,
forwarding of signals to the locally controlled processes and
provision of an interface to the ParaStation MPI cluster management environment.
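The status-control tasks listed above can be illustrated with a small sketch. This is purely illustrative Python, not ParaStation MPI code: the names `NodeStatus`, `ClusterView` and `merge_report` are invented here, and the real psid protocol exchanges far richer information. The sketch only shows the basic idea that every daemon maintains a cluster-wide picture by merging the status reports received from its peers.

```python
# Hypothetical sketch of a daemon's cluster view -- all names are
# illustrative assumptions, not part of ParaStation MPI's actual API.
from dataclasses import dataclass


@dataclass
class NodeStatus:
    node_id: int
    load: float   # e.g. the node's load average
    jobs: int     # number of locally controlled processes
    up: bool = True


class ClusterView:
    """Each daemon holds such a view, updated from its peers' reports."""

    def __init__(self):
        self.nodes: dict[int, NodeStatus] = {}

    def merge_report(self, report: NodeStatus) -> None:
        # A newer report from a peer daemon simply replaces the old entry.
        self.nodes[report.node_id] = report

    def usable_nodes(self) -> list[int]:
        # Only nodes reported as up are candidates for new processes.
        return sorted(n.node_id for n in self.nodes.values() if n.up)


view = ClusterView()
view.merge_report(NodeStatus(0, load=0.3, jobs=1))
view.merge_report(NodeStatus(1, load=2.1, jobs=4))
view.merge_report(NodeStatus(2, load=0.0, jobs=0, up=False))
print(view.usable_nodes())  # node 2 is down and therefore excluded
```

Since every daemon runs the same merge logic on the same reports, each node independently arrives at the same picture of the cluster, which is what makes the fully distributed design possible.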
In order to provide fault-tolerant node management, ParaStation MPI uses the concept of virtual nodes, in contrast to the classical approach of handling static node lists. The idea behind this concept is not to manage nodes via lists of hostnames provided by the user, but to provide a pool of virtual nodes and to request nodes from this pool at startup time. These virtual nodes of course represent the physical nodes available. Users normally do not request specific nodes but a set of virtual nodes.
The main advantage of this concept becomes clear if one or more nodes are temporarily unavailable or even down. While the concept of static node lists requires the user to change these lists manually, within the setup chosen by ParaStation MPI the pool simply becomes smaller due to the missing nodes. Thus the user does not have to change the syntax or any configuration in order to get the required number of nodes, at least as long as enough nodes are available.
Although requesting selected nodes is not necessary due to the virtual node concept, it is still possible. This behavior is quite useful when a virtual partitioning of the cluster, controlled by an external batch system, is desired.
Normally the user will request any N nodes. ParaStation MPI then decides which nodes are available and best fit the requirements of the user. These nodes will be given to the user, and the parallel task will be spawned across these nodes.
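The "any N nodes" request can be sketched as follows. This is a minimal illustration under assumed policy, not the actual ParaStation MPI selection algorithm (which is configurable and considers more criteria than load): the pool of virtual nodes shrinks when nodes are down, and a request simply draws the best-fitting nodes from whatever remains.

```python
# Illustrative sketch only: pick_nodes and the load-based policy are
# assumptions for this example, not ParaStation MPI's real algorithm.

def pick_nodes(pool: dict[int, float], n: int) -> list[int]:
    """pool maps node id -> current load; return the n least-loaded nodes.

    The request only fails if the pool has shrunk below the requested
    size -- the sole situation in which a missing node becomes visible
    to the user at all.
    """
    if n > len(pool):
        raise RuntimeError(f"only {len(pool)} of {n} requested nodes available")
    return sorted(pool, key=pool.get)[:n]


# Node 2 is down, so it simply does not appear in the pool.
pool = {0: 0.5, 1: 0.1, 3: 1.7, 4: 0.0}
print(pick_nodes(pool, 3))  # -> [4, 1, 0]
```

Note that the caller never names hosts: the same request works unchanged whether the cluster has four usable nodes or four hundred, which is exactly the point of the virtual node pool.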
Further requests for virtual nodes posted by other users will recognize the processes of this parallel task and take them into account when new processes have to be spawned.
The details of process distribution during the startup of parallel tasks will be discussed in Chapter 3.
Within ParaStation MPI, fault tolerance is provided in many ways. First of all, the concept of virtual nodes provides resilience against node failure in the sense that the cluster as a whole will keep working even if several nodes are temporarily unavailable.
In addition, ParaStation MPI offers mechanisms to increase the reliability and stability of the cluster system. E.g., if a node fails while it is utilized by a parallel task, this task will be shut down in a controlled manner. Thus the remaining nodes that were used by this task will be cleaned up and released for further jobs.
The distributed concept of ParaStation MPI's management facilities makes the administration of the cluster feasible even if some nodes are down. Furthermore, it prevents the emergence of a single point of failure that would render the whole cluster unusable due to a local failure on just one node.
The management facility of ParaStation MPI offers a complete process management system. ParaStation MPI recognizes and controls dependencies between the processes forming a parallel task on various nodes of the cluster.
The process management includes the creation of processes on remote nodes, the control of the I/O channels of these remotely started processes, and the management of signals across node boundaries.
In contrast to the spawning mechanism used by many other cluster management systems, i.e. spawning via an rsh/ssh mechanism, the startup of remote processes via ParaStation MPI is very fast, since the ParaStation MPI daemon psid(8) is constantly on standby to start these processes. No further login or authentication overhead is necessary.
Since ParaStation MPI knows about the dependencies between the processes forming a parallel task, it is able to take them into account. The processes are no longer independent but form a task, in the same sense as the nodes are no longer independent computers but form the cluster as a single system. The fact that ParaStation MPI handles distributed processes as a unit plays an important role, especially in the context of job control and error management discussed in the next section.
Furthermore, ParaStation MPI takes care that output produced by remote processes is forwarded to the intended destination. This is usually the controlling tty of the I/O handling process, i.e. the process that was initially started in order to bring up the parallel task, but it might also be a regular file the output is redirected to. Input directed to the parallel task is forwarded, too. By default it is sent to the process with rank 0, but it might be addressed to any process of the parallel task.
Last but not least, ParaStation MPI handles the propagation of signals. This means that signals sent to the I/O handling process will be forwarded to all processes of the task.
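The fan-out described above can be modeled with a few lines of Python. This is a hedged sketch, not ParaStation MPI code: `ParallelTask` and `forward_signal` are invented names, and the real mechanism delivers the signals across node boundaries via the daemons rather than recording them in a list.

```python
# Illustrative model of signal propagation -- class and method names
# are assumptions made for this example only.
import signal


class ParallelTask:
    def __init__(self, procs):
        # procs: list of (node, pid) pairs forming the parallel task
        self.procs = list(procs)
        self.delivered = []   # stands in for actual signal delivery

    def forward_signal(self, signum):
        # A signal hitting the I/O handling process is fanned out to
        # every process of the task, regardless of which node it runs on.
        for node, pid in self.procs:
            self.delivered.append((node, pid, signum))
        return len(self.procs)


task = ParallelTask([(0, 4711), (1, 4712), (2, 4713)])
n = task.forward_signal(signal.SIGTERM)
print(f"forwarded SIGTERM to {n} processes")
```

From the user's point of view the effect is that a single Ctrl-C in the controlling terminal behaves as if the whole distributed task were one local process.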
Besides the job control already discussed in the previous section, error management plays an important role in running a cluster without much administrative engagement on a day-to-day basis. It is not acceptable to have to clean up the remaining processes of a crashed parallel task by hand, to lose output due to erroneous processes, or to leave the user without meaningful error messages after a parallel task has failed.
ParaStation MPI supports the administrator and the end user in this area. Since the management facility controls all processes that were registered with ParaStation MPI, it is capable of taking action in case one of the processes fails.
If the unexpected failure of a process is recognized, all processes within the corresponding parallel task will be notified. The parent process of the parallel task will be notified in any case; all other processes will be signaled only on request.
Furthermore all necessary measures will be taken in order to clean up the resources that were allocated by the crashed process.
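The notification and cleanup policy just described can be summarized in a short sketch. Again, this is an illustration of the stated policy under assumed names (`handle_failure` and the task dictionary layout are invented), not ParaStation MPI's implementation.

```python
# Sketch of the failure policy: the parent is always notified, other
# processes only on request, and the crashed process's resources are
# released. All names and data layouts are assumptions for this example.

def handle_failure(task: dict, failed_rank: int) -> list[int]:
    notified = []
    for rank, proc in task["procs"].items():
        if rank == failed_rank:
            continue
        if proc["is_parent"] or proc["wants_notice"]:
            notified.append(rank)
    # Clean up whatever the crashed process had allocated on its node.
    task["resources"].pop(failed_rank, None)
    return sorted(notified)


task = {
    "procs": {
        0: {"is_parent": True,  "wants_notice": False},
        1: {"is_parent": False, "wants_notice": True},
        2: {"is_parent": False, "wants_notice": False},
        3: {"is_parent": False, "wants_notice": False},
    },
    "resources": {r: f"allocations of rank {r}" for r in range(4)},
}
print(handle_failure(task, failed_rank=3))  # parent (0) always, rank 1 on request
```

Because the cleanup releases the crashed process's allocations immediately, the affected node returns to the pool and stays usable for subsequent jobs without administrator intervention.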
Besides the fact that all output that was “on the line” when the failure took place will reach its final destination, the user will get feedback about the kind of error that crashed the process. This feedback is usually of the form “Got sigxxx on node nodenum”.
Besides the inherent management capabilities of ParaStation MPI, it is prepared for easy interaction with more evolved batch systems such as LSF, OpenPBS or PBS PRO. This enables a ParaStation MPI cluster to be embedded into an existing environment or even to form a node within a Grid.
The integration of a ParaStation MPI cluster into an existing environment spares the end user from having to learn yet another runtime environment. Furthermore, it makes the use of the cluster more streamlined with an existing site policy concerning the use of supercomputing facilities.