process_placement, spawning — Process placement strategies within ParaStation MPI
The placement of processes for a parallel task within ParaStation MPI can be biased by various environment variables. Setting these variables is optional. If none are set, the default behavior ensures that the usability of the cluster is not jeopardized.
As long as overbooking of CPUs is not explicitly requested, the algorithm ensures that only one process per CPU is spawned. This behavior ensures that parallel applications influence each other as little as possible.
If no environment variables are set, ParaStation MPI tries to select the nodes where the fewest compute processes are running and which have the lowest system load. Typically, these nodes are unused. On these nodes, processes are spawned in a manner that places consecutive ranks on the same node, if possible. If there are not enough CPUs available, the spawning facility will neither wait for free CPUs nor overbook CPUs.
While starting up a parallel task, the following environment variables control the creation of the temporary node list used internally for spawning processes. Defining one of these variables enables the user to control the placement of processes. Likewise, batch systems may use these variables to place parallel jobs on dedicated nodes.
Contains a comma-separated list of node ID ranges. Each node ID range consists of a single node ID (a numerical value) or a range of node IDs, given as the first and the last ID separated by a "-", both inclusive.
E.g. defining the environment variable
PSI_NODES="0,1,3,17-20" # must be exported!
will enable the nodes with IDs 0, 1, 3 and 17 up to (and including) 20 to form the partition used by all subsequent parallel tasks.
Contains a list of host names, separated by whitespace.
Defining the environment variable
PSI_HOSTS="node0 node1 node3 node17 node18 node19 node20"
will enable the nodes 0, 1, 3 and 17 up to 20 to form the partition used by all subsequent parallel tasks.
Contains a filename listing all desired nodes by name, one per line. A particular node may be listed several times. Depending on further environment variables, especially PSI_NODES_SORT, such a node will then be used more than once. This behavior is important for nodes housing more than one CPU.
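For illustration, such a file could look like this (the variable name PSI_HOSTFILE is an assumption here; like the variables above, it must be exported):

```shell
# Host file listing the desired nodes by name, one per line;
# a node listed twice may be used twice, e.g. a dual-CPU node.
cat > /tmp/hosts.txt <<'EOF'
node0
node0
node1
node3
EOF

# Variable name assumed for illustration; must be exported.
export PSI_HOSTFILE=/tmp/hosts.txt
```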
The environment variables will be evaluated in the listed order: the first defined variable will be used, all following ones will be ignored. E.g., if PSI_NODES is set, the other two variables will not be recognized.
If none of these variables is set, all nodes within the cluster will be taken into account for the temporary node list.
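The evaluation order can be sketched as a small shell helper; PSI_HOSTFILE is an assumed name for the host-file variable described above:

```shell
# Returns which source would be used for the temporary node list,
# following the documented evaluation order: the first defined
# variable wins, all following ones are ignored.
pick_node_source() {
    if   [ -n "${PSI_NODES:-}"    ]; then echo PSI_NODES
    elif [ -n "${PSI_HOSTS:-}"    ]; then echo PSI_HOSTS
    elif [ -n "${PSI_HOSTFILE:-}" ]; then echo PSI_HOSTFILE  # name assumed
    else echo ALL_NODES  # nothing set: all nodes in the cluster are considered
    fi
}

# Even though PSI_HOSTS is set, PSI_NODES takes precedence.
( PSI_NODES="0,1,3" PSI_HOSTS="node5 node6"; pick_node_source )
```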
In a next step, the previously created temporary node list will be checked against various constraints that may prohibit the startup of processes on these nodes. In particular, the following verifications are made:
Is the node currently available? That is, is there currently a connection to the ParaStation MPI daemon on this node?
If a node is shut down or has crashed, the connection to the psid(8) on this node will time out, and the node will be declared "dead".
Is the node supposed to run processes?
Nodes can be excluded from running compute processes by setting the runJobs attribute to no for a node within the configuration file parastation.conf(5).
Is the node preallocated for other users or groups of users?
Nodes can be preallocated for a user or a group of users by means of the set user or set group command while running psiadmin(8).
Is the node currently used exclusively by another task?
Is the number of "regular" processes currently running on this node less than the maximum number of processes allowed on this node? See the set maxproc command of psiadmin(8).
Is the number of "regular" processes currently running on this node less than the number of CPUs available on this node?
ParaStation MPI will only count physical CPUs, even if Hyper-Threading is enabled. For more information on physical and logical CPUs, refer to the ParaStation MPI Administrator's Guide.
Are the communication hardware and the underlying protocol available?
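Taken together, these checks amount to filtering the temporary node list. A rough sketch, using an invented mock node table (name, alive, runJobs, running processes, maxproc, CPUs):

```shell
# A node survives the filter only if it is alive, allowed to run
# jobs, and its current process count is below both maxproc and
# the number of CPUs (i.e. no overbooking).
filter_nodes() {
    while read -r name alive runjobs procs maxproc cpus; do
        [ "$alive" = yes ]          || continue  # daemon reachable?
        [ "$runjobs" = yes ]        || continue  # runJobs attribute
        [ "$procs" -lt "$maxproc" ] || continue  # administrator limit
        [ "$procs" -lt "$cpus" ]    || continue  # one process per CPU
        echo "$name"
    done
}

filter_nodes <<'EOF'
node0 yes yes 0 8 4
node1 yes no  0 8 4
node2 no  yes 0 8 4
node3 yes yes 4 8 4
EOF
```

Only node0 passes: node1 is excluded via runJobs, node2 is dead, and node3 already runs as many processes as it has CPUs.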
As already mentioned, there are more environment variables influencing the selection of nodes for the temporary node list:
Only those nodes will be selected on which currently no other process is running. In addition, these nodes will be blocked for further tasks until the current task terminates.
Normally, only as many processes can run on a node as there are CPUs available. If this environment variable is set, this limitation no longer applies and any number of processes may run on a node. Currently, defining this variable also enforces the exclusive use of the nodes, as described above. For both variables, it is sufficient that they are defined; the actual value will not be evaluated.
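For example, to allow overbooking it is enough to export the variable with an arbitrary value:

```shell
# The value itself is irrelevant; defining (and exporting) the
# variable is sufficient to allow overbooking of CPUs.
export PSI_OVERBOOK=1
```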
The temporary node list will be sorted according to the value of the environment variable PSI_NODES_SORT, if defined, or according to the configuration parameter PSINodesSort in parastation.conf(5). The following sorting criteria are available:
Sorting is done according to the number of currently running processes on each node.
The node list will be sorted according to system load of the last 1 minute.
The node list will be sorted according to system load of the last 5 minutes.
The node list will be sorted according to system load of the last 15 minutes.
The node list will be sorted according to the number of currently running processes per node (PROC) and the load average of the last minute (LOAD) of the particular node.
No sorting at all is done.
If neither the environment variable PSI_NODES_SORT is defined nor the parameter PSINodesSort in parastation.conf(5) is configured, the partition table will be sorted according to the number of processes per node (PROC). If the variable is defined but its value does not match a known criterion, an error will be reported. The value of this variable is not case-sensitive.
Besides the listed sorting criteria, there are additional ones applied afterwards:
Different number of CPUs
If there are nodes with different numbers of CPUs, nodes with a higher CPU count will be sorted before nodes with a lower CPU count.
Identical number of CPUs
Nodes with an equal CPU count will be sorted according to their ParaStation MPI ID.
These criteria will enforce an explicit order of the temporary node list for all possible states of the particular nodes.
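The resulting overall order, here for sorting by process count (PROC) with the two additional criteria applied, can be reproduced with sort(1) on a mock table of "ID processes CPUs" triples:

```shell
# Mock table: "ID processes CPUs", one node per line.
nodes='3 2 4
1 0 4
2 0 8
0 2 4'

# Fewest running processes first (PROC), then higher CPU count
# first, then lower ParaStation MPI ID first.
sorted=$(printf '%s\n' "$nodes" | sort -k2,2n -k3,3nr -k1,1n)
printf '%s\n' "$sorted"
```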
The actual distribution of the processes onto the nodes of the temporary node list is controlled by the two environment variables PSI_LOOP_NODES_FIRST and PSI_OVERBOOK. Depending on whether these variables are defined, the processes of a parallel task will be spread differently over the temporary node list.
Neither PSI_LOOP_NODES_FIRST nor PSI_OVERBOOK is defined:
Beginning with the first node of the temporary node list, processes will be placed on a node as long as its current process count is less than its number of CPUs. This continues until all processes are placed or the list is exhausted. If all nodes in the list are done and there are still processes left to place, the startup of the parallel task will be canceled.
PSI_LOOP_NODES_FIRST is defined:
Beginning with the first node of the temporary node list, one process will be placed on each node whose current process count is less than its number of CPUs. At the end of the list, the search wraps around to the beginning and continues as long as there are processes left to place. If no node remains whose process count is less than its number of CPUs, the startup of the parallel task will be canceled.
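These two strategies can be sketched as shell functions that map consecutive ranks onto a node list with a fixed CPU count per node (function names invented; the real placement also honors administrator limits such as maxproc):

```shell
# Fill each node up to its CPU count before moving on
# (default behavior, PSI_LOOP_NODES_FIRST not defined).
place_fill_first() {  # $1 = ranks, $2 = CPUs per node, rest = nodes
    ranks=$1; cpus=$2; shift 2
    rank=0
    for node in "$@"; do
        i=0
        while [ "$i" -lt "$cpus" ] && [ "$rank" -lt "$ranks" ]; do
            echo "rank$rank -> $node"
            rank=$((rank + 1)); i=$((i + 1))
        done
    done
}

# Place one process per node per pass, wrapping around at the
# end of the list (PSI_LOOP_NODES_FIRST defined).
place_round_robin() {  # $1 = ranks, $2 = CPUs per node, rest = nodes
    ranks=$1; cpus=$2; shift 2
    rank=0; pass=0
    while [ "$pass" -lt "$cpus" ] && [ "$rank" -lt "$ranks" ]; do
        for node in "$@"; do
            [ "$rank" -lt "$ranks" ] || break
            echo "rank$rank -> $node"
            rank=$((rank + 1))
        done
        pass=$((pass + 1))
    done
}
```

E.g., place_fill_first 4 2 n0 n1 places ranks 0 and 1 on n0 and ranks 2 and 3 on n1, whereas place_round_robin 4 2 n0 n1 alternates n0, n1, n0, n1.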
PSI_OVERBOOK is defined:
If there are at least as many "unused" CPUs on the nodes of the temporary node list as processes to start, the behavior is identical to the case where this variable is not defined.
If more processes are requested than unused CPUs are available, the algorithm distributes the processes evenly over all CPUs. The actual placement is done in the order of the temporary node list: each node is filled up with its calculated number of processes. Limits defined by the administrator, e.g. maxproc, will still be enforced. If not all processes can be placed this way, the startup of the parallel task will be canceled.
Both PSI_LOOP_NODES_FIRST and PSI_OVERBOOK are defined:
If there are enough "unused" CPUs available, the behavior of this combination is identical to the behavior described previously for "PSI_OVERBOOK is defined". Otherwise, the placement is done in a manner that distributes the processes evenly over all nodes in the temporary node list. For this purpose the node list is traversed cyclically, and on each visit one process is placed on the node. All defined limits will be obeyed. If it is not possible to place all processes, the startup of the parallel task will be canceled.
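The even distribution in the overbooked case can be sketched as follows; the proportional-share rule is an assumption for illustration, and the real algorithm additionally enforces limits such as maxproc:

```shell
# Evenly distribute $1 processes over nodes given as "name:cpus"
# arguments, proportional to each node's CPU count (sketch only).
distribute() {
    procs=$1; shift
    total=0
    for spec in "$@"; do total=$((total + ${spec#*:})); done
    assigned=0; n=$#; i=0
    for spec in "$@"; do
        i=$((i + 1))
        cpus=${spec#*:}
        if [ "$i" -eq "$n" ]; then
            share=$((procs - assigned))     # last node takes the remainder
        else
            share=$((procs * cpus / total)) # proportional share, rounded down
        fi
        assigned=$((assigned + share))
        echo "${spec%%:*} $share"
    done
}

distribute 10 n0:4 n1:4   # two 4-CPU nodes share 10 processes
```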
The environment variable PSI_WAIT controls the behavior if the startup of the task was previously canceled due to node constraints.
If it is not defined, an error will be reported and the startup of the task fails.
If this variable is defined, the startup request will be queued. Each time the resource allocation within the cluster changes, e.g. when a task terminates or a new node is detected, the queued startup requests will be reevaluated, in order, until the next request cannot be fulfilled. Requests are queued and dequeued in a "first come, first served" order. There is only one queue for the entire cluster. It is not possible for a request to bypass other requests; algorithms like "backfilling" are not implemented.
Each process started on a node, and all its child processes, may be pinned to a particular CPU-slot (= virtual core). Thus, these processes will not interfere with each other with respect to CPU cycles.
On process startup, the system will use the default mapping, as defined in the configuration file, to map CPU-slots to physical cores. In case this mapping is not appropriate, the environment variable __PSI_CPUMAP may be used to override the default mapping, provided user-based mapping is enabled.
In case a process creates child processes or uses threads, the variable PSI_TPP may be used to define how many CPU-slots will be allocated per process and therefore are available to that process. This is particularly interesting for applications using OpenMP; hence, the environment variable OMP_NUM_THREADS is honored as well.
If both variables are defined, PSI_TPP takes precedence; if none of them is defined, the value defaults to 1.
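For example, an OpenMP application using four threads per process might be started with both variables set consistently:

```shell
# Reserve four CPU-slots per process; OMP_NUM_THREADS is honored
# as well, so keep both values consistent.
export PSI_TPP=4
export OMP_NUM_THREADS=4
```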
Process pinning may be disabled for a particular job by defining the environment variable __PSI_NO_PINPROC. The value itself is not evaluated; it is sufficient that the variable is defined.
When doing so, it is almost always a good idea to also disable memory binding by defining the corresponding environment variable.
ps_environment(7), parastation.conf(5) and psiadmin(8)
The term process in this chapter refers to a compute process initiated by ParaStation MPI. Other processes running on a node, e.g. system processes or daemons, are not considered.