process_placement, spawning — Process placement strategies within ParaStation MPI
The placement of processes for a parallel task within ParaStation MPI can be biased by various environment variables. Setting these variables is optional. If none are set, the default behavior ensures that the usability of the cluster is not jeopardized.
As long as overbooking of CPUs is not explicitly requested, the algorithm ensures that only one process per CPU is spawned. This behavior ensures that parallel applications influence each other as little as possible.
If no environment variables are set, ParaStation MPI tries to select the nodes where the fewest compute processes are running and which have the lowest system load. Typically, these nodes are unused. On these nodes, processes are spawned in a manner that places consecutive ranks on the same node, if possible. If there are not enough CPUs available, the spawning facility will neither wait for free CPUs nor overbook CPUs.
While starting up a parallel task, the following environment variables control the creation of the temporary node list used internally for spawning processes. Defining one of these variables enables the user to control the placement of processes. Likewise, batch systems may use these variables to place parallel jobs on dedicated nodes.
Contains a comma-separated list of node ID ranges. Each node ID range consists of a single node ID (a numerical value) or a range of node IDs, given as the first and the last ID separated by a "-", both inclusive.
E.g. defining the environment variable
PSI_NODES="0,1,3,17-20" # must be exported!
will enable the nodes with IDs 0, 1, 3 and 17 up to (and including) 20 to form the partition used by all subsequent parallel tasks.
Contains a list of host names, separated by whitespace.
Defining the environment variable
PSI_HOSTS="node0 node1 node3 node17 node18 node19 node20"
will enable the nodes 0, 1, 3 and 17 up to 20 to form the partition used by all subsequent parallel tasks.
Contains a filename listing all desired nodes by name, one per line. A particular node may be listed several times. Depending on further environment variables, especially PSI_NODES_SORT, such a node will then be used more than once. This behavior is important for nodes housing more than one CPU.
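For illustration, such a file could look like this (the variable name PSI_HOSTFILE is an assumption here; like the variables above, it must be exported):

```shell
# Host file listing the desired nodes by name, one per line;
# a node listed twice may be used twice, e.g. a dual-CPU node.
cat > /tmp/hosts.txt <<'EOF'
node0
node0
node1
node3
EOF

# Variable name assumed for illustration; must be exported.
export PSI_HOSTFILE=/tmp/hosts.txt
```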
The environment variables will be evaluated in the listed order: the first defined variable will be used, all following ones will be ignored. E.g., if PSI_NODES is set, the other two variables will not be recognized.
If none of these variables is set, all nodes within the cluster will be taken into account for the temporary node list.
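The evaluation order can be sketched as a small shell helper; PSI_HOSTFILE is an assumed name for the host-file variable described above:

```shell
# Returns which source would be used for the temporary node list,
# following the documented evaluation order: the first defined
# variable wins, all following ones are ignored.
pick_node_source() {
    if   [ -n "${PSI_NODES:-}"    ]; then echo PSI_NODES
    elif [ -n "${PSI_HOSTS:-}"    ]; then echo PSI_HOSTS
    elif [ -n "${PSI_HOSTFILE:-}" ]; then echo PSI_HOSTFILE  # name assumed
    else echo ALL_NODES  # nothing set: all nodes in the cluster are considered
    fi
}

# Even though PSI_HOSTS is set, PSI_NODES takes precedence.
( PSI_NODES="0,1,3" PSI_HOSTS="node5 node6"; pick_node_source )
```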
In a next step, the previously created temporary node list will be checked against various constraints that may prohibit the startup of processes on these nodes. In particular, the following verifications are made:
Is the node currently available? That is, is there currently a connection to the ParaStation MPI daemon on this node?
If a node is shut down or has crashed, the connection to the psid(8) on this node will time out, and the node will be declared "dead".
Is the node supposed to run processes?
Nodes can be excluded from running compute processes by setting the runJobs attribute to no for a node within the configuration file parastation.conf(5).
Is the node preallocated for other users or groups of users?
Nodes can be preallocated for a user or a group of users by means of the set user or set group command while running psiadmin(8).
Is the node currently used exclusively by another task?
Is the number of "regular" processes currently running on this node less than the maximum number of processes allowed on this node? See the set maxproc command of psiadmin(8).
Is the number of "regular" processes currently running on this node less than the number of CPUs available on this node?
ParaStation MPI will only count physical CPUs, even if Hyper-Threading is enabled. For more information on physical and logical CPUs, refer to the ParaStation MPI Administrator's Guide.
Are the communication hardware and the underlying protocol available?
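Taken together, these checks amount to filtering the temporary node list. A rough sketch, using an invented mock node table (name, alive, runJobs, running processes, maxproc, CPUs):

```shell
# A node survives the filter only if it is alive, allowed to run
# jobs, and its current process count is below both maxproc and
# the number of CPUs (i.e. no overbooking).
filter_nodes() {
    while read -r name alive runjobs procs maxproc cpus; do
        [ "$alive" = yes ]          || continue  # daemon reachable?
        [ "$runjobs" = yes ]        || continue  # runJobs attribute
        [ "$procs" -lt "$maxproc" ] || continue  # administrator limit
        [ "$procs" -lt "$cpus" ]    || continue  # one process per CPU
        echo "$name"
    done
}

filter_nodes <<'EOF'
node0 yes yes 0 8 4
node1 yes no  0 8 4
node2 no  yes 0 8 4
node3 yes yes 4 8 4
EOF
```

Only node0 passes: node1 is excluded via runJobs, node2 is dead, and node3 already runs as many processes as it has CPUs.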
As already mentioned, there are more environment variables influencing the selection of nodes for the temporary node list:
Only those nodes will be selected on which currently no other process is running. In addition, these nodes will be blocked for further tasks until the current task terminates.
Normally, only as many processes can run on a node as there are CPUs available. If this environment variable is set, this limitation no longer applies and any number of processes may run on a node. Currently, defining this variable also enforces the exclusive use of the nodes, as described above. For both variables, it is sufficient that they are defined; the actual value will not be evaluated.
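For example, to allow overbooking it is enough to export the variable with an arbitrary value:

```shell
# The value itself is irrelevant; defining (and exporting) the
# variable is sufficient to allow overbooking of CPUs.
export PSI_OVERBOOK=1
```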
The temporary node list will be sorted according to the value of the environment variable PSI_NODES_SORT, if defined, or according to the configuration parameter PSINodesSort in parastation.conf(5). The following sorting criteria are available:
Sorting is done according to the number of currently running processes on each node.
The node list will be sorted according to system load of the last 1 minute.
The node list will be sorted according to system load of the last 5 minutes.
The node list will be sorted according to system load of the last 15 minutes.
The node list will be sorted according to the number of currently running processes per node (PROC) and the load average of the last minute (LOAD) of the particular node.
No sorting at all is done.
If neither the environment variable PSI_NODES_SORT is defined nor the parameter PSINodesSort in parastation.conf(5) is configured, the partition table will be sorted according to the number of processes per node (PROC). If the variable is defined but its value does not match a known criterion, an error will be reported. The value of this variable is not case-sensitive.
Besides the listed sorting criteria, there are additional ones applied afterwards:
Different number of CPUs
If there are nodes with different numbers of CPUs, nodes with a higher CPU count will be sorted before nodes with a lower CPU count.
Identical number of CPUs
Nodes with an equal CPU count will be sorted according to their ParaStation MPI ID.
These criteria will enforce an explicit order of the temporary node list for all possible states of the particular nodes.
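The resulting overall order, here for sorting by process count (PROC) with the two additional criteria applied, can be reproduced with sort(1) on a mock table of "ID processes CPUs" triples:

```shell
# Mock table: "ID processes CPUs", one node per line.
nodes='3 2 4
1 0 4
2 0 8
0 2 4'

# Fewest running processes first (PROC), then higher CPU count
# first, then lower ParaStation MPI ID first.
sorted=$(printf '%s\n' "$nodes" | sort -k2,2n -k3,3nr -k1,1n)
printf '%s\n' "$sorted"
```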
The actual distribution of the processes onto the nodes of the temporary node list is controlled by the two environment variables PSI_LOOP_NODES_FIRST and PSI_OVERBOOK. Depending on whether these variables are defined, the processes of a parallel task will be spread differently over the temporary node list.
Neither PSI_LOOP_NODES_FIRST nor PSI_OVERBOOK is defined:
Beginning with the first node of the temporary node list, processes will be placed on a node as long as its current process count is less than its number of CPUs. This continues until all processes are placed or the list is exhausted. If all nodes in the list are done and there are still processes left to place, the startup of the parallel task will be canceled.
PSI_LOOP_NODES_FIRST is defined:
Beginning with the first node of the temporary node list, one process will be placed on each node whose current process count is less than its number of CPUs. At the end of the list, the search wraps around to the beginning and continues as long as there are processes left to place. If no node remains whose process count is less than its number of CPUs, the startup of the parallel task will be canceled.
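These two strategies can be sketched as shell functions that map consecutive ranks onto a node list with a fixed CPU count per node (function names invented; the real placement also honors administrator limits such as maxproc):

```shell
# Fill each node up to its CPU count before moving on
# (default behavior, PSI_LOOP_NODES_FIRST not defined).
place_fill_first() {  # $1 = ranks, $2 = CPUs per node, rest = nodes
    ranks=$1; cpus=$2; shift 2
    rank=0
    for node in "$@"; do
        i=0
        while [ "$i" -lt "$cpus" ] && [ "$rank" -lt "$ranks" ]; do
            echo "rank$rank -> $node"
            rank=$((rank + 1)); i=$((i + 1))
        done
    done
}

# Place one process per node per pass, wrapping around at the
# end of the list (PSI_LOOP_NODES_FIRST defined).
place_round_robin() {  # $1 = ranks, $2 = CPUs per node, rest = nodes
    ranks=$1; cpus=$2; shift 2
    rank=0; pass=0
    while [ "$pass" -lt "$cpus" ] && [ "$rank" -lt "$ranks" ]; do
        for node in "$@"; do
            [ "$rank" -lt "$ranks" ] || break
            echo "rank$rank -> $node"
            rank=$((rank + 1))
        done
        pass=$((pass + 1))
    done
}
```

E.g., place_fill_first 4 2 n0 n1 places ranks 0 and 1 on n0 and ranks 2 and 3 on n1, whereas place_round_robin 4 2 n0 n1 alternates n0, n1, n0, n1.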
PSI_OVERBOOK is defined:
If there are at least as many "unused" CPUs on the nodes of the temporary node list as processes to start, the behavior is identical to the case where this variable is not defined.
If more processes are requested than unused CPUs are available, the algorithm distributes the processes evenly over all CPUs. The actual placement is done in the order of the temporary node list: each node is filled up with its calculated number of processes. Limits defined by the administrator, e.g. maxproc, will still be enforced. If not all processes can be placed this way, the startup of the parallel task will be canceled.
Both PSI_LOOP_NODES_FIRST and PSI_OVERBOOK are defined:
If there are enough "unused" CPUs available, the behavior of this combination is identical to the behavior described previously for "PSI_OVERBOOK is defined". Otherwise, the placement is done in a manner that distributes the processes evenly over all nodes in the temporary node list. For this purpose the node list is traversed cyclically, and on each visit one process is placed on the node. All defined limits will be obeyed. If it is not possible to place all processes, the startup of the parallel task will be canceled.
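The even distribution in the overbooked case can be sketched as follows; the proportional-share rule is an assumption for illustration, and the real algorithm additionally enforces limits such as maxproc:

```shell
# Evenly distribute $1 processes over nodes given as "name:cpus"
# arguments, proportional to each node's CPU count (sketch only).
distribute() {
    procs=$1; shift
    total=0
    for spec in "$@"; do total=$((total + ${spec#*:})); done
    assigned=0; n=$#; i=0
    for spec in "$@"; do
        i=$((i + 1))
        cpus=${spec#*:}
        if [ "$i" -eq "$n" ]; then
            share=$((procs - assigned))     # last node takes the remainder
        else
            share=$((procs * cpus / total)) # proportional share, rounded down
        fi
        assigned=$((assigned + share))
        echo "${spec%%:*} $share"
    done
}

distribute 10 n0:4 n1:4   # two 4-CPU nodes share 10 processes
```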
The environment variable PSI_WAIT controls the behavior if the startup of the task was previously canceled due to node constraints.
If it is not defined, an error will be reported and the startup of the task fails.
If this variable is defined, the startup request will be queued. Each time the resource allocation within the cluster changes, e.g. when a task terminates or a new node is detected, the queued startup requests will be reevaluated, in order, until the next request cannot be fulfilled. Requests are queued and dequeued in a "first come, first served" order. There is only one queue for the entire cluster. It is not possible for a request to bypass other requests; algorithms like "backfilling" are not implemented.
Each process started on a node, and all its child processes, may be pinned to a particular CPU-slot (= virtual core). Thus, these processes will not interfere with each other with respect to CPU cycles.
On process startup, the system will use the default mapping, as defined in the configuration file, to map CPU-slots to physical cores. In case this mapping is not appropriate, the environment variable __PSI_CPUMAP may be used to override the default mapping, provided user-based mapping is enabled.
In case a process creates child processes or uses threads, the variable PSI_TPP may be used to define how many CPU-slots will be allocated per process and therefore are available to that process. This is particularly interesting for applications using OpenMP; hence, the environment variable OMP_NUM_THREADS is honored as well.
If both variables are defined, PSI_TPP takes precedence; if none of them is defined, the value defaults to 1.
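For example, an OpenMP application using four threads per process might be started with both variables set consistently:

```shell
# Reserve four CPU-slots per process; OMP_NUM_THREADS is honored
# as well, so keep both values consistent.
export PSI_TPP=4
export OMP_NUM_THREADS=4
```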
Process pinning may be disabled for a particular job by defining the environment variable __PSI_NO_PINPROC. The value itself is not evaluated; it is sufficient that the variable is defined.
When doing so, it is almost always a good idea to also disable memory binding by defining the corresponding environment variable.
ps_environment(7), parastation.conf(5) and psiadmin(8)
The term process in this chapter refers to a compute process initiated by ParaStation MPI. Other processes running on a node, e.g. system processes or daemons, are not considered.