ps_environment — ParaStation environment variables
The behavior of the ParaStation system when starting parallel tasks with mpiexec(1) or submitting serial jobs with psmstart(1) can be influenced by various environment variables.
Further variables modify the behavior of the logging facilities, which implement reliable forwarding of input and output.
A later section describes some less frequently used environment variables that affect the behavior of the MPICH system implementing the MPI interface on top of ParaStation.
The following environment variables are used during startup of parallel tasks or while distributing serial jobs throughout a cluster. Their values determine how the cluster is split into virtual partitions and which load-balancing strategy is applied.
OMP_NUM_THREADS
Defines the number of cores allocated for each process.
May be overridden by PSI_TPP.
PSI_EXCLUSIVE
Only unused nodes are considered for spawning new processes. In addition, the nodes chosen for the current job are locked for further jobs; consequently, no additional processes will be started on these nodes until the current job terminates.
This variable does not define how many processes of a job will be placed per node. See also PSI_OVERBOOK and set maxproc of psiadmin(8).
PSI_EXPORTS VAR[,VAR]...
A list of environment variables which should be exported to remote processes during spawning. Some environment variables are exported by default: HOME, USER, SHELL, TERM, LD_LIBRARY_PATH, LD_PRELOAD, MPID_PSP_MAXSMALLMSG. In addition, all variables named PSP_*, __PSI_* or OMP_* are exported. Therefore, the variable OMP_NUM_THREADS is exported automatically.
Furthermore, PWD is set correctly for remote processes. In addition, the environment used for partitioning the cluster (i.e. PSI_NODES, PSI_HOSTFILE or PSI_HOSTS, and PSI_NODES_SORT) is propagated to remote processes.
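The export rules above can be sketched as a small predicate. This is only an illustration of the documented rules, not ParaStation code; the helper name is_exported and the way PSI_EXPORTS is parsed (comma-separated, surrounding whitespace ignored) are assumptions.

```python
import fnmatch

# Names and patterns copied from the PSI_EXPORTS description.
DEFAULT_EXPORTS = {"HOME", "USER", "SHELL", "TERM", "LD_LIBRARY_PATH",
                   "LD_PRELOAD", "MPID_PSP_MAXSMALLMSG"}
DEFAULT_PATTERNS = ("PSP_*", "__PSI_*", "OMP_*")
PARTITION_VARS = {"PSI_NODES", "PSI_HOSTS", "PSI_HOSTFILE", "PSI_NODES_SORT"}


def is_exported(name, psi_exports=""):
    """Return True if `name` reaches remote processes under the documented rules.

    `psi_exports` is the value of PSI_EXPORTS. Hypothetical helper,
    not part of ParaStation.
    """
    # Assumption: entries are comma-separated and may carry extra whitespace.
    extra = {v.strip() for v in psi_exports.split(",") if v.strip()}
    if name in DEFAULT_EXPORTS or name in PARTITION_VARS or name in extra:
        return True
    return any(fnmatch.fnmatchcase(name, pat) for pat in DEFAULT_PATTERNS)
```

For instance, a variable such as MY_VAR would only be exported when listed, e.g. PSI_EXPORTS=MY_VAR.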
PSI_NODES number[,number]...
Defines the nodes forming the partition on which new processes are spawned. Depending on the variable PSI_NODES_SORT, the ordering may be relevant. If the number of processes to spawn exceeds the number of nodes in the partition, some nodes may get more than one process.
See also PSI_HOSTS and PSI_HOSTFILE.
PSI_HOSTS hostname[ hostname]...
Space-separated list of hostnames on which new processes should be spawned. Similar to PSI_NODES, but with hostnames instead of logical ParaStation node numbers. If PSI_NODES is set too, it takes precedence over PSI_HOSTS.
See also PSI_NODES and PSI_HOSTFILE.
PSI_HOSTFILE filename
The name of a file containing a list of hostnames of the nodes which should be used for spawning. Similar to PSI_HOSTS, but the actual information is stored in the file instead of the environment variable. If PSI_NODES or PSI_HOSTS is set too, it takes precedence over PSI_HOSTFILE.
See also PSI_NODES and PSI_HOSTS.
PSI_NODES, PSI_HOSTS and PSI_HOSTFILE are evaluated in the given order. If more than one of these variables is set, only the first one will be used to create the partition. The others will be silently ignored.
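The evaluation order can be illustrated with a short sketch. The function name partition_source is hypothetical, and treating an empty value as "not set" is an assumption made for this example.

```python
def partition_source(env):
    """Return (variable, value) for the partition variable that wins.

    Checks PSI_NODES, PSI_HOSTS and PSI_HOSTFILE in that order and
    returns the first one that is set; later ones are ignored.
    Returns None if none of them is set.
    """
    for name in ("PSI_NODES", "PSI_HOSTS", "PSI_HOSTFILE"):
        value = env.get(name)
        if value:  # assumption: an empty string counts as unset
            return name, value
    return None
```

With both PSI_HOSTS and PSI_HOSTFILE set, only PSI_HOSTS would be used in this sketch.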
PSI_LOOP_NODES_FIRST
This variable controls the behavior of ParaStation when placing processes on nodes.
If PSI_LOOP_NODES_FIRST is not defined, ParaStation will first try to use all available CPUs on a node for the current job. If necessary, more processes will be placed on the next nodes.
If PSI_LOOP_NODES_FIRST is defined, ParaStation will place one process per node. If more processes than available nodes are requested, it will put an additional process on each node, and so on, until either all processes are placed or the placement cannot be fulfilled, e.g. because not enough CPUs are available.
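The two placement strategies can be compared with a simplified model. This sketch is not the ParaStation scheduler: it assumes a fixed list of free CPUs per node, ignores overbooking and sorting, and uses the hypothetical function name place.

```python
def place(nprocs, cpus_per_node, loop_nodes_first=False):
    """Return the node index chosen for each process, in placement order.

    cpus_per_node: free CPUs on each node of the partition.
    Raises ValueError if the request cannot be fulfilled.
    """
    free = list(cpus_per_node)
    if nprocs > sum(free):
        raise ValueError("not enough CPUs available")
    placement = []
    if not loop_nodes_first:
        # Fill each node completely before moving on to the next one.
        for node, cpus in enumerate(free):
            take = min(cpus, nprocs - len(placement))
            placement.extend([node] * take)
    else:
        # One process per node per pass, repeated until all are placed.
        while len(placement) < nprocs:
            for node in range(len(free)):
                if free[node] > 0 and len(placement) < nprocs:
                    free[node] -= 1
                    placement.append(node)
    return placement
```

For four processes on three nodes with two free CPUs each, the first strategy uses only two nodes, while the second spreads the processes over all three.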
PSI_NODES_SORT mode
This variable defines the sorting criterion used to reorder the nodes forming a virtual partition. This order will be used to spawn remote processes. The following values of mode are recognized:
ROUNDROBIN
No sorting of nodes before a spawn request. The nodes are used in round-robin fashion in the order given in PSI_NODES, PSI_HOSTS or PSI_HOSTFILE.
NONE
Same as ROUNDROBIN.
LOAD
The nodes are sorted by load before new processes are spawned, so nodes with the least load are used first. More specifically, the load average over the last minute is used as the sorting criterion, i.e. this option is equivalent to LOAD_1.
LOAD_1
The nodes are sorted by the 1 minute load average. This option is equivalent to LOAD.
LOAD_5
The nodes are sorted by the 5 minute load average.
LOAD_15
The nodes are sorted by the 15 minute load average.
PROC+LOAD
The nodes are sorted by the sum of the 1 minute load average and the number of running ParaStation processes. This leads to fair load balancing even if processes are started without notifying the ParaStation management facility.
PROC
The nodes are sorted by the number of running ParaStation processes before new processes are spawned. This is the default behavior.
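The sorting modes can be modeled as key functions over a node description. The Node class and its field names are hypothetical, and the tie-breaking behavior (Python's stable sort keeps the given order) is an assumption; the real daemon may resolve ties differently.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical snapshot of one node's state."""
    id: int
    load_1: float
    load_5: float
    load_15: float
    procs: int  # running ParaStation processes


SORT_KEYS = {
    "LOAD": lambda n: n.load_1,  # equivalent to LOAD_1
    "LOAD_1": lambda n: n.load_1,
    "LOAD_5": lambda n: n.load_5,
    "LOAD_15": lambda n: n.load_15,
    "PROC+LOAD": lambda n: n.procs + n.load_1,
    "PROC": lambda n: n.procs,
}


def sort_nodes(nodes, mode="PROC"):
    """Return the nodes in the order implied by a PSI_NODES_SORT mode."""
    if mode in ("ROUNDROBIN", "NONE"):
        return list(nodes)  # keep the order given by the partition
    return sorted(nodes, key=SORT_KEYS[mode])
```

An unknown mode raises KeyError in this sketch; how ParaStation reacts to an unrecognized value is not covered here.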
PSI_OVERBOOK
If defined, more processes than available CPUs may be placed on a node, if necessary. If undefined, a node receives only as many processes as it has unused CPUs (= number(CPU) - number(currently running processes)).
See also set maxproc of psiadmin(8), which takes precedence over PSI_OVERBOOK.
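The unused-CPU formula can be written out as a small helper. The function name free_slots is hypothetical, and the maxproc limit of psiadmin(8) is deliberately not modeled here.

```python
def free_slots(num_cpus, running, overbook=False):
    """Return how many more processes fit on a node.

    Without overbooking this is number(CPU) - number(running processes),
    never below zero. With overbooking the CPU count no longer limits
    placement, which this sketch signals by returning None.
    """
    if overbook:
        return None
    return max(0, num_cpus - running)
```

For example, a node with 8 CPUs and 3 running processes can take 5 more processes unless PSI_OVERBOOK is defined.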
PSI_TPP
Defines the number of cores allocated per process. If undefined, defaults to 1.
See also OMP_NUM_THREADS.
PSI_WAIT
If defined, a new job start request will be queued if not enough resources are currently available. See Chapter 3 and psmstart(1) for more details.
PSI_RARG_PRE_{n}
Preceding arguments for remote processes. For example, use PSI_RARG_PRE_0=/usr/bin/time to execute the process chain /usr/bin/time <yourApplication> <yourArgs> on the remote nodes.
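How the preceding arguments combine with the application can be sketched as follows. The assumptions here are that the indices n start at 0 and are consecutive, and that each variable contributes exactly one argument; the function name remote_command is hypothetical.

```python
def remote_command(env, app, args):
    """Build the argument vector executed on a remote node.

    Collects PSI_RARG_PRE_0, PSI_RARG_PRE_1, ... until the first
    missing index and places them in front of the application.
    """
    pre = []
    n = 0
    while f"PSI_RARG_PRE_{n}" in env:
        pre.append(env[f"PSI_RARG_PRE_{n}"])
        n += 1
    return pre + [app] + list(args)
```

With PSI_RARG_PRE_0=/usr/bin/time, the application and its arguments are simply appended after /usr/bin/time.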
PMI_BARRIER_ROUNDS
This parameter defines after how many PMI_BARRIER_TMOUT cycles a job will be terminated if not all processes have joined the PMI barrier. Defaults to 1.
The parameter should remain at the default value in production environments. Its primary use is diagnostic: it allows the user to observe slower clients joining a PMI barrier over multiple timeout periods. As such, the parameter helps administrators identify possible filesystem or network issues that occur on specific client nodes.
PMI barriers are totally unrelated to MPI barriers! This type of barrier is typically called during MPI_Init().
PMI_BARRIER_TMOUT
The PMI_BARRIER_TMOUT variable defines the delay (in seconds) allowed for each process to successfully join a PMI barrier. If not all processes have joined, a corresponding warning is printed to stdout.
If PMI_BARRIER_TMOUT is not set, the timeout will be 60sec + (# of processes * 0.5µsec).
If PMI_BARRIER_TMOUT equals -1, no barrier timeout is used and the job will not be terminated if a process fails to join the barrier.
If PMI_BARRIER_TMOUT is set to num, the timeout is set to num seconds.
See also ParaStation MPI Administrator's Guide.
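The three cases for PMI_BARRIER_TMOUT can be summarized in a short function. This is a sketch of the rules as stated in this section, not the actual implementation; the function name pmi_barrier_timeout and the use of None for "no timeout" are assumptions.

```python
def pmi_barrier_timeout(env, nprocs):
    """Return the PMI barrier timeout in seconds, or None for no timeout."""
    raw = env.get("PMI_BARRIER_TMOUT")
    if raw is None:
        # Default as documented: 60 sec plus 0.5 microseconds per process.
        return 60 + nprocs * 0.5e-6
    value = int(raw)
    if value == -1:
        return None  # barrier timeout disabled
    return value
```

With the documented default, even a job with one million processes would only add about half a second to the 60 second base timeout.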
__PSI_NO_PINPROC
If set, suppress pinning of processes, even if enabled globally (value irrelevant).
__PSI_NO_BINDMEM
If set, suppress binding to memory-node, even if enabled globally (value irrelevant).
These variables control the individual communication paths used by the pscom library. Communication paths may be different interconnects and/or protocols. In addition, tuning variables for the particular communication paths are listed.
The following table lists all currently available communication paths in descending order of priority. Using these variables, transports may be prioritized or completely disabled. Assigning a value of 0 to a variable completely disables the corresponding communication path. Assigning a value of 2 or more prioritizes the path over all others.
Table 3. Variables controlling the pscom communication paths
Variable name | Communication path | Description |
---|---|---|
PSP_SHM | Shared memory | Used only for communication within a node. Disabled otherwise. Identical to the deprecated variable PSP_SHAREDMEM. |
PSP_OPENIB | InfiniBand (libopenib) | |
PSP_OFED | InfiniBand (libopenib) | Using UD |
PSP_MVAPI | InfiniBand (libmvapi) | |
PSP_ELAN | QsNet | Disabled by default. |
PSP_DAPL | InfiniBand (libdapl) | |
PSP_GM | Myrinet (libgm) | |
PSP_P4S | ParaStation p4sock protocol | Identical to the deprecated variable PSP_P4SOCK. |
PSP_TCP | TCP | |
Not all transports may be available at run time due to missing hardware or low level libraries. Furthermore, not all transports are enabled within the precompiled packages.
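The effect of the values 0 and 2 or more on the path variables can be illustrated with a small classifier. The function name classify_paths and the state labels are hypothetical, and treating any other value (such as 1) as the built-in default is an assumption, since this section does not describe it.

```python
# Path variables in the order of Table 3.
PATHS = ["PSP_SHM", "PSP_OPENIB", "PSP_OFED", "PSP_MVAPI", "PSP_ELAN",
         "PSP_DAPL", "PSP_GM", "PSP_P4S", "PSP_TCP"]


def classify_paths(env):
    """Map each path variable to 'disabled', 'prioritized' or 'default'."""
    result = {}
    for name in PATHS:
        raw = env.get(name)
        if raw is None:
            state = "default"  # unset: built-in behavior applies
        elif int(raw) == 0:
            state = "disabled"
        elif int(raw) >= 2:
            state = "prioritized"
        else:
            state = "default"  # assumption: other values change nothing
        result[name] = state
    return result
```

For example, an environment with PSP_TCP=0 and PSP_OPENIB=2 would disable TCP and prefer InfiniBand over all other paths.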
PSP_LIB
Using this environment variable, it is possible to define the communication library to use, independent of the variables mentioned above. This library must match the currently available interconnect and protocol, otherwise an error will occur.
The library name must be specified using the full path and filename, e.g. PSP_LIB=/opt/parastation/lib64/libpsport4openib.so.
PSP_NETWORK network[,network]
A comma- or space-separated list of networks enabled for optimized ParaStation communication using the p4sock protocol or TCP.
Each network is a resolvable hostname in the chosen network, the IP address of a host in this network or the IP address of this network. The corresponding network has to be bound to a NIC of the current node.
If PSP_NETWORK is set, each network should be bound to a distinct NIC. This card is then used for communication operations. If more than one network is given, the first one found to be bound to a local NIC is used.
If PSP_NETWORK is not set, ParaStation uses the NIC bound to the IP address to which the local hostname resolves.
PSP_RETRY count
Retry counter for all connect() calls within the pscom library. Default is 3.
PSP_TCP_BACKLOG count
TCP listen() backlog length. Only required for pscom library versions below 5.0.34.
The actual backlog is the minimum of PSP_TCP_BACKLOG and net.core.somaxconn, defined by the operating system.
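The resulting backlog can be worked out as below. This is only a sketch: the function names are hypothetical, and reading the limit from /proc/sys/net/core/somaxconn assumes a Linux system.

```python
def read_somaxconn(path="/proc/sys/net/core/somaxconn"):
    """Read the kernel limit net.core.somaxconn (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())


def effective_backlog(psp_tcp_backlog, somaxconn):
    """Return the backlog actually applied to listen()."""
    return min(psp_tcp_backlog, somaxconn)
```

Requesting a backlog of 1024 on a system where net.core.somaxconn is 128 would therefore still yield a backlog of 128.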
Tuning Parameters
PSP_ONDEMAND
If set to 1, use "on demand" connections with PSP_OPENIB. This means that connections between ranks are established, and their associated communication buffers allocated, when the first byte is sent. This could cause application aborts at any time if the application runs out of resources (e.g. a final all-to-one communication pattern could fail)!
The default is to establish all connections at startup time (inside MPI_Init()), which ensures that there are enough resources available for all connections. If not, MPI_Init() will fail.
PSP_SO_SNDBUF, PSP_SO_RCVBUF
These variables define the TCP buffer size used for TCP sockets. Defaults to 32k.
PSP_TCP_NODELAY
If set to 1 (default), the socket option TCP_NODELAY will be used for TCP sockets.
PSP_TCP_BACKLOG
Controls the size of the TCP backlog when listening for new connections.
PSP_SCHED_YIELD
If set to 1, call sched_yield() in polling loops instead of busy polling. This might improve shared memory performance considerably when more than one process per CPU core is running, but it slows down communication in the common case of one process per core. See also PSI_OVERBOOK.
PSP_OPENIB_PATH_MTU
Controls the path MTU of InfiniBand connections. Default is 3, which corresponds to 1024 bytes (1 = 256 bytes, 2 = 512 bytes, 3 = 1024 bytes).
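The mapping from the variable's value to the MTU size can be expressed as a lookup. Only the three values documented here are included; the function name path_mtu_bytes is hypothetical, and any other value raises KeyError in this sketch.

```python
# Values as documented for PSP_OPENIB_PATH_MTU.
PATH_MTU_BYTES = {1: 256, 2: 512, 3: 1024}


def path_mtu_bytes(env):
    """Return the InfiniBand path MTU in bytes for the given environment."""
    value = int(env.get("PSP_OPENIB_PATH_MTU", "3"))  # default is 3
    return PATH_MTU_BYTES[value]
```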
PSP_OPENIB_SENDQ_SIZE, PSP_OPENIB_RECVQ_SIZE
These variables define the number of buffers used for InfiniBand connections. Default is 16.
To modify the behavior of the logger and the forwarders controlling the remotely spawned processes, the following environment variables can be used:
PSI_INPUTDEST rank
If set, psilogger will forward all input to the process with the corresponding rank within the process group. The default is to give all available input to process 0.
PSI_RUSAGE
If set, psilogger will print a message about the user and system time consumed by each process of the parallel task upon exit of this process.
PSI_SOURCEPRINTF
If set, psilogger gives information about the source of the received output, i.e. it will prepend every output by “[id]:”, where id is the rank of the printing process within the process group. Usually the id coincides with the MPI rank. If PSI_LOGGERDEBUG is also set, every output is prepended by “[id, len]”, where id is the rank again and len is the length of the printed message in bytes.
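The two prefix formats can be illustrated as follows. This sketch only mirrors the formats quoted in the text; the function name prefix_output is hypothetical, and the absence of a separator between prefix and message, as well as counting the length over the UTF-8 encoding, are assumptions.

```python
def prefix_output(rank, message, env):
    """Return `message` with the source prefix psilogger would prepend."""
    if "PSI_SOURCEPRINTF" not in env:
        return message
    if "PSI_LOGGERDEBUG" in env:
        # Length in bytes; assumption: measured on the UTF-8 encoding.
        return f"[{rank}, {len(message.encode())}]{message}"
    return f"[{rank}]:{message}"
```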
PSI_NOMSGLOGGERDONE
If set, psilogger will not print out the message “PSIlogger: done” at the end of a parallel run.
PSI_LOGGERDEBUG
If set, psilogger gives debug output about connecting and detaching clients as well as received output from the clients.
PSI_FORWARDERDEBUG
If set, debug output of the psiforwarder about connected programs, received input and received output is printed.
The environment variables within this section might be used less frequently. They are mainly listed within this document for completeness.
MPID_PSP_MAXSMALLMSG
Length (in bytes) of the largest message sent without rendezvous.
MPID_PSP_START
Define the method used in order to spawn remote processes. The possible values are:
PSID
Start remote processes with the ParaStation start mechanism. This is the default. If MPID_PSP_START is not set at all, ParaStation is used to spawn remote processes.
SSH
Start remote processes with ssh(1). MPID_PSP_HOSTS must be set.
NONE
Do not start any remote process. The remote processes must be started manually. A commandline template is printed to stdout.
This start mode is for debugging purposes only and should not be used by the end-user.
MPID_PSP_HOSTS hostname[,hostname]...
Comma-separated list of hostnames. Used for MPID_PSP_START=SSH only.
The environment variables within this section control the TCP bypass.
LD_PRELOAD
Defines (among others) the path to the preload library required to enable the TCP bypass. It must be set to /opt/parastation/lib64/libp4tcp.so.
The environment variables within this section control the debug information output by ParaStation.
PSI_DEBUGMASK
Defines the debug mask controlling the process management information. The following bits are defined:
Table 4. PSI_DEBUGMASK flags
Bit pattern | Name | Description |
---|---|---|
0x0001 | PSC_LOG_PART | partitioning functions (i.e. PSpart_()) |
0x0002 | PSC_LOG_TASK | task structure handling (i.e. PStask_()) |
0x0004 | PSC_LOG_VERB | Various, less interesting messages |
0x0010 | PSI_LOG_PART | partition handling |
0x0020 | PSI_LOG_SPAWN | spawning |
0x0040 | PSI_LOG_INFO | info requests |
0x0080 | PSI_LOG_COMM | daemon communication |
0x0100 | PSI_LOG_VERB | more verbose stuff, e.g. function calls |
These debug flags may be set as hex numbers, e.g. PSI_DEBUGMASK=0x07.
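Composing and decoding a mask from the flags in Table 4 can be sketched like this. The bit values are taken from the table; the helper names debug_mask and decode_mask are hypothetical.

```python
# Bit values from Table 4.
PSI_DEBUG_FLAGS = {
    "PSC_LOG_PART": 0x0001,
    "PSC_LOG_TASK": 0x0002,
    "PSC_LOG_VERB": 0x0004,
    "PSI_LOG_PART": 0x0010,
    "PSI_LOG_SPAWN": 0x0020,
    "PSI_LOG_INFO": 0x0040,
    "PSI_LOG_COMM": 0x0080,
    "PSI_LOG_VERB": 0x0100,
}


def debug_mask(*names):
    """Combine flag names into a hex string usable as PSI_DEBUGMASK."""
    mask = 0
    for name in names:
        mask |= PSI_DEBUG_FLAGS[name]
    return hex(mask)


def decode_mask(value):
    """Return the flag names contained in a hex mask string."""
    mask = int(value, 16)
    return [name for name, bit in PSI_DEBUG_FLAGS.items() if mask & bit]
```

The example value 0x07 thus selects the three PSC_LOG_* flags.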
PSP_DEBUG
Defines the debugging level for the ParaStation psport4 library. Higher values generally give more output.