ps_environment — ParaStation environment variables
The behavior of the ParaStation system when starting parallel tasks with mpiexec(1) or submitting serial jobs with psmstart(1) can be influenced by a number of environment variables.
Further variables modify the behavior of the logging facilities, which implement reliable forwarding of input and output.
The last section describes some less frequently used environment variables which affect the behavior of the MPIch system implementing the MPI interface on top of ParaStation.
The following environment variables are used during startup of parallel tasks or while distributing serial jobs throughout a cluster. Their values determine how the cluster is split into virtual partitions and which load-balancing strategy is applied.
OMP_NUM_THREADS
Defines the number of cores allocated for each process.
May be overridden by PSI_TPP.
PSI_EXCLUSIVE
Only unused nodes are considered for spawning new processes. In addition, the nodes chosen for the current job are locked against further jobs, so no additional processes will be started on these nodes until the current job terminates.
This variable does not define how many processes of a job will be placed per node. See also PSI_OVERBOOK and set maxproc of psiadmin(8).
PSI_EXPORTS VAR [, VAR]...
A list of environment variables which should be exported to remote processes during spawning. Some environment variables are exported by default:
HOME,
USER,
SHELL,
TERM,
LD_LIBRARY_PATH,
LD_PRELOAD,
MPID_PSP_MAXSMALLMSG.
In addition, all variables named PSP_*, __PSI_* or OMP_* are exported. Therefore, the variable OMP_NUM_THREADS is exported automatically.
Furthermore, PWD is set correctly for remote processes. The environment used for partitioning the cluster (i.e. PSI_NODES, PSI_HOSTFILE or PSI_HOSTS, and PSI_NODES_SORT) is also propagated to remote processes.
PSI_NODES number [, number]...
Defines the nodes building the partition used to spawn new processes. Depending on the variable PSI_NODES_SORT, the ordering may be relevant. If the number of processes to spawn exceeds the number of nodes in the partition, some nodes may get more than one process.
See also PSI_HOSTS and PSI_HOSTFILE.
PSI_HOSTS hostname [ hostname]...
Space-separated list of hostnames on which new processes should be spawned. Similar to PSI_NODES, but with hostnames instead of logical ParaStation node numbers. If PSI_NODES is also set, it takes precedence over PSI_HOSTS.
See also PSI_NODES and PSI_HOSTFILE.
PSI_HOSTFILE filename
The name of a file containing a list of hostnames of the nodes which should be used for spawning. Similar to PSI_HOSTS, but the actual information is stored in the file instead of the environment variable. If PSI_NODES or PSI_HOSTS is also set, it takes precedence over PSI_HOSTFILE.
See also PSI_NODES and PSI_HOSTS.
PSI_NODES, PSI_HOSTS and
PSI_HOSTFILE are evaluated in the given order. If
more than one of the discussed variables is set, only the first
one will be used in order to create the partition. The latter
ones will be silently ignored.
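The evaluation order described above can be illustrated with a small sketch. This is not ParaStation code; the function name `partition_source` is hypothetical, and the sketch assumes that a variable counts as "set" as soon as it is present in the environment, whatever its value.

```python
# Illustrative sketch only; partition_source is a hypothetical helper.
PRECEDENCE = ("PSI_NODES", "PSI_HOSTS", "PSI_HOSTFILE")

def partition_source(env):
    """Return the name of the variable that would define the partition.

    env is a mapping such as os.environ. Assumption: mere presence of a
    variable counts as "set". Returns None if none of the three is present.
    """
    for name in PRECEDENCE:
        if name in env:
            return name  # later variables are silently ignored
    return None

example = {"PSI_HOSTS": "node01 node02", "PSI_HOSTFILE": "/tmp/hosts"}
print(partition_source(example))
```

In this example PSI_HOSTS is selected and PSI_HOSTFILE is ignored, because PSI_HOSTS comes earlier in the evaluation order.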
PSI_LOOP_NODES_FIRST
This variable controls the behavior of ParaStation when placing
processes on nodes.
If PSI_LOOP_NODES_FIRST is not defined, ParaStation first tries to use all available CPUs on a node for the current job. If necessary, further processes are placed on the next nodes.
If PSI_LOOP_NODES_FIRST is defined, ParaStation places one process per node. If more processes than available nodes are requested, it puts an additional process on each node in turn until all processes are placed or the placement cannot be fulfilled, e.g. because not enough CPUs are available.
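The two placement strategies can be compared with a simplified model. The sketch below is not the ParaStation scheduler: the function `place` and its parameters are hypothetical, overbooking is assumed to be disabled, and each node is described only by its number of free CPUs.

```python
# Simplified model of the two strategies; not ParaStation's implementation.
def place(num_procs, cpus_per_node, loop_nodes_first=False):
    """Return the node index chosen for each process, in rank order.

    cpus_per_node lists the free CPUs of each node in partition order.
    Assumption: no overbooking, so a request exceeding the total number
    of free CPUs cannot be fulfilled and raises ValueError.
    """
    if num_procs > sum(cpus_per_node):
        raise ValueError("not enough CPUs available")
    free = list(cpus_per_node)
    placement = []
    if not loop_nodes_first:
        # Use all CPUs of a node before moving on to the next node.
        for node, cpus in enumerate(free):
            take = min(cpus, num_procs - len(placement))
            placement.extend([node] * take)
    else:
        # One process per node and pass; repeat until all are placed.
        while len(placement) < num_procs:
            for node in range(len(free)):
                if free[node] > 0 and len(placement) < num_procs:
                    free[node] -= 1
                    placement.append(node)
    return placement

print(place(4, [2, 2, 2]))                         # fill nodes first
print(place(4, [2, 2, 2], loop_nodes_first=True))  # loop over nodes first
```

With three nodes of two CPUs each, the first call packs the four processes onto the first two nodes, while the second call spreads them over all three nodes before returning to the first.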
PSI_NODES_SORT mode
This variable defines the sorting criterion used to reorder the nodes building a virtual partition. This order will be used to spawn remote processes. The following values of mode are recognized:
ROUNDROBIN
No sorting of nodes before a spawn request. The nodes are used in round-robin fashion in the order in which they are given in PSI_NODES, PSI_HOSTS or PSI_HOSTFILE.
NONE
Same as ROUNDROBIN.
LOAD
The nodes are sorted by load before new processes are spawned, so nodes with the least load are used first. More specifically, the load average over the last minute is used as the sorting criterion, i.e. this option is equivalent to LOAD_1.
LOAD_1
The nodes are sorted by the 1 minute load average. This option is equivalent to LOAD.
LOAD_5
The nodes are sorted by the 5 minute load average.
LOAD_15
The nodes are sorted by the 15 minute load average.
PROC+LOAD
The nodes are sorted by the sum of the 1 minute load average and the number of running ParaStation processes. This leads to fair load balancing even if processes are started without notification to the ParaStation management facility.
PROC
The nodes are sorted by the number of running ParaStation processes before new processes are spawned. This is the default behavior.
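The sorting modes listed above can be mimicked with a short sketch. The node records and their field names (`load1`, `load5`, `load15`, `procs`) are hypothetical and only serve the illustration; how ParaStation breaks ties is not stated here, so the sketch simply keeps the given order for equal keys.

```python
# Illustrative model of the PSI_NODES_SORT modes; field names are made up.
SORT_KEYS = {
    "LOAD": lambda n: n["load1"],
    "LOAD_1": lambda n: n["load1"],
    "LOAD_5": lambda n: n["load5"],
    "LOAD_15": lambda n: n["load15"],
    "PROC": lambda n: n["procs"],
    "PROC+LOAD": lambda n: n["procs"] + n["load1"],
}

def sort_nodes(nodes, mode="PROC"):
    """Return the nodes in the order used for spawning."""
    if mode in ("NONE", "ROUNDROBIN"):
        return list(nodes)  # keep the order given by the user
    # sorted() is stable, so nodes with equal keys keep their given order.
    return sorted(nodes, key=SORT_KEYS[mode])

nodes = [
    {"name": "node01", "load1": 0.2, "load5": 1.5, "load15": 0.9, "procs": 4},
    {"name": "node02", "load1": 3.0, "load5": 0.5, "load15": 0.4, "procs": 0},
    {"name": "node03", "load1": 1.0, "load5": 1.0, "load15": 1.0, "procs": 1},
]
print([n["name"] for n in sort_nodes(nodes, "LOAD")])
print([n["name"] for n in sort_nodes(nodes, "PROC")])
```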
PSI_OVERBOOK
If defined, more processes than available CPUs will be placed on a node, if necessary. If undefined, only as many processes will be placed on a node as unused CPUs (= number(CPU) - number(currently running processes)) are available.
See also set maxproc of psiadmin(8), which takes precedence over PSI_OVERBOOK.
PSI_TPP
Defines the number of cores allocated per process. If undefined, defaults to 1.
See also OMP_NUM_THREADS.
PSI_WAIT
If defined, a new job start request will be queued if not enough resources are currently available. See Chapter 3 and psmstart(1) for more details.
PSI_RARG_PRE_{n}
Preceding arguments for remote processes. For example, use PSI_RARG_PRE_0=/usr/bin/time to execute the process chain /usr/bin/time <yourApplication> <yourArgs> on the remote nodes.
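The effect of the preceding arguments can be sketched as follows. Only the PSI_RARG_PRE_0 case is taken from the description above. That several variables are numbered consecutively from 0 and prepended in ascending order is an assumption of this sketch, and `remote_command` is a hypothetical helper.

```python
# Sketch under assumptions; remote_command is a hypothetical helper.
def remote_command(env, application, args):
    """Build the argument vector executed on the remote node.

    Assumption: PSI_RARG_PRE_0, PSI_RARG_PRE_1, ... are numbered without
    gaps and are prepended in ascending order of n.
    """
    pre = []
    n = 0
    while f"PSI_RARG_PRE_{n}" in env:
        pre.append(env[f"PSI_RARG_PRE_{n}"])
        n += 1
    return pre + [application] + list(args)

env = {"PSI_RARG_PRE_0": "/usr/bin/time"}
print(remote_command(env, "./yourApplication", ["yourArgs"]))
```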
PMI_BARRIER_ROUNDS
This parameter defines after how many
PMI_BARRIER_TMOUT cycles a job will be
terminated, if not all processes have joined the PMI
barrier.
Defaults to 1.
The parameter should remain at the default value in production environments. Its primary use is for diagnostic purposes, as it allows the user to observe slower clients joining a PMI barrier over multiple timeout periods. As such, the parameter helps administrators identify possible filesystem or network issues that occur on specific client nodes.
PMI barriers are entirely unrelated to MPI barriers. This type of barrier is typically called during MPI_INIT().
PMI_BARRIER_TMOUT
The PMI_BARRIER_TMOUT variable defines the delay (in seconds) allowed for each process to successfully join a PMI barrier. If not all processes have joined in time, a corresponding warning is printed to stdout.
If PMI_BARRIER_TMOUT is not set, the
timeout will be 60sec + (# of processes * 0.5µsec).
If PMI_BARRIER_TMOUT equals
-1, no barrier timeout is used and
the job will not terminate because of failure to join the
barrier from any one process.
If PMI_BARRIER_TMOUT is set to
num, then the timeout is set to
num seconds.
See also ParaStation MPI Administrator's Guide.
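The three cases for PMI_BARRIER_TMOUT can be summarized in a short sketch. The function name `barrier_timeout` is hypothetical, and the default formula is taken literally from the text above (60 seconds plus 0.5 microseconds per process).

```python
# Sketch of the rules stated above; barrier_timeout is a hypothetical name.
def barrier_timeout(num_procs, tmout=None):
    """Return the barrier timeout in seconds, or None for "no timeout".

    tmout is the value of PMI_BARRIER_TMOUT as a string, or None if the
    variable is not set.
    """
    if tmout is None:
        # Default: 60 sec + (# of processes * 0.5 microseconds)
        return 60 + num_procs * 0.5e-6
    value = int(tmout)
    if value == -1:
        return None  # barrier timeout disabled
    return float(value)

print(barrier_timeout(1024))
print(barrier_timeout(1024, "-1"))
print(barrier_timeout(1024, "120"))
```

With PMI_BARRIER_ROUNDS left at its default of 1, a job would be terminated after a single such timeout period.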
__PSI_NO_PINPROC
If set, suppresses pinning of processes, even if enabled globally (value irrelevant).
__PSI_NO_BINDMEM
If set, suppresses binding to memory-node, even if enabled globally (value irrelevant).
These variables control the individual communication paths used by the pscom library. Communication paths may be different interconnects and/or protocols. In addition, tuning variables for the particular communication paths are listed.
The following table lists all currently available communication paths in descending order.
Using these variables, transports may be prioritized or completely disabled. Assigning a value of 0 to a variable completely disables the corresponding communication path. Assigning a value of 2 or more prioritizes the path over all others.
Table 3. Variables controlling the pscom communication paths
| Variable name | Communication path | Description |
|---|---|---|
| PSP_SHM | Shared memory | Used only for communication within a node. Disabled otherwise. Identical to the deprecated variable PSP_SHAREDMEM. |
| PSP_OPENIB | InfiniBand (libopenib) | |
| PSP_OFED | InfiniBand (libopenib) | Using UD |
| PSP_MVAPI | InfiniBand (libmvapi) | |
| PSP_ELAN | QsNet | Disabled by default. |
| PSP_DAPL | InfiniBand (libdapl) | |
| PSP_GM | Myrinet (libgm) | |
| PSP_P4S | ParaStation p4sock protocol | Identical to the deprecated variable PSP_P4SOCK. |
| PSP_TCP | TCP | |
Not all transports may be available at run time due to missing hardware or low level libraries. Furthermore, not all transports are enabled within the precompiled packages.
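As an illustration of the value convention described above (0 disables a path, 2 or more prioritizes it), the following sketch prepares an environment for a job launcher. The helper `pscom_env` is hypothetical; only the variable names and the meaning of the values come from this page.

```python
# Illustrative helper; pscom_env is not part of ParaStation.
import os

def pscom_env(disable=(), prefer=None, base=None):
    """Return a copy of the environment with pscom path settings applied.

    disable: names of PSP_* path variables to switch off (value 0).
    prefer:  name of one PSP_* path variable to prioritize (value 2).
    base:    environment to start from; defaults to os.environ.
    """
    env = dict(os.environ if base is None else base)
    for name in disable:
        env[name] = "0"
    if prefer is not None:
        env[prefer] = "2"
    return env

env = pscom_env(disable=["PSP_P4S"], prefer="PSP_TCP", base={})
print(env)
```

The resulting dictionary could then be passed as the `env` argument of `subprocess.run()` when starting the job launcher.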
PSP_LIB
Using this environment variable, it is possible to define the communication library to use, independent of the variables mentioned above. This library must match the currently available interconnect and protocol, otherwise an error will occur.
The library name must be specified using the full path and filename, e.g. PSP_LIB=/opt/parastation/lib64/libpsport4openib.so.
PSP_NETWORK network [, network]
A comma- or space-separated list of networks enabled for optimized ParaStation communication using the p4sock protocol or TCP. Each network is a resolvable hostname in the chosen network, the IP address of a host in this network, or the IP address of the network itself. The corresponding network has to be bound to a NIC of the current node.
If PSP_NETWORK is set, each network should be bound to a distinct NIC. This card is then used for communication operations. If more than one network is given, the first one found to be bound to a local NIC is used.
If PSP_NETWORK is not set, ParaStation uses the NIC bound to the IP address that the local hostname resolves to.
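The selection rule for PSP_NETWORK ("the first network found to be bound to a local NIC") can be sketched with the Python standard library. This is a simplified model, not the pscom implementation: `pick_nic` is a hypothetical helper, the local NICs are passed in explicitly as address/prefix strings, and hostname entries are not handled because resolving them would require a name lookup.

```python
# Simplified model of the selection rule; pick_nic is a hypothetical helper.
import ipaddress

def pick_nic(psp_network, local_interfaces):
    """Return the local IP address of the NIC chosen for communication.

    psp_network:      comma- or space-separated list of IP addresses.
    local_interfaces: local NIC addresses with prefix, e.g. "192.168.1.5/24".
    Returns None if no entry belongs to a network bound to a local NIC.
    """
    nics = [ipaddress.ip_interface(i) for i in local_interfaces]
    for entry in psp_network.replace(",", " ").split():
        addr = ipaddress.ip_address(entry)
        for nic in nics:
            # Matches a host address in the network or the network address.
            if addr in nic.network:
                return str(nic.ip)
    return None

print(pick_nic("10.0.0.0, 192.168.1.0", ["192.168.1.5/24", "127.0.0.1/8"]))
```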
PSP_RETRY count
Retry counter for all connect()
calls within the pscom library.
Default is 3.
PSP_TCP_BACKLOG count
TCP listen() backlog length. Only required for pscom library versions below 5.0.34. The actual backlog is the minimum of PSP_TCP_BACKLOG and net.core.somaxconn, defined by the operating system.
Tuning Parameters
PSP_ONDEMAND
If set to 1, use "on demand" connections with PSP_OPENIB. This means that connections between ranks are established, and their associated communication buffers allocated, when the first byte is sent. This could cause application aborts at any time if the application runs out of resources (e.g. a final all-to-one communication pattern could fail)!
By default, all connections are established at startup time (inside MPI_Init()), which ensures that there are enough resources available for all connections. If not, MPI_Init() will fail.
PSP_SO_SNDBUF, PSP_SO_RCVBUF
These variables define the buffer size used for TCP sockets. Defaults to 32k.
PSP_TCP_NODELAY
If set to 1 (default), the socket option
NODELAY will be used for TCP sockets.
PSP_TCP_BACKLOG
Controls the size of the TCP backlog when listening for new connections.
PSP_SCHED_YIELD
If set to 1, call sched_yield() in polling loops instead of busy polling. This might improve shared memory performance considerably when more than one process per CPU core is running, but slows down communication performance in the common case of one process per core (see also overbooking).
PSP_OPENIB_PATH_MTU
Controls the path MTU of InfiniBand connections. Default is 3, which corresponds to 1024 bytes (1 = 256 bytes, 2 = 512 bytes, 3 = 1024 bytes).
PSP_OPENIB_SENDQ_SIZE, PSP_OPENIB_RECVQ_SIZE
These variables define the buffer counts used for InfiniBand connections. Defaults to 16.
In order to modify the behavior of the logger and the forwarders controlling the remotely spawned processes, the following environment variables can be used:
PSI_INPUTDEST rank
If set, psilogger will forward all input to the process with the corresponding rank within the process group. The default is to give all available input to process 0.
PSI_RUSAGE
If set, psilogger will print a message about the user and system time consumed by each process of the parallel task when that process exits.
PSI_SOURCEPRINTF
If set, psilogger gives information about the source of the received output, i.e. it will prefix every output with “[id]:”, where id is the rank of the printing process within the process group. Usually the id coincides with the MPI rank. If PSI_LOGGERDEBUG is also set, every output is prefixed with “[id, len]”, where id is again the rank and len is the length of the printed message in bytes.
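The two prefix formats can be illustrated with a short sketch. It only mirrors the formats quoted above; `tag_output` is a hypothetical function, and whether psilogger inserts any whitespace between prefix and output is not stated here, so none is added.

```python
# Illustration of the prefix formats only; tag_output is hypothetical.
def tag_output(rank, data, logger_debug=False):
    """Prefix a chunk of output (bytes) with its source information."""
    if logger_debug:
        # Format used when PSI_LOGGERDEBUG is set as well.
        prefix = f"[{rank}, {len(data)}]"
    else:
        prefix = f"[{rank}]:"
    return prefix.encode() + data

print(tag_output(3, b"hello\n"))
print(tag_output(3, b"hello\n", logger_debug=True))
```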
PSI_NOMSGLOGGERDONE
If set, psilogger will not print the message “PSIlogger: done” at the end of a parallel run.
PSI_LOGGERDEBUG
If set, psilogger gives debug output about connecting and detaching clients as well as received output from the clients.
PSI_FORWARDERDEBUG
If set, debug output of the psiforwarder about connected programs, received input and received output is printed.
The environment variables within this section might be used less frequently. They are mainly listed within this document for completeness.
MPID_PSP_MAXSMALLMSG
Length (in bytes) of the largest message sent without rendezvous.
MPID_PSP_START
Defines the method used to spawn remote processes. The possible values are:
PSID
Start remote processes with the ParaStation start mechanism. This is the default. If MPID_PSP_START is not set at all, ParaStation is used to spawn remote processes.
SSH
Start remote processes with ssh(1). MPID_PSP_HOSTS must be set.
NONE
Do not start any remote process. The remote processes must be started manually. A command line template is printed to stdout. This start mode is for debugging purposes only and should not be used by the end user.
MPID_PSP_HOSTS hostname [, hostname]...
Comma-separated list of hostnames. Used for MPID_PSP_START=SSH only.
The environment variables within this section control the TCP bypass.
LD_PRELOAD
Defines (among others) the path to the preload library required to enable the TCP bypass. It must be set to /opt/parastation/lib64/libp4tcp.so.
The environment variables within this section control the debug information output by ParaStation.
PSI_DEBUGMASK
Defines the debug mask controlling the process management information. The following bits are defined:
Table 4. PSI_DEBUGMASK flags
| Bit pattern | Name | Description |
|---|---|---|
| 0x0001 | PSC_LOG_PART | partitioning functions (i.e. PSpart_()) |
| 0x0002 | PSC_LOG_TASK | task structure handling (i.e. PStask_()) |
| 0x0004 | PSC_LOG_VERB | Various, less interesting messages |
| 0x0010 | PSI_LOG_PART | partition handling |
| 0x0020 | PSI_LOG_SPAWN | spawning |
| 0x0040 | PSI_LOG_INFO | info requests |
| 0x0080 | PSI_LOG_COMM | daemon communication |
| 0x0100 | PSI_LOG_VERB | more verbose stuff, e.g. function calls |
These debug flags may be set as hex numbers, e.g.
PSI_DEBUGMASK=0x07.
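Since the mask is a bitwise OR of the flags in Table 4, a value can be composed or decoded mechanically. The following sketch uses Python's `enum.IntFlag`; the bit values are copied from the table, while the class and function names are hypothetical and not part of ParaStation.

```python
# Helper sketch for composing PSI_DEBUGMASK values; names are illustrative.
from enum import IntFlag

class DebugMask(IntFlag):
    """Bit values copied from Table 4."""
    PSC_LOG_PART = 0x0001
    PSC_LOG_TASK = 0x0002
    PSC_LOG_VERB = 0x0004
    PSI_LOG_PART = 0x0010
    PSI_LOG_SPAWN = 0x0020
    PSI_LOG_INFO = 0x0040
    PSI_LOG_COMM = 0x0080
    PSI_LOG_VERB = 0x0100

def mask_value(mask):
    """Format a combination of flags as a hex string for PSI_DEBUGMASK."""
    return f"0x{int(mask):04x}"

def mask_names(value):
    """List the names of all flags set in a numeric mask."""
    return [flag.name for flag in DebugMask if flag & value]

print(mask_value(DebugMask.PSI_LOG_PART | DebugMask.PSI_LOG_SPAWN))
print(mask_names(0x07))
```

Decoding the example value 0x07 from above yields the three PSC_LOG_* flags.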
PSP_DEBUG
Defines the debugging level for the ParaStation psport4 library. Higher values generally give more output.