Name

ps_environment — ParaStation environment variables

Description

The behavior of the ParaStation system when starting parallel tasks using mpiexec(1) or submitting serial jobs using psmstart(1) can be influenced by a number of environment variables.

Further variables may be used in order to modify the behavior of the logging facilities implementing a reliable forwarding of input and output.

A further section describes some less frequently used environment variables that affect the behavior of the MPIch system, which implements the MPI interface on top of ParaStation.

Variables managing the program startup

The following environment variables are used during startup of parallel tasks or while distributing serial jobs throughout a cluster. Their values determine how the cluster is split into virtual partitions and which load balancing strategy is used.

OMP_NUM_THREADS

Defines the number of cores allocated for each process. May be overridden by PSI_TPP.

PSI_EXCLUSIVE

Only unused nodes are considered for spawning new processes. In addition, the nodes chosen for the current job are locked against further jobs, so no additional processes will be started on these nodes until the current job terminates.

This variable does not define how many processes of a job will be placed per node. See also PSI_OVERBOOK and set maxproc of psiadmin(8).

PSI_EXPORTS VAR [, VAR]...

A list of environment variables which should be exported to remote processes during spawning. Some environment variables are exported by default: HOME, USER, SHELL, TERM, LD_LIBRARY_PATH, LD_PRELOAD, MPID_PSP_MAXSMALLMSG. In addition, all variables named PSP_*, __PSI_* or OMP_* are exported. Therefore, the variable OMP_NUM_THREADS is exported automatically.

Furthermore PWD is set correctly for remote processes. In addition, the environment used for partitioning the cluster (i.e. PSI_NODES, PSI_HOSTFILE or PSI_HOSTS and PSI_NODES_SORT) is propagated to remote processes.
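The export rules above can be modelled in a few lines of Python. This is only an illustrative sketch of the documented rules, not ParaStation code; the helper name `is_exported` is hypothetical, and treating PSI_EXPORTS as a comma-separated string is an assumption based on the synopsis.

```python
from fnmatch import fnmatchcase

# Variables the text lists as exported or propagated by default.
DEFAULT_EXPORTS = [
    "HOME", "USER", "SHELL", "TERM", "LD_LIBRARY_PATH", "LD_PRELOAD",
    "MPID_PSP_MAXSMALLMSG",
    # Partitioning environment, propagated to remote processes.
    "PSI_NODES", "PSI_HOSTFILE", "PSI_HOSTS", "PSI_NODES_SORT",
]
# Name patterns the text lists as always exported.
DEFAULT_PATTERNS = ["PSP_*", "__PSI_*", "OMP_*"]


def is_exported(name, psi_exports=""):
    """Return True if `name` would reach remote processes (hypothetical model).

    psi_exports: value of PSI_EXPORTS, assumed to be a comma-separated list.
    """
    extra = [v.strip() for v in psi_exports.split(",") if v.strip()]
    if name in DEFAULT_EXPORTS or name in extra:
        return True
    return any(fnmatchcase(name, pattern) for pattern in DEFAULT_PATTERNS)


print(is_exported("OMP_NUM_THREADS"))           # matched by OMP_*
print(is_exported("MY_VAR"))                    # not exported by default
print(is_exported("MY_VAR", "FOO, MY_VAR"))     # listed in PSI_EXPORTS
```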

PSI_NODES number [, number]...

Defines the nodes that form the partition in which new processes are spawned. Depending on the variable PSI_NODES_SORT, the ordering may be relevant. If the number of processes to spawn exceeds the number of nodes in the partition, some nodes may get more than one process.

See also PSI_HOSTS and PSI_HOSTFILE.

PSI_HOSTS hostname [ hostname]...

Space-separated list of hostnames on which new processes should be spawned. Similar to PSI_NODES, but with hostnames instead of logical ParaStation node numbers. If PSI_NODES is also set, it takes precedence over PSI_HOSTS.

See also PSI_NODES and PSI_HOSTFILE.

PSI_HOSTFILE filename

The name of a file containing a list of hostnames of the nodes which should be used for spawning. Similar to PSI_HOSTS, but the actual information is within the file instead of the environment variable. If PSI_NODES or PSI_HOSTS is also set, it takes precedence over PSI_HOSTFILE.

See also PSI_NODES and PSI_HOSTS.

Note

PSI_NODES, PSI_HOSTS and PSI_HOSTFILE are evaluated in the given order. If more than one of the discussed variables is set, only the first one will be used in order to create the partition. The latter ones will be silently ignored.
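The evaluation order described in this note can be sketched as follows. The function `select_partition_source` is a hypothetical helper written for illustration, and it assumes that a variable counts as "set" as soon as it is present in the environment.

```python
def select_partition_source(env):
    """Return (name, value) of the variable that defines the partition.

    Models the documented order PSI_NODES, PSI_HOSTS, PSI_HOSTFILE.
    Later variables are ignored once an earlier one is found.
    Returns None if none of the three is set.
    """
    for name in ("PSI_NODES", "PSI_HOSTS", "PSI_HOSTFILE"):
        if name in env:
            return name, env[name]
    return None


# PSI_HOSTFILE is silently ignored because PSI_HOSTS is also set.
example_env = {"PSI_HOSTS": "node01 node02", "PSI_HOSTFILE": "/tmp/hosts"}
print(select_partition_source(example_env))
```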

PSI_LOOP_NODES_FIRST

This variable controls the behavior of ParaStation when placing processes on nodes. If PSI_LOOP_NODES_FIRST is not defined, ParaStation will first try to use all available CPUs on a node for the current job. If necessary, more processes will be placed on the next nodes. If PSI_LOOP_NODES_FIRST is defined, ParaStation will place one process per node. If more processes than available nodes are requested, it will then put an additional process on each node, and so on, until all processes are placed or the placement cannot be fulfilled, e.g. because not enough CPUs are available.
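The difference between the two placement strategies may be easier to see in a small model. The following Python sketch is not ParaStation's placement algorithm; the function `place` and its inputs are hypothetical, and overbooking is deliberately left out.

```python
def place(num_procs, cpus_per_node, loop_nodes_first=False):
    """Return a list holding the node index chosen for each process.

    cpus_per_node: free CPUs per node (hypothetical input).
    Raises ValueError if the request cannot be fulfilled.
    """
    if num_procs > sum(cpus_per_node):
        raise ValueError("not enough CPUs available")
    free = list(cpus_per_node)
    placement = []
    if not loop_nodes_first:
        # Use all CPUs of a node before moving on to the next node.
        for node, cpus in enumerate(free):
            take = min(cpus, num_procs - len(placement))
            placement.extend([node] * take)
    else:
        # One process per node in each pass, repeated until all are placed.
        while len(placement) < num_procs:
            for node in range(len(free)):
                if free[node] > 0 and len(placement) < num_procs:
                    free[node] -= 1
                    placement.append(node)
    return placement


print(place(4, [2, 2, 2]))                         # [0, 0, 1, 1]
print(place(4, [2, 2, 2], loop_nodes_first=True))  # [0, 1, 2, 0]
```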

PSI_NODES_SORT mode

This variable defines the sorting criterion used to reorder the nodes building a virtual partition. This order will be used to spawn remote processes. The following values of mode are recognized:

ROUNDROBIN

No sorting of nodes before a spawn request. The nodes are used in round robin fashion as they are set in PSI_NODES, PSI_HOSTS or PSI_HOSTFILE.

NONE

Same as ROUNDROBIN

LOAD

The nodes are sorted by load before new processes are spawned. Therefore nodes with the least load are used first.

To be more specific, the load average over the last minute is used as the sorting criterion, i.e. this option is equivalent to LOAD_1.

LOAD_1

The nodes are sorted according to the 1 minute load average.

This option is equivalent to LOAD.

LOAD_5

The nodes are sorted according to the 5 minute load average.

LOAD_15

The nodes are sorted according to the 15 minute load average.

PROC+LOAD

The nodes are sorted according to the sum of the 1 minute load and the number of running ParaStation processes. This leads to fair load balancing even if processes are started without notification to the ParaStation management facility.

PROC

The nodes are sorted by the number of running ParaStation processes before new processes are spawned. This is the default behavior.
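As an illustration of how the sorting modes differ, the sketch below orders three made-up nodes under several values of mode. The node data, the dictionary keys and the function `order_nodes` are assumptions for this example; LOAD_5 and LOAD_15 are omitted for brevity, and ties are resolved by the original order, which the text does not specify.

```python
# Hypothetical node state: 1 minute load average and running ParaStation processes.
nodes = [
    {"id": 0, "load_1": 0.2, "procs": 3},
    {"id": 1, "load_1": 2.5, "procs": 0},
    {"id": 2, "load_1": 0.9, "procs": 1},
]

SORT_KEYS = {
    "LOAD": lambda n: n["load_1"],
    "LOAD_1": lambda n: n["load_1"],
    "PROC": lambda n: n["procs"],
    "PROC+LOAD": lambda n: n["procs"] + n["load_1"],
}


def order_nodes(nodes, mode="PROC"):
    """Return node ids in the order they would be used for spawning."""
    if mode in ("NONE", "ROUNDROBIN"):
        # No sorting: keep the order given in the partition.
        return [n["id"] for n in nodes]
    return [n["id"] for n in sorted(nodes, key=SORT_KEYS[mode])]


print(order_nodes(nodes, "NONE"))       # [0, 1, 2]
print(order_nodes(nodes, "LOAD"))       # [0, 2, 1]
print(order_nodes(nodes, "PROC"))       # [1, 2, 0]
print(order_nodes(nodes, "PROC+LOAD"))  # [2, 1, 0]
```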

PSI_OVERBOOK

If defined, more processes than available CPUs may be placed on a node if necessary. If undefined, only as many processes will be placed on a node as unused CPUs (= number(CPU) - number(currently running processes)) are available.

See also set maxproc of psiadmin(8), which takes precedence over PSI_OVERBOOK.

PSI_TPP

Defines the number of cores allocated per process. If undefined, defaults to 1.

See also OMP_NUM_THREADS.
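Read together, the PSI_TPP and OMP_NUM_THREADS entries suggest the following resolution order for the number of cores per process. This is an interpretation of the two entries, not confirmed behavior; the function `cores_per_process` is hypothetical.

```python
def cores_per_process(env):
    """Resolve the cores allocated per process from an environment mapping.

    Assumed order: PSI_TPP overrides OMP_NUM_THREADS; the default is 1.
    """
    for name in ("PSI_TPP", "OMP_NUM_THREADS"):
        if name in env:
            return int(env[name])
    return 1


print(cores_per_process({"OMP_NUM_THREADS": "4"}))                  # 4
print(cores_per_process({"OMP_NUM_THREADS": "4", "PSI_TPP": "2"}))  # 2
print(cores_per_process({}))                                        # 1
```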

PSI_WAIT

If defined, a new job start request will be queued if not enough resources are currently available. See Chapter 3 and psmstart(1) for more details.

PSI_RARG_PRE_{n}

Preceding arguments for remote processes. For example: use PSI_RARG_PRE_0=/usr/bin/time to execute the process chain /usr/bin/time <yourApplication> <yourArgs> on the remote nodes.
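The resulting process chain can be sketched as below. The text only shows PSI_RARG_PRE_0; that further indices are consecutive and are prepended in ascending order is an assumption of this example, and `remote_command` is a hypothetical helper.

```python
def remote_command(env, app_argv):
    """Build the argument vector executed on the remote node.

    Collects PSI_RARG_PRE_0, PSI_RARG_PRE_1, ... (assumed consecutive)
    and places them in front of the application and its arguments.
    """
    prefix = []
    n = 0
    while f"PSI_RARG_PRE_{n}" in env:
        prefix.append(env[f"PSI_RARG_PRE_{n}"])
        n += 1
    return prefix + list(app_argv)


env = {"PSI_RARG_PRE_0": "/usr/bin/time"}
print(remote_command(env, ["./yourApplication", "arg1"]))
# ['/usr/bin/time', './yourApplication', 'arg1']
```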

PMI_BARRIER_ROUNDS

This parameter defines after how many PMI_BARRIER_TMOUT cycles a job will be terminated, if not all processes have joined the PMI barrier. Defaults to 1.

The parameter should remain at the default value in production environments. Its primary use is diagnostic: it allows the user to observe slower clients joining a PMI barrier over multiple timeout periods. As such, the parameter helps administrators identify possible filesystem or network issues that occur on specific client nodes.

Note

PMI barriers are totally unrelated to MPI barriers! This type of barrier is typically called during MPI_INIT().

PMI_BARRIER_TMOUT

The PMI_BARRIER_TMOUT variable defines the delay (in seconds) allowed for each process to successfully join a PMI barrier. If not all processes have joined, a corresponding warning is printed to stdout.

If PMI_BARRIER_TMOUT is not set, the timeout is 60 sec + (number of processes * 0.5 µsec). If PMI_BARRIER_TMOUT equals -1, no barrier timeout is used and the job will not be terminated when a process fails to join the barrier. If PMI_BARRIER_TMOUT is set to num, the timeout is num seconds.

See also ParaStation MPI Administrator's Guide.
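The three cases for PMI_BARRIER_TMOUT can be written down as a small function. This sketch takes the formula literally as stated above (0.5 microseconds per process) and is not taken from the ParaStation sources; `pmi_barrier_timeout` is a hypothetical name.

```python
def pmi_barrier_timeout(num_procs, tmout=None):
    """Return the barrier timeout in seconds, or None if no timeout applies.

    tmout: value of PMI_BARRIER_TMOUT, or None if the variable is unset.
    """
    if tmout is None:
        # Default: 60 sec plus 0.5 microseconds per process, as documented.
        return 60.0 + num_procs * 0.5e-6
    if tmout == -1:
        # -1 disables the barrier timeout entirely.
        return None
    return float(tmout)


print(pmi_barrier_timeout(1024))        # slightly above 60 seconds
print(pmi_barrier_timeout(1024, -1))    # None
print(pmi_barrier_timeout(1024, 120))   # 120.0
```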

__PSI_NO_PINPROC

If set, suppress pinning of processes, even if enabled globally (value irrelevant).

__PSI_NO_BINDMEM

If set, suppress binding to memory-node, even if enabled globally (value irrelevant).

Variables controlling the communication layer

These variables control the individual communication paths used by the pscom library. Communication paths may be different interconnects and/or protocols. In addition, tuning variables for the particular communication paths are listed.

The following table lists all currently available communication paths in descending order. Using these variables, transports may be prioritized or completely disabled. Assigning a value of 0 to a variable completely disables this communication path. Assigning a value of 2 or more prioritizes the path over all others.

Table 3. Variables controlling the pscom communication paths

Variable name  Communication path           Description
PSP_SHM        Shared memory                Used only for communication within a node. Disabled otherwise. Identical to the deprecated variable PSP_SHAREDMEM.
PSP_OPENIB     InfiniBand (libopenib)
PSP_OFED       InfiniBand (libopenib)       Using UD
PSP_MVAPI      InfiniBand (libmvapi)
PSP_ELAN       QsNet                        Disabled by default.
PSP_DAPL       InfiniBand (libdapl)
PSP_GM         Myrinet (libgm)
PSP_P4S        ParaStation p4sock protocol  Identical to the deprecated variable PSP_P4SOCK.
PSP_TCP        TCP

Note

Not all transports may be available at run time due to missing hardware or low level libraries. Furthermore, not all transports are enabled within the precompiled packages.
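A launcher script could prepare such settings before starting a job. The following Python sketch only builds the environment mapping using the values 0 (disable) and 2 (prioritize) described above; the helper `transport_env` is hypothetical, and passing the result to a process launcher is left to the caller.

```python
import os


def transport_env(disable=(), prefer=(), base=None):
    """Return a copy of the environment with pscom path variables set.

    disable: variable names to set to "0" (path disabled).
    prefer:  variable names to set to "2" (path prioritized).
    base:    environment to start from; defaults to os.environ.
    """
    env = dict(os.environ if base is None else base)
    for name in disable:
        env[name] = "0"
    for name in prefer:
        env[name] = "2"
    return env


# Disable TCP and prefer InfiniBand, starting from a minimal environment.
print(transport_env(disable=["PSP_TCP"], prefer=["PSP_OPENIB"],
                    base={"PATH": "/usr/bin"}))
```

The returned mapping could, for example, be handed to `subprocess.run(..., env=env)` when invoking the job launcher.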

PSP_LIB

Using this environment variable, it is possible to define the communication library to use, independent of the variables mentioned above. This library must match the currently available interconnect and protocol, otherwise an error will occur.

The library name must be specified using the full path and filename, e.g. PSP_LIB=/opt/parastation/lib64/libpsport4openib.so.

PSP_NETWORK network [, network]

A comma or space separated list of networks enabled to do optimized ParaStation communication using the p4sock protocol or TCP. Each network is a resolvable hostname in the chosen network, the IP address of a host in this network or the IP address of this network. The corresponding network has to be bound to a NIC of the current node.

If PSP_NETWORK is set, each network should be bound to a distinct NIC. This card then is used in order to do communication operations. If more than one network is given, the first one found to be bound to a local NIC is used.

If PSP_NETWORK is not set, ParaStation uses the NIC bound to the IP address that the local hostname resolves to.

PSP_RETRY count

Retry counter for all connect() calls within the pscom library. Default is 3.

PSP_TCP_BACKLOG count

TCP listen() backlog length. Only required for pscom library versions below 5.0.34.

The actual backlog is the minimum of PSP_TCP_BACKLOG and net.core.somaxconn, defined by the operating system.

Tuning Parameters

PSP_ONDEMAND

If set to 1, use "on demand" connections with PSP_OPENIB. This means that connections between ranks are established, and their associated communication buffers allocated, when the first byte is sent. This could cause application aborts at any time if the application runs out of resources (e.g. a final all-to-one communication pattern could fail)! The default is to establish all connections at startup time (inside MPI_Init()), which ensures that there are enough resources available for all connections. If not, MPI_Init() will fail.

PSP_SO_SNDBUF, PSP_SO_RCVBUF

These variables define the TCP buffer size used for TCP sockets. Defaults to 32k.

PSP_TCP_NODELAY

If set to 1 (default), the socket option TCP_NODELAY will be used for TCP sockets.

PSP_TCP_BACKLOG

Controls the size of the TCP backlog when listening for new connections.

PSP_SCHED_YIELD

If set to 1, call sched_yield() in polling loops instead of busy polling. This might improve shared memory performance considerably when more than one process is running per CPU core, but it slows down communication in the common case of one process per core. See also PSI_OVERBOOK.

PSP_OPENIB_PATH_MTU

Controls the path MTU of InfiniBand connections. Default is 3, which corresponds to 1024 bytes. (1 = 256 bytes, 2 = 512 bytes, 3 = 1024 bytes)

PSP_OPENIB_SENDQ_SIZE, PSP_OPENIB_RECVQ_SIZE

These variables define the InfiniBand buffer counts used for InfiniBand connections. (Default = 16)

Variables controlling the logger/forwarder

In order to modify the behavior of the logger and the forwarders controlling the remotely spawned processes, the following environment variables can be used:

PSI_INPUTDEST rank

If set, psilogger will forward all input to the process with the corresponding rank within the process group. The default is to give all available input to process 0.

PSI_RUSAGE

If set, psilogger will print a message about the user and system time consumed by each process of the parallel task upon exit of this process.

PSI_SOURCEPRINTF

If set, psilogger gives information about the source of the received output, i.e. it will prepend every output by “[id]:”, where id is the rank of the printing process within the process group. Usually the id coincides with the MPI-rank. If PSI_LOGGERDEBUG is also set, every output is prepended by “[id, len]”, where id is the rank again and len is the length of the printed message in bytes.
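The two prefix forms can be illustrated with a short sketch. The exact spacing after the prefix and whether len includes a trailing newline are not stated in the text, so both are assumptions here; `tag_output` is a hypothetical function and not part of psilogger.

```python
def tag_output(rank, text, logger_debug=False):
    """Prepend the source prefix to one chunk of output.

    rank: rank of the printing process within the process group.
    logger_debug: models PSI_LOGGERDEBUG being set as well.
    Assumption: len is the byte length of the chunk, newline included.
    """
    if logger_debug:
        prefix = f"[{rank}, {len(text.encode())}]"
    else:
        prefix = f"[{rank}]:"
    return prefix + text


print(tag_output(3, "hello\n"), end="")
print(tag_output(3, "hello\n", logger_debug=True), end="")
```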

PSI_NOMSGLOGGERDONE

If set, psilogger will not print out the message “PSIlogger: done” at the end of a parallel run.

PSI_LOGGERDEBUG

If set, psilogger gives debug output about connecting and detaching clients as well as received output from the clients.

PSI_FORWARDERDEBUG

If set, debug output of the psiforwarder about connected programs, received input and received output is printed.

Variables controlling MPIch

The environment variables within this section might be used less frequently. They are mainly listed within this document for completeness.

MPID_PSP_MAXSMALLMSG

Length (in bytes) of the largest message sent without rendezvous.

MPID_PSP_START

Define the method used in order to spawn remote processes. The possible values are:

PSID

Start remote processes with the ParaStation start mechanism.

This is the default. If MPID_PSP_START is not set at all, ParaStation is used in order to spawn remote processes.

SSH

Start remote processes with ssh(1). MPID_PSP_HOSTS must be set.

NONE

Do not start any remote process. The remote processes must be started manually. A command-line template is printed to stdout.

This start mode is for debugging purposes only and should not be used by the end-user.

MPID_PSP_HOSTS hostname [, hostname]...

Comma separated list of hostnames. Used for MPID_PSP_START=SSH only.

Variables controlling the TCP bypass

The environment variables within this section control the TCP bypass.

LD_PRELOAD

Defines (among others) the path to the preload library required to enable the TCP bypass. It must be set to /opt/parastation/lib64/libp4tcp.so.

Variables controlling debugging

The environment variables within this section control the debug information output by ParaStation.

PSI_DEBUGMASK

Defines the debug mask controlling the process management information. The following bits are defined:

Table 4. PSI_DEBUGMASK flags

Bit pattern  Name           Description
0x0001       PSC_LOG_PART   partitioning functions (i.e. PSpart_())
0x0002       PSC_LOG_TASK   task structure handling (i.e. PStask_())
0x0004       PSC_LOG_VERB   Various, less interesting messages
0x0010       PSI_LOG_PART   partition handling
0x0020       PSI_LOG_SPAWN  spawning
0x0040       PSI_LOG_INFO   info requests
0x0080       PSI_LOG_COMM   daemon communication
0x0100       PSI_LOG_VERB   more verbose stuff, e.g. function calls

These debug flags may be set as hex numbers, e.g. PSI_DEBUGMASK=0x07.
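To combine several flags, their bit patterns are OR-ed together. The sketch below uses the values from Table 4; the dictionary and the helper `debugmask` are illustrative names, not part of ParaStation.

```python
# Bit patterns as listed in Table 4.
PSI_DEBUG_FLAGS = {
    "PSC_LOG_PART": 0x0001,
    "PSC_LOG_TASK": 0x0002,
    "PSC_LOG_VERB": 0x0004,
    "PSI_LOG_PART": 0x0010,
    "PSI_LOG_SPAWN": 0x0020,
    "PSI_LOG_INFO": 0x0040,
    "PSI_LOG_COMM": 0x0080,
    "PSI_LOG_VERB": 0x0100,
}


def debugmask(*names):
    """Return a hex string suitable as a PSI_DEBUGMASK value."""
    mask = 0
    for name in names:
        mask |= PSI_DEBUG_FLAGS[name]
    return f"0x{mask:04x}"


print(debugmask("PSI_LOG_SPAWN", "PSI_LOG_COMM"))                  # 0x00a0
print(debugmask("PSC_LOG_PART", "PSC_LOG_TASK", "PSC_LOG_VERB"))   # 0x0007
```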

PSP_DEBUG

Defines the debugging level for the ParaStation psport4 library. Higher values generally give more output.

See also

process_placement(7)