Name

parastation.conf — the ParaStation MPI configuration file

Description

Upon execution, the ParaStation MPI daemon psid(8) reads its configuration information from a configuration file which, by default, is /etc/parastation.conf. There are various parameters that can be modified persistently within this configuration file.

The basic syntax of the configuration file is one parameter per line. For ease of use, some parameters, e.g. Nodes, are implemented in an environment mode. This mode allows multiple values to be set by a single command. Environment mode parameters may span more than one line.

Line continuation is possible. If the last character within a line before the newline character is a "\", the newline character will be ignored and the next line is appended to the current line.

Comments start with a "#". All remaining characters on the line are ignored. Keep in mind that line continuation also works within comments, i.e. if the last character of a comment line is a "\", the next line is ignored, too.

The parser used to analyze parastation.conf is not case sensitive. This means that all keywords within the configuration file may be written in any combination of upper- and lowercase characters. Within this document a mixed upper-/lowercase notation is used to provide more readable keywords. The same notation is used in the configuration file template parastation.conf.tmpl contained in the distributed ParaStation MPI system. The template file can be found in /opt/parastation/config.
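
As a short illustration of the lexical rules above, the following fragment is a sketch that combines comments, a line continuation and differently cased keywords. The values are placeholders chosen for illustration only.

```
# A comment runs to the end of the line.
installdir /opt/parastation      # same effect as: InstallDir /opt/parastation

# A trailing backslash joins the next line to this one.
SelectTime \
    2

# Pitfall: this comment ends with a backslash, \
LogLevel 5
# so the LogLevel line above belongs to the comment and is ignored.
```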

Parameters

The different parameters are discussed in the order they should appear within the configuration file. Dependencies between parameters - resulting in a defined order of parameters - are marked explicitly.

Some parameters may be modified using different keywords, e.g. both InstallDir and InstallationDir modify the directory where the ParaStation MPI daemon psid(8) expects the ParaStation MPI system to be installed. Where different keywords modify the same resource, all keywords are listed at the head of the parameter's discussion.

Only a few parameters must be declared in every case to enable ParaStation MPI to run on a cluster. These parameters are HWType and Nodes.
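
A minimal sketch of such a configuration might look as follows. It assumes that the Hardware declaration for ethernet from the template file is already present earlier in the file; the hostnames node01 and node02 are hypothetical.

```
# Assumes the Hardware declaration for ethernet appears above (e.g. copied from the template).
HWType ethernet

Nodes {
    node01 0
    node02 1
}
```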

If a parameter is declared more than once, the last declaration takes effect. Do not rely on this behavior as a feature, since it can easily lead to pitfalls.

InstallDir inst-dir , InstallationDir inst-dir

Tell the ParaStation MPI daemon to find all the ParaStation MPI related files in inst-dir. The default is /opt/parastation.

Hardware name

Tell the ParaStation MPI daemon how to handle a particular type of hardware. Usually it is not necessary to edit these entries, since the template version of the configuration file contains up-to-date entries for all supported hardware types. Furthermore, a deeper insight into the low-level functionality of ParaStation MPI is needed in order to create such an entry.

Nevertheless, a brief overview of the structure of the Hardware entries is given here.

The following five types of parameters within the Hardware environment receive special handling by the ParaStation MPI daemon psid(8). They define script files that are called to execute various operations on the corresponding communication hardware.

All these entries have the form of the parameter's name followed by the corresponding value. The value might be enclosed by single or double quotes in order to allow a space within.

The values are interpreted as absolute or relative paths. Relative paths will be looked up relative to InstallDir. If one or more of the scripts are not defined, no corresponding action will take place for this hardware.

startscript

Define a script called in order to start up the corresponding communication hardware. This script will be executed when the daemon starts up or after a reset of the communication hardware.

stopscript

Define a script called in order to shut down the corresponding communication hardware. This script will be executed when the daemon exits or before a reset of the communication hardware.

setupscript

Define a script called in order to set special parameters on the corresponding communication hardware.

statusscript

Define a script called in order to get a status message from the corresponding communication hardware. This is mainly used to generate the lines shown by the status counter directive of the ParaStation MPI administration tool psiadmin(1).

headerscript

Define a script called in order to get a header line for the status message produced by the statusscript discussed above.

All further parameters defined within a Hardware section are interpreted as environment variables when calling the scripts defined above. Again, these parameters have the form of the parameter's name, interpreted as the environment variable's name, followed by the corresponding value. The values may be single strings not containing whitespace characters, or strings enclosed by single or double quotes.

The impact of the environment variables on the scripts of course depends on the scripts themselves.
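
To make the structure concrete, the following sketch declares a user-defined hardware type. It assumes that the Hardware environment uses the same brace-delimited form as the other environment-mode parameters. The name mynet, the script paths and the variable MYNET_DEBUG are hypothetical and not part of the ParaStation MPI distribution.

```
Hardware mynet {
    # Relative paths are looked up relative to InstallDir.
    startscript   config/mynet_start
    stopscript    config/mynet_stop
    statusscript  /usr/local/sbin/mynet_status
    # setupscript and headerscript are left undefined, so no action is taken for them.

    # Any other parameter is passed to the scripts as an environment variable.
    MYNET_DEBUG   1
}
```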

Various hardware types are defined within the template configuration file coming with the ParaStation MPI software distribution. These hardware types, the corresponding scripts and the environment variables the scripts understand are briefly discussed within the following lines.

Note

Shared memory will be used as hardware type for communication within an SMP node. As there are no options for this kind of hardware, no dedicated section is provided.

ethernet

Use classical TCP/IP communication over Ethernet via an optimized MPI implementation.

Since TCP/IP has to be configured before ParaStation MPI starts up, the corresponding script ps_ethernet has almost nothing to do and hence does not understand a single environment variable.

p4sock

Use optimized communication via (Gigabit) Ethernet.

The script handling this hardware type, ps_p4sock, is located in the config subdirectory. It understands the following environment variable:

PS_TCP

If set to an address range, e.g. 192.168.10.0-192.168.10.128, the TCP bypass feature of the p4sock protocol is enabled for the given address range.

openib

Use the OpenFabrics verbs layer for communication over InfiniBand.

No script is currently implemented for this communication protocol, therefore no environment variables are recognized.

mvapi

Use the Mellanox verbs layer for communication over InfiniBand.

No script is currently implemented for this communication protocol, therefore no environment variables are recognized.

gm

Use communication over GM (Myrinet).

The script ps_gm will load the Myrinet gm driver.

PS_IPENABLED

If set to 1, the IP device myri0 is enabled after loading.

elan

Use communication over QsNet (libelan).

No script is currently implemented for this communication protocol, therefore no environment variables are recognized.

This communication layer is currently not supported by the ParaStation MPI communication library, therefore only programs linked with the QsNet MPI will work.

ipath

Use communication over InfiniPath.

No script is currently implemented for this communication protocol, therefore no environment variables are recognized.

This communication layer is currently not supported by the ParaStation MPI communication library, therefore only programs linked with the InfiniPath MPI will work.

dapl

Use communication over a generic DAPL layer.

No script is currently implemented for this communication protocol, therefore no environment variables are recognized.

accounter

This is actually a pseudo communication layer. It is only used for configuring nodes running the ParaStation MPI accounting daemon and should be used only in a particular Nodes entry.

NrOfNodes num

This configuration parameter is no longer required and will be silently ignored.

HWType { ethernet | p4sock | openib | mvapi | gm | elan | dapl | none }

HWType { { ethernet | p4sock | openib | mvapi | gm | elan | dapl | none }... }

Define the default communication hardware available on the nodes of the ParaStation MPI cluster. This may be overruled by an explicit HWType option in a Node statement.

The hardware types used within this command must be defined beforehand in Hardware declarations.

Further hardware declarations may be defined by the user, but this is largely undocumented.

It is possible to enable more than one hardware type, either as default or on a per node basis.

The default value of HWType is none.
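
The two forms might be used as follows. This is a sketch in which only one of the declarations would normally appear, since a repeated declaration replaces the earlier one.

```
# Single default hardware type:
HWType p4sock

# Alternatively, several default hardware types at once:
# HWType { ethernet p4sock }
```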

starter { true | yes | 1 | false | no | 0 }

If the argument is one of yes, true or 1, all nodes declared within a Node statement are allowed to start parallel tasks, unless otherwise stated.

If the argument is one of no, false or 0, starting will not be allowed.

It might be useful to prohibit the startup of parallel tasks from the frontend machine if a batch system is used. This forces all users to use the batch system in order to start their tasks. Otherwise it would be possible to circumvent the batch system by starting parallel tasks directly from the frontend machine.

The default is to allow the starting of parallel tasks from all nodes.

runJobs { true | yes | 1 | false | no | 0 }

If the argument is one of yes, true or 1, all nodes declared within a Node statement are allowed to run processes of parallel tasks, unless otherwise stated.

If the argument is one of no, false or 0, ParaStation MPI will not start processes on these nodes.

It might be useful to prohibit the start of processes on a frontend machine, since usually this machine is reserved for interactive work done by the users. Even if the execution of processes is forbidden on a particular node, parallel tasks may still be started from this node.

The default is to allow all nodes to run processes of parallel tasks.
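
The interplay of the starter and runJobs defaults with per-node overrides can be sketched as follows. The hostnames are hypothetical, and the fragment omits the HWType and Hardware declarations a complete file would need.

```
starter yes
runJobs yes

Nodes {
    frontend 0 runJobs no     # may start tasks, but runs no task processes
    node01   1                # uses both defaults
    node02   2 starter no     # runs processes, but tasks cannot be started here
}
```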

Node[s] hostname id [HWType-entry] [starter-entry] [runJobs-entry] [env name value] [env { name value ... }]

Node[s] { {hostname id [HWType-entry] [starter-entry] [runJobs-entry] [env name value] [env { name value ... }] }... }

Node[s] $GENERATE from-to/step nodestr idstr [HWType-entry] [starter-entry] [runJobs-entry] [env name value] [env { name value ... }]

Define one or more nodes to be part of the ParaStation MPI cluster.

This is the first example of a parameter that supports the environment mode. This means there are two different notations for this parameter. The first one defines a single node; the second one registers more than one node within a single command. It is a convenient form that avoids repeating the keyword for every entry.

Each entry has to have at least two items, the hostname and the id. This will tell the ParaStation MPI system that the node called hostname will act as the physical node with ParaStation MPI ID id.

hostname is either a resolvable hostname or an IP address in dot notation (e.g. 192.168.1.17). id is an integer in the range from 0 to the maximum number of nodes minus one.

Further optional items such as HWType-entry, starter-entry or runJobs-entry may overrule the default values of the hardware type on the node, the ability to start parallel jobs from this node, or the possibility to run processes on this node, respectively. These entries have the same syntax as the stand-alone commands that set the corresponding default value.

E.g. the line

Node node17 16 HWType { ethernet p4sock } starter yes runJobs no

will define the node node17 to have the ParaStation MPI ID 16. Furthermore, it is expected to have an Ethernet connection using both the TCP and p4sock protocols. Parallel tasks may be started from this node, but the node itself will not run any process of any parallel task (except the ParaStation MPI logger processes of the tasks started on this node).

The option environment or env allows per-node environment variables to be set. Using the first form, the variable name is set to value. Using the second form, with braces, more than one name/value pair may be given. More complex values may be given using quotation marks:

Node node17 16 environment LD_LIBRARY_PATH /mypath
Node node18 17 env { PSP_P4S "2" PSP_OPENIB "0" }

This example will define the variable LD_LIBRARY_PATH to /mypath for node node17 and the variables PSP_P4S and PSP_OPENIB to 2 and 0 for node node18.

The $GENERATE form defines a group of nodes at once using a simple syntax. The parameters from and to define a range, which is traversed in increments of step. Each value in this range may be referenced within nodestr and idstr using a syntax of $[{offset[,width[,base]]}]. E.g., the entry

$GENERATE 1-96  node${0,2} ${0}

defines the nodes node01 up to node96 using the IDs 1 to 96, respectively. Further node-specific attributes may be defined as described above.
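
As a further sketch, the following line uses a step and an offset. The expansion shown in the comments assumes that offset is added to the current value of the range and that width pads with leading zeros, as the node01 example above suggests. It has not been checked against the parser.

```
Nodes $GENERATE 0-6/2 node${1,3} ${0}

# Expected expansion under the assumptions stated above:
#   node001 0
#   node003 2
#   node005 4
#   node007 6
```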

LicenseServer hostname , LicServer hostname

LicenseFile lic-file , LicFile lic-file

LicenseDeadInterval num , LicDeadInterval num

These entries are silently ignored by this version of ParaStation MPI.

SelectTime time

Set the timeout of the central select(2) of the ParaStation MPI daemon psid(8) to time seconds.

The default value is 2 seconds.

Note

This parameter can be set during runtime via the set selecttime directive within the ParaStation MPI administration and management tool psiadmin(1).

DeadInterval num

The ParaStation MPI daemon psid(8) will declare other daemons as dead after num consecutively missing multicast pings.

After declaring a node as dead, all processes residing on this node are also declared dead. This results in sending signals to all processes on the local node that have requested to get informed about the death of one of these processes.

The default value is 10.

For now, the multicast period is set to two seconds, i.e. every daemon sends a multicast ping every two seconds. This results in declaring a daemon as dead after 20 seconds for the default value.
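
The relation between DeadInterval and the detection time can be illustrated with a non-default value. The figure of 15 is chosen purely for illustration.

```
# 15 missed pings x 2 s ping period = 30 s until a silent daemon is declared dead.
DeadInterval 15
```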

LogLevel num

Set the debugging level of the ParaStation MPI daemon psid(8) to num.

Note

For values of num larger than 10 the daemon writes a huge number of messages to the logging destination, which is usually the syslog(3).

This parameter can be set during runtime via the set psiddebug directive within the ParaStation MPI administration and management tool psiadmin(1).

LogDest { LOG_DAEMON | LOG_KERN | LOG_LOCAL[0-7] }

LogDestination { LOG_DAEMON | LOG_KERN | LOG_LOCAL[0-7] }

Set the logging output's destination for the ParaStation MPI daemon psid(8). Usually the daemon prints logging output using the syslog(3) mechanism, unless an alternative logging file is requested via psid(8)'s -l option.

In order to collect all the ParaStation MPI specific log messages into a special file, the facility argument of the openlog(3) function call in cooperation with a suitable setup of the syslogd(8) may be used. This parameter will set the argument to one of the mentioned values.

The default value is LOG_DAEMON.
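
A sketch of how the log messages might be separated from other daemon output is given below. The syslogd(8) rule in the comment uses the classic syslog.conf notation and is not part of parastation.conf; the log file path is hypothetical.

```
LogDest LOG_LOCAL0

# A matching rule for a classic syslogd(8) setup could look like this:
#   local0.*    /var/log/parastation.log
```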

MCastGroup group-num

Tell psid(8) to use the multicast group group-num for multicast communication to other daemons.

The default group to use is 237.

MCastPort portno

Tell psid(8) to use the UDP port portno for multicast communication to other daemons.

The default port to use is 1889.

RDPPort portno

Tell psid(8) to use the UDP port portno for the RDP communication protocol to other daemons.

The default port to use is 886.

RLimit { Core size | CPUTime time | DataSize size | MemLock size | StackSize size | RSSize size | NoFile num }

RLimit { { Core size | CPUTime time | DataSize size | MemLock size | StackSize size | RSSize size | NoFile num }... }

Set various resource limits for psid(8) and thus for all processes started from it.

All limits are set using the setrlimit(2) system call. For a detailed description of the different types of limits please refer to the corresponding manual page.

If no RLimits are set within the ParaStation MPI configuration file, no changes are made to the system's default values.

The following (soft) resource limits may be set:

Core size

Set the maximum size of a core-file to size kilobytes. size is an integer number, the string “infinity” or the string “unlimited”. In the two latter cases the core-file size is set to RLIM_INFINITY.

Note

Starting with version 5.0.3, this configuration will also control the writing of core-files for the psid itself, in case a catastrophic failure occurs.

CPUTime time

Set the maximum CPU time that might be consumed by the daemon to time seconds. time has to be an integer number, the string “infinity” or the string “unlimited”. In the two latter cases the CPU time limit is set to RLIM_INFINITY.

DataSize size

Set the maximum data size to size kilobytes. size is an integer number, the string “infinity” or the string “unlimited”. In the two latter cases the data size is set to RLIM_INFINITY.

MemLock size

Set the maximum amount of memory that might be locked into RAM to size kilobytes. size is an integer number, the string “infinity” or the string “unlimited”. In the two latter cases the amount of lockable memory is set to RLIM_INFINITY.

StackSize size

Set the maximum stack size to size kilobytes. size is an integer number, the string “infinity” or the string “unlimited”. In the two latter cases the stack size is set to RLIM_INFINITY.

RSSize size

Set the maximum Resident Set Size (RSS) to size pages. size is an integer number, the string “infinity” or the string “unlimited”. In the two latter cases the RSS is set to RLIM_INFINITY.

NoFile num

Set the maximum number of open files to num. Be aware of the fact that inherited limits are confined by psid's hard limits.
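
The environment form of RLimit might be used as sketched below. The values are illustrative only, and the layout of one limit per line is an assumption based on the other environment-mode parameters.

```
RLimit {
    Core      unlimited    # no limit on the core-file size
    StackSize 8192         # kilobytes
    MemLock   infinity     # no limit on locked memory
    NoFile    4096         # open files
}
```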

Env | Environment name value

Env | Environment { {name value }... }

Set environment variables for the ParaStation MPI daemon psid(8) and any application started via this daemon.

This command again has two different modes. The first form sets exactly one variable, whereas the environment form sets as many variables as desired. In the latter case, the general form is one variable per line.

The value part of each line either is a single word or an expression enclosed by single or double quotes. The expression might contain whitespace characters. If the expression is enclosed by single quotes, it is allowed to use balanced or unbalanced double quotes within this expression and vice versa.

This command might be used, for example, to set the PSP_NETWORK environment variable globally, so that users need not adjust this parameter in their own environments.
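
The quoting rules can be sketched with the environment form as follows. The variable names are hypothetical and carry no meaning for ParaStation MPI.

```
Environment {
    MY_APP_MODE   batch
    MY_APP_BANNER "two words"
    MY_APP_QUOTE  'say "hello"'
}
# The single-variable form would be, e.g.:  Env MY_APP_MODE batch
```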

freeOnSuspend { true | yes | 1 | false | no | 0 }

If the argument is one of yes, true or 1, suspending a task by sending the signal SIGTSTP to the logger causes all resources (CPUs) currently claimed by this task to be treated as free.

If the argument is one of no, false or 0, ParaStation MPI will not treat the resources as free after SIGTSTP has been sent.

handleOldBins { true | yes | 1 | false | no | 0 }

If the argument is one of yes, true or 1, compatibility mode for applications linked with ParaStation MPI version 4.0 up to 4.0.6 will be enabled. Keep in mind that this behavior might collide with the freeOnSuspend feature.

If the argument is one of no, false or 0, ParaStation MPI will disable compatibility mode.

UseMCast { true | yes | 1 | false | no | 0 }

If the argument is one of yes, true or 1, keep alive messages from the ParaStation MPI daemon psid(8) are sent using Multicast messages.

If the argument is one of no, false or 0, ParaStation MPI will use its own RDP protocol for keep alive messages. This is the default.
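
A sketch that enables multicast keep alive messages and spells out the related defaults documented in this file might look like this:

```
UseMCast     yes
MCastGroup   237     # default
MCastPort    1889    # default
DeadInterval 10      # default: 10 missed pings
```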

PSINodesSort { PROC | LOAD_1 | LOAD_5 | LOAD_15 | PROC+LOAD | NONE }

Define the default sorting strategy for nodes when attaching them to a partition. The different possible values have the following meaning:

PROC

Sort by the number of processes managed by ParaStation MPI on the corresponding nodes

LOAD_1

Sort by the load average during the last minute on the corresponding nodes

LOAD_5

Sort by the load average during the last 5 minutes on the corresponding nodes

LOAD_15

Sort by the load average during the last 15 minutes on the corresponding nodes

PROC+LOAD

Sort by the sum of the number of processes managed by ParaStation MPI and the load average during the last minute on the corresponding nodes

NONE

Do not sort at all.

This only comes into play if the user does not define a sorting strategy explicitly via PSI_NODES_SORT. Be aware that using a batch system like PBS or LSF *will* set the strategy explicitly, namely to NONE.

overbook { true | yes | 1 | false | no | 0 }

If the argument is one of yes, true or 1, all nodes may be overbooked by the user using the PSI_OVERBOOK environment variable.

If the argument is one of no, false or 0, ParaStation MPI will deny overbooking of the nodes, even if PSI_OVERBOOK is set.

processes maxprocs

Define the maximum number of processes per node.

This parameter can be set during runtime via the set maxproc directive within the ParaStation MPI administration and management tool psiadmin(1).

pinProcs { true | yes | 1 | false | no | 0 }

Enables or disables process pinning for compute tasks. If enabled, tasks will be pinned down to particular CPU-slots. The mapping between those CPU-slots and physical CPUs and cores is made using a mapping list. See CPUmap below.

The pinProcs parameter can be set during runtime via the set pinprocs directive within the ParaStation MPI administration and management tool psiadmin(1).

bindMem { true | yes | 1 | false | no | 0 }

This parameter must be set to true if nodes providing Non-Uniform Memory Access (NUMA) should use 'local' memory for the tasks.

This parameter can be set during runtime via the set bindmem directive within the ParaStation MPI administration and management tool psiadmin(1).

CPUmap { map }

Set the map used to assign CPU-slots to physical cores to map. map is a quoted string containing a space-separated permutation of the numbers 0 to Ncore-1, where Ncore is the number of physical cores available on this node. The number of cores within a particular node may be determined via the list hw directive of psiadmin(1). The first number in map is the number of the physical core the first CPU-slot will be mapped to, and so on.

This parameter can be set during runtime via the set cpumap directive within the ParaStation MPI administration and management tool psiadmin(1).
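
The following sketch assumes a node with four physical cores and uses the brace notation from the synopsis above. The particular permutation is chosen for illustration only.

```
pinProcs yes
bindMem  yes
CPUmap { 0 2 1 3 }
# CPU-slot 0 -> core 0, slot 1 -> core 2, slot 2 -> core 1, slot 3 -> core 3
```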

supplGrps { true | yes | 1 | false | no | 0 }

This parameter must be set to true if processes spawned by ParaStation MPI should belong to all groups defined for this user. Otherwise, they will only belong to the primary group.

This parameter can be set during runtime via the set supplementaryGroups directive within the ParaStation MPI administration and management tool psiadmin(1).

rdpMaxRetrans number

Set the maximum number of retransmissions within the RDP facility. If more than this number of retransmissions would have been necessary to deliver a packet to the remote destination, the connection is declared to be down.

See also psiadmin(1).

statusBroadcasts number

Set the maximum number of status broadcasts per round. This is used to limit the number of status-broadcasts per status-iteration. Too many broadcasts might exhaust the message-buffers within RDP on huge clusters.

If more than this number of broadcasts are triggered during one status-iteration, all future broadcasts will be ignored. The corresponding counter is reset upon start of the next status iteration.

A value of 0 will completely suppress sending of status-broadcasts. In this case information on dead nodes will be propagated only by sending ACTIVENODES messages upon receipt of too many wrong LOAD messages.

Only relevant, if MCast is *not* used.

See also psiadmin(1).

rdpTimeout ms

The timeout of the actual timer registered by RDP in milliseconds. Each time the corresponding timer elapses, handleTimeoutRDP() is called to handle all necessary resend activities. This parameter steers the actual load introduced by RDP. Within the daemon, there is a lower limit of 100 msec for all timeout-timers. Thus, the minimal value here is 100, too.

deadLimit number

Dead-limit of the RDP status module. After this number of consecutively missing RDP-pings the master declares the node to be dead.

Only relevant, if MCast is *not* used.

statusTimeout ms

Timeout of the RDP status module. After this number of milliseconds an RDP-ping is sent to the master daemon. Additionally, the master daemon checks for received ping-messages. Within the daemon, there is a lower limit of 100 msec for all timeout-timers. Thus, the minimal value here is 100, too.

Only relevant, if MCast is *not* used.
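
How statusTimeout and deadLimit combine can be sketched as follows. The values are illustrative and not the documented defaults, and the resulting detection time is an estimate derived from the two descriptions above rather than a guaranteed figure.

```
UseMCast      no     # the RDP status module is only relevant without multicast
statusTimeout 2000   # one RDP-ping every 2000 ms
deadLimit     10     # 10 consecutively missing pings
# A silent node would be declared dead after roughly 10 x 2000 ms = 20 s.
```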

rdpClosedTimeout ms

The closed timeout within the RDP facility in milliseconds. If an RDP-connection is closed, all messages from the corresponding partner are ignored during this timeout. Thus, reconnection is avoided during this period. This helps to handle packets still on the wire when a connection is closed.

rdpResendTimeout ms

The resend timeout within the RDP facility in milliseconds. If a pending message is available and not yet acknowledged, this is the timeout after which the message is retransmitted to the remote host.

rdpMaxACKPend number

The maximum number of pending ACKs within the RDP facility. If this number of packets is received from a remote node consecutively without any retransmission, an explicit ACK is sent. Otherwise the ACK is sent piggyback within the next regular packet to this node or as soon as a retransmission occurred.

If set to 1, each RDP packet received is acknowledged by an explicit ACK.

pinProcs { true | yes | 1 | false | no | 0 }

Enable pinning of processes to distinct processor cores.

bindMem { true | yes | 1 | false | no | 0 }

Enable binding of processes to distinct memory nodes on NUMA systems.

CPUmap { list of ids }

Define map assigning logical process-slots to physical processor cores.

allowUserMap { true | yes | 1 | false | no | 0 }

Enable users to influence local mapping of processes via providing a __PSI_CPUMAP environment variable.

startupScript script

This script is called during startup of the ParaStation daemon. It is given either as an absolute path or relative to InstallDir. Parsing the configuration file will fail if the script is not found. Depending on the script's return value, the daemon continues startup without any action (0), some output of the script is written to the daemon's log (-1), or the daemon stops immediately (-2).

As a default no script is defined and, thus, nothing is called.

maxStatTry num

Maximum number of tries to stat() an executable before spawning new processes. Increasing this number might help on overloaded NFS-servers.

RDPstatistics { true | yes | 1 | false | no | 0 }

Enable or disable the RDP statistics. If enabled, statistics on the total number of messages sent and the mean time to ACK are determined per connection.

nodeUpScript script

This script is called by the currently elected master of the ParaStation daemons every time a node becomes active from ParaStation's point of view. I.e. the node's daemon connects to the master daemon for the first time after being down.

This might be used to pass this type of information into a batch-system or some monitoring facility like the GridMonitor.

When the script is called, two additional variables are passed within the environment:

NODE_NAME

The hostname of the node that appeared. The actual value is the IP address implicitly defined in the Nodes-section of this file resolved to the official name of the host using gethostbyaddr().

NODE_ID

The ParaStation MPI ID of the node that appeared.

As a default no script is defined and, thus, nothing is called.

nodeDownScript script

This script is called by the currently elected master of the ParaStation daemons every time a node goes down from ParaStation's point of view. I.e. the node's daemon disconnects from the master daemon after being connected.

This might be used to pass this type of information into a batch-system or some monitoring facility like the GridMonitor.

When the script is called, two additional variables are passed within the environment:

NODE_NAME

The hostname of the node that disappeared. The actual value is the IP address implicitly defined in the Nodes-section of this file resolved to the official name of the host using gethostbyaddr().

NODE_ID

The ParaStation MPI ID of the node that disappeared.

As a default no script is defined and, thus, nothing is called.
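
Both hooks might be registered as sketched below. The script paths are hypothetical; each script would find the affected node in the NODE_NAME and NODE_ID environment variables described above.

```
nodeUpScript   /usr/local/sbin/ps_node_up
nodeDownScript /usr/local/sbin/ps_node_down
```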

Errors

No known errors.

See also

psid(8), psiadmin(1)