parastation.conf — the ParaStation MPI configuration file
Upon execution, the ParaStation MPI daemon psid(8) reads its configuration information from a configuration file which, by default, is /etc/parastation.conf. There are various parameters that can be modified persistently within this configuration file.
The main syntax of the configuration file is one parameter per line. For ease of use, some parameters, e.g. Nodes, are implemented in an environment mode. This mode allows multiple parameters to be set by a single command. Environment mode parameters may span more than one line.
Line continuation is possible. If the last character within a line before the newline character is a "\", the newline character will be ignored and the next line is appended to the current line.
Comments start with a "#". All remaining characters on the line are ignored. Keep in mind that line continuation also works within comments, i.e. if the last character of a comment line is a "\", the next line is ignored, too.
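As a minimal sketch of both rules, the following fragment uses only keywords described later in this document. The host name and ID are made up for illustration.

```
# Cluster nodes; this comment continues onto the next line \
because the line above ends with a backslash.
Node node01 0 HWType { ethernet p4sock } \
     starter yes runJobs yes
```

The first two lines form a single comment. The last two lines are joined into one Node statement before it is parsed.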
The parser used to analyze parastation.conf is not case sensitive. This means that all keywords within the configuration file may be written in any combination of upper- and lowercase characters. Within this document a mixed upper-/lowercase notation is used to provide more readable keywords. The same notation is used in the configuration file template parastation.conf.tmpl contained in the distributed ParaStation MPI system. The template file can be found in /opt/parastation/config.
The different parameters are discussed in the order they should appear within the configuration file. Dependencies between parameters - resulting in a defined order of parameters - are marked explicitly.
Some parameters may be modified using different keywords, e.g. both InstallDir and InstallationDir modify the directory where the ParaStation MPI daemon psid(8) expects the ParaStation MPI system installed. In case of different keywords modifying the same resource, all keywords are mentioned in front of the parameter's discussion.
Only a few parameters must always be declared in order to enable ParaStation MPI to run on a cluster. These parameters are HWType and Nodes.
If a parameter is declared more than once, the last declaration takes effect. Do not rely on this behavior as a feature, since it easily leads to configuration errors that are hard to spot.
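A minimal configuration along these lines might look as follows. This is only a sketch: the host names and IDs are hypothetical, and it assumes that the Hardware declaration for ethernet is taken unchanged from the distributed template and is not repeated here.

```
# Hardware declarations for the types used below must come first;
# they are assumed to be copied from parastation.conf.tmpl.

HWType ethernet

Nodes {
    node01 0
    node02 1
}
```

All other parameters keep their default values in this case.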
InstallDir inst-dir, InstallationDir inst-dir
Tell the ParaStation MPI daemon to find all the ParaStation MPI related files in inst-dir. The default is /opt/parastation.
Hardware name
Tell the ParaStation MPI daemon how to handle a distinct hardware. Usually it is not necessary to edit these entries, since the template version of the configuration file contains up to date entries of all supported hardware types. Furthermore a deeper insight into the low-level functionality of ParaStation MPI is needed in order to create such an entry.
Nevertheless a brief overview on the structure of the Hardware entries is given here.
The following five types of parameters within the Hardware environment will get a special handling from the ParaStation MPI daemon psid(8). These define different script files called in order to execute various operations towards the corresponding communication hardware.
All these entries have the form of the parameter's name followed by the corresponding value. The value might be enclosed by single or double quotes in order to allow a space within.
The values are interpreted as absolute or relative paths. Relative paths will be looked up relative to InstallDir. If one or more of the scripts are not defined, no corresponding action will take place for this hardware.
startscript
Define a script called in order to startup the corresponding communication hardware. This script will be executed when the daemon starts up or after a reset of the communication hardware.
stopscript
Define a script called in order to shutdown the corresponding communication hardware. This script will be executed when the daemon exits or before a reset of the communication hardware.
setupscript
Define a script called in order to set special parameters on the corresponding communication hardware.
statusscript
Define a script called in order to get a status message from the corresponding communication hardware. This is mainly used in order to generate the lines shown by the status counter directive of the ParaStation MPI administration tool psiadmin(1).
headerscript
Define a script called in order to get a header line for the status message produced by the statusscript discussed above.
All further parameters defined within a Hardware section are interpreted as environment variables when calling the scripts defined above. Again these parameters have the form of the parameter's name, interpreted as the environment variable's name, followed by the corresponding value. The values may be single strings not containing whitespace characters, or they may be enclosed by single or double quotes.
The impact of the environment variables of course depends on the scripts themselves.
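As an illustration only, a Hardware entry could be laid out as below. The sketch assumes that the Hardware environment uses the same brace notation as the other environment-mode parameters. The hardware name myhw, the script paths and the variable MY_OPTION are hypothetical. The entries in the distributed template remain the authoritative reference.

```
Hardware myhw {
    # Script entries; relative paths are looked up as described above.
    startscript  "config/ps_myhw start"
    stopscript   "config/ps_myhw stop"
    statusscript "config/ps_myhw status"
    headerscript "config/ps_myhw header"
    # Any other parameter is passed to the scripts as an environment variable.
    MY_OPTION    "some value"
}
```

Since no setupscript is defined in this sketch, no setup action would take place for this hardware.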
Various hardware types are defined within the template configuration file coming with the ParaStation MPI software distribution. These hardware types, the corresponding scripts and the environment variables the scripts understand are briefly discussed within the following lines.
Shared memory will be used as hardware type for communication within an SMP node. As there are no options for this kind of hardware, no dedicated section is provided.
ethernet
Use classical TCP/IP communication over Ethernet via an optimized MPI implementation.
Since TCP/IP has to be configured before ParaStation MPI starts up, the corresponding script ps_ethernet has almost nothing to do and hence does not understand a single environment variable.
p4sock
Use optimized communication via (Gigabit) Ethernet.
The script handling this hardware type, ps_p4sock, is also located in the config subdirectory. It understands the following two environment variables:
PS_TCP
If set to an address range, e.g. 192.168.10.0-192.168.10.128, the TCP bypass feature of the p4sock protocol is enabled for the given address range.
openib
Use the OpenFabrics verbs layer for communication over InfiniBand.
No script is currently implemented for this communication protocol, therefore no environment variables are recognized.
mvapi
Use the Mellanox verbs layer for communication over InfiniBand.
No script is currently implemented for this communication protocol, therefore no environment variables are recognized.
gm
Use communication over GM (Myrinet).
The script ps_gm will load the Myrinet gm driver.
PS_IPENABLED
If set to 1, the IP device myri0 is enabled after loading.
elan
Use communication over QsNet (libelan).
No script is currently implemented for this communication protocol, therefore no environment variables are recognized.
This communication layer is currently not supported by the ParaStation MPI communication library, therefore only programs linked with the QsNet MPI will work.
ipath
Use communication over InfiniPath.
No script is currently implemented for this communication protocol, therefore no environment variables are recognized.
This communication layer is currently not supported by the ParaStation MPI communication library, therefore only programs linked with the InfiniPath MPI will work.
dapl
Use communication over a generic DAPL layer.
No script is currently implemented for this communication protocol, therefore no environment variables are recognized.
accounter
This is actually a pseudo communication layer. It is only used for configuring nodes running the ParaStation MPI accounting daemon and should be used only in a particular Nodes entry.
NrOfNodes num
This configuration parameter is no longer required and will be silently ignored.
HWType { ethernet | p4sock | openib | mvapi | gm | elan | dapl | none }
HWType { { ethernet | p4sock | openib | mvapi | gm | elan | dapl | none }... }
Define the default communication hardware available on the nodes of the ParaStation MPI cluster. This may be overruled by an explicit HWType option in a Node statement.
The hardware types used within this command must be defined in preceding Hardware declarations.
Further hardware declarations might be defined by the user, but this is pretty much undocumented.
It is possible to enable more than one hardware type, either as default or on a per node basis.
The default value of HWType is none.
starter { true | yes | 1 | false | no | 0 }
If the argument is one of yes, true or 1, all nodes declared within a Node statement are allowed to start parallel tasks, unless otherwise stated.
If the argument is one of no, false or 0, starting is not allowed.
It might be useful to prohibit the startup of parallel tasks from the frontend machine if a batch system is used. This will force all users to use the batch system in order to start their tasks. Otherwise it would be possible to circumvent the batch system by starting parallel tasks directly from the frontend machine.
The default is to allow the starting of parallel tasks from all nodes.
runJobs { true | yes | 1 | false | no | 0 }
If the argument is one of yes, true or 1, all nodes declared within a Node statement are allowed to run processes of parallel tasks, unless otherwise stated.
If the argument is one of no, false or 0, ParaStation MPI will not start processes on these nodes.
It might be useful to prohibit the start of processes on a frontend machine since usually this machine is reserved for interactive work done by the users. If the execution of processes is forbidden on a distinct node, parallel tasks might be started from this node anyhow.
The default is to allow all nodes to run processes of parallel tasks.
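The interplay of starter and runJobs can be sketched for a small cluster with a dedicated frontend. This is a fragment, not a complete configuration: the host names and IDs are hypothetical, and the per-node overrides use the Node syntax described below.

```
# Defaults for all nodes: run processes, but do not start tasks.
starter no
runJobs yes

Nodes {
    # The frontend may start tasks but runs none of their processes.
    frontend 0 starter yes runJobs no
    node01   1
    node02   2
}
```

With these settings, parallel tasks can only be started from the frontend, and their processes only run on node01 and node02.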
Node[s] hostname id [HWType-entry] [starter-entry] [runJobs-entry] [env name value] [env { name value ... }]
Node[s] { {hostname id [HWType-entry] [starter-entry] [runJobs-entry] [env name value] [env { name value ... }] }... }
Node[s] $GENERATE from-to/step nodestr idstr [HWType-entry] [starter-entry] [runJobs-entry] [env name value] [env { name value ... }]
Define one or more nodes to be part of the ParaStation MPI cluster.
This is the first example of a parameter that supports the environment mode. This means there are two different notations for this parameter. The first one defines a single node; the second one registers more than one node within a single command. It is a convenient form that avoids repeating the keyword for every entry.
Each entry has to have at least two items, the hostname and the id. This will tell the ParaStation MPI system that the node called hostname will act as the physical node with ParaStation MPI ID id. hostname is either a resolvable hostname or an IP address in dot notation (e.g. 192.168.1.17). id is an integer number in the range from 0 to the maximum number of nodes minus one.
Further optional items such as HWType-entry, starter-entry or runJobs-entry may overrule the default values of the hardware type on the node, the ability to start parallel jobs from this node or the possibility to run processes on this node, respectively. These entries have the same syntax as the stand-alone commands that set the corresponding default value.
E.g. the line
Node node17 16 HWType { ethernet p4sock } starter yes runJobs no
will define the node node17 to have the ParaStation MPI ID 16. Furthermore it is expected to have Ethernet communication using both the TCP and the p4sock protocols. It is allowed to start parallel tasks from this node, but the node itself will not run any process of any parallel task (except the ParaStation MPI logger processes of the tasks started on this node).
The option environment or env allows per-node environment variables to be set. Using the first form, the variable name is set to value. Using the second form, more than one name/value pair may be given. More complex values may be given using quotation marks:
Node node17 16 environment LD_LIBRARY_PATH /mypath
Node node18 17 env { PSP_P4S "2" PSP_OPENIB "0" }
This example will define the variable LD_LIBRARY_PATH to /mypath for node node17 and the variables PSP_P4S and PSP_OPENIB to 2 and 0 for node node18.
The $GENERATE form allows a group of nodes to be defined at once using a simple syntax. Using the parameters from and to, a range may be defined, incremented by step. Each entry in this range may be referenced within nodestr and idstr using a syntax of $[{offset[,width[,base]]}].
E.g., the entry
$GENERATE 1-96 node${0,2} ${0}
defines the nodes node01 up to node96 using the IDs 1 - 96, respectively. More node-specific attributes may be defined as described above.
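To make the expansion concrete, here is a sketch of a smaller range with an explicit step. The expansion given in the comments is an assumption derived from the description above (step as the increment, width as zero-padding), not verified output of psid(8). The host names are hypothetical.

```
# Expected to be equivalent to:
#   Node node01 1 runJobs yes
#   Node node03 3 runJobs yes
#   Node node05 5 runJobs yes
#   Node node07 7 runJobs yes
Nodes $GENERATE 1-8/2 node${0,2} ${0} runJobs yes
```

Under this reading, the upper bound 8 is not reached because the next value after 7 would be 9.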
LicenseServer hostname, LicServer hostname
LicenseFile lic-file, LicFile lic-file
LicenseDeadInterval num, LicDeadInterval num
These entries are silently ignored by this version of ParaStation MPI.
SelectTime time
Set the timeout of the central select(2) of the ParaStation MPI daemon psid(8) to time seconds.
The default value is 2 seconds.
This parameter can be set during runtime via the set selecttime directive within the ParaStation MPI administration and management tool psiadmin(1).
DeadInterval num
The ParaStation MPI daemon psid(8) will declare other daemons as dead after num consecutively missing multicast pings.
After declaring a node as dead, all processes residing on this node are also declared dead. This results in sending signals to all processes on the local node that have requested to get informed about the death of one of these processes.
The default value is 10.
For now, the multicast period is set to two seconds, i.e. every daemon sends a multicast ping every two seconds. This results in declaring a daemon as dead after 20 seconds for the default value.
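The relation between DeadInterval and the detection delay is simple arithmetic: the delay is num times the multicast period. As a sketch, assuming the fixed two-second period stated above, halving the default looks like this:

```
# 5 missed pings x 2 s multicast period = 10 s until a daemon is declared dead
DeadInterval 5
```

A smaller value detects failed nodes faster, at the price of declaring a merely slow or briefly unreachable daemon dead more readily.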
LogLevel num
Set the debugging level of the ParaStation MPI daemon psid(8) to num.
For values of num larger than 10 the daemon writes a huge number of messages to the logging destination, which is usually syslog(3).
This parameter can be set during runtime via the set psiddebug directive within the ParaStation MPI administration and management tool psiadmin(1).
LogDest { LOG_DAEMON | LOG_KERN | LOG_LOCAL[0-7] }
LogDestination { LOG_DAEMON | LOG_KERN | LOG_LOCAL[0-7] }
Set the logging output's destination for the ParaStation MPI daemon psid(8). Usually the daemon prints logging output using the syslog(3) mechanism, unless an alternative logging file is requested via psid(8)'s -l option.
In order to collect all the ParaStation MPI specific log messages into a special file, the facility argument of the openlog(3) function call in cooperation with a suitable setup of the syslogd(8) may be used. This parameter will set the argument to one of the mentioned values.
The default value is LOG_DAEMON.
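For example, ParaStation MPI messages could be routed to a dedicated facility and then separated by the syslog daemon. The syslog rule in the comment uses classic syslog.conf selector syntax and a hypothetical log file name; adapt it to the syslog implementation actually in use.

```
# parastation.conf: log via facility local0 instead of the default LOG_DAEMON
LogDestination LOG_LOCAL0

# Matching rule for the syslog daemon's own configuration (not part of
# parastation.conf), e.g. in /etc/syslog.conf:
#   local0.*    /var/log/parastation.log
```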
MCastGroup group-num
Tell psid(8) to use the multicast group group-num for multicast communication to other daemons.
The default group to use is 237.
MCastPort portno
Tell psid(8) to use the UDP port portno for multicast communication to other daemons.
The default port to use is 1889.
RDPPort portno
Tell psid(8) to use the UDP port portno for the RDP communication protocol to other daemons.
The default port to use is 886.
RLimit { Core size | CPUTime time | DataSize size | MemLock size | StackSize size | RSSize size | NoFile num }
RLimit { { Core size | CPUTime time | DataSize size | MemLock size | StackSize size | RSSize size | NoFile num }... }
Set various resource limits for psid(8) and thus for all processes started from it.
All limits are set using the setrlimit(2) system call. For a detailed description of the different types of limits please refer to the corresponding manual page.
If no RLimits are set within the ParaStation MPI configuration file, no changes are made to the system's default values.
The following (soft) resource limits may be set:
Core size
Set the maximum size of a core-file to size kilobytes. size is an integer number, the string “infinity” or the string “unlimited”. In the two latter cases the core-file size is set to RLIM_INFINITY.
Starting with version 5.0.3, this configuration will also control the writing of core-files for the psid itself, in case a catastrophic failure occurs.
CPUTime time
Set the maximum CPU time that might be consumed by the daemon to time seconds. time has to be an integer number, the string “infinity” or the string “unlimited”. In the two latter cases the CPU time is set to RLIM_INFINITY.
DataSize size
Set the maximum data size to size kilobytes. size is an integer number, the string “infinity” or the string “unlimited”. In the two latter cases the data size is set to RLIM_INFINITY.
MemLock size
Set the maximum amount of memory that might be locked into RAM to size kilobytes. size is an integer number, the string “infinity” or the string “unlimited”. In the two latter cases the amount of lockable memory is set to RLIM_INFINITY.
StackSize size
Set the maximum stack size to size kilobytes. size is an integer number, the string “infinity” or the string “unlimited”. In the two latter cases the stack size is set to RLIM_INFINITY.
RSSize size
Set the maximum Resident Set Size (RSS) to size pages. size is an integer number, the string “infinity” or the string “unlimited”. In the two latter cases the RSS is set to RLIM_INFINITY.
NoFile num
Set the maximum number of open files to num. Be aware of the fact that inherited limits are confined by psid's hard limits.
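A sketch of the environment form of RLimit, combining several of the limits described above. The numeric values are purely illustrative, not recommendations.

```
RLimit {
    # Allow core-files of any size (RLIM_INFINITY).
    Core      unlimited
    # Stack size is given in kilobytes: 16384 KB = 16 MB.
    StackSize 16384
    # At most 4096 open files per process.
    NoFile    4096
}
```

Limits that are not mentioned, such as CPUTime or DataSize here, are left at the system's values.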
Env | Environment name value
Env | Environment { {name value}... }
Set environment variables for the ParaStation MPI daemon psid(8) and any application started via this daemon.
This command again has two different forms. The first form sets exactly one variable; the environment form sets as many variables as desired, generally one variable per line.
The value part of each line either is a single word or an expression enclosed by single or double quotes. The expression might contain whitespace characters. If the expression is enclosed by single quotes, it is allowed to use balanced or unbalanced double quotes within this expression and vice versa.
This command might be used for example in order to set the PSP_NETWORK environment variable globally, without requiring every user to adjust this parameter in their own environment.
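Both forms and the quoting rules can be sketched as follows. PSP_P4S and PSP_OPENIB are the variables used in the Node example earlier in this document; MY_APP_OPTS and its value are hypothetical.

```
# Single-variable form
Env PSP_P4S "2"

# Environment form, one variable per line
Environment {
    PSP_OPENIB  "0"
    # Single quotes allow double quotes and whitespace inside the value.
    MY_APP_OPTS 'verbose log="/tmp/my app.log"'
}
```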
freeOnSuspend { true | yes | 1 | false | no | 0 }
If the argument is one of yes, true or 1, suspending a task by sending the signal SIGTSTP to the logger will cause all resources (CPUs) currently claimed by this task to be treated as free.
If the argument is one of no, false or 0, ParaStation MPI will not treat the resources as free after SIGTSTP has been sent.
handleOldBins { true | yes | 1 | false | no | 0 }
If the argument is one of yes, true or 1, compatibility mode for applications linked with ParaStation MPI version 4.0 up to 4.0.6 will be enabled. Keep in mind that this behavior might collide with the freeOnSuspend feature.
If the argument is one of no, false or 0, ParaStation MPI will disable compatibility mode.
UseMCast { true | yes | 1 | false | no | 0 }
If the argument is one of yes, true or 1, keep-alive messages from the ParaStation MPI daemon psid(8) are sent using multicast messages.
If the argument is one of no, false or 0, ParaStation MPI will use its own RDP protocol for keep-alive messages. This is the default.
PSINodesSort { PROC | LOAD_1 | LOAD_5 | LOAD_15 | PROC+LOAD | NONE }
Define the default sorting strategy for nodes when attaching them to a partition. The different possible values have the following meaning:
PROC
Sort by the number of processes managed by ParaStation MPI on the corresponding nodes
LOAD_1
Sort by the load average during the last minute on the corresponding nodes
LOAD_5
Sort by the load average during the last 5 minutes on the corresponding nodes
LOAD_15
Sort by the load average during the last 15 minutes on the corresponding nodes
PROC+LOAD
Sort conforming to the sum of the processes managed by ParaStation MPI and the load average during the last minute on the corresponding nodes
NONE
Do not sort at all.
This only comes into play if the user does not define a sorting strategy explicitly via PSI_NODES_SORT. Be aware of the fact that using a batch system like PBS or LSF *will* set the strategy explicitly, namely to NONE.
overbook { true | yes | 1 | false | no | 0 }
If the argument is one of yes, true or 1, all nodes may be overbooked by the user using the PSI_OVERBOOK environment variable.
If the argument is one of no, false or 0, ParaStation MPI will deny overbooking of the nodes, even if PSI_OVERBOOK is set.
It might be useful to prohibit the start of processes on a frontend machine since usually this machine is reserved for interactive work done by the users. When the execution of processes is forbidden on a distinct node, parallel tasks might be started from this node anyhow.
The default is to allow all nodes to run processes of parallel tasks.
processes maxprocs
Define the maximum number of processes per node.
This parameter can be set during runtime via the set maxproc directive within the ParaStation MPI administration and management tool psiadmin(1).
pinProcs { true | yes | 1 | false | no | 0 }
Enables or disables process pinning for compute tasks. If enabled, tasks will be pinned down to particular CPU-slots. The mapping between those CPU-slots and physical CPUs and cores is made using a mapping list. See CPUmap below.
The pinProcs parameter can be set during runtime via the set pinprocs directive within the ParaStation MPI administration and management tool psiadmin(1).
bindMem { true | yes | 1 | false | no | 0 }
This parameter must be set to true if nodes providing non-uniform memory access (NUMA) should use 'local' memory for the tasks.
This parameter can be set during runtime via the set bindmem directive within the ParaStation MPI administration and management tool psiadmin(1).
CPUmap { map }
Set the map used to assign CPU-slots to physical cores to map. map is a quoted string containing a space-separated permutation of the numbers 0 to Ncore-1. Here Ncore is the number of physical cores available on this node. The number of cores within a distinct node may be determined via list hw. The first number in map is the number of the physical core the first CPU-slot will be mapped to, and so on.
This parameter can be set during runtime via the set bindmem directive within the ParaStation MPI administration and management tool psiadmin(1).
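A sketch for a hypothetical node with four physical cores may help to read such a map. The map is written here as a plain list inside the braces, following the "CPUmap { list of ids }" synopsis further below. Whether additional quoting is required in the configuration file is not settled by this text, so treat the exact notation as an assumption and compare it with the distributed template.

```
pinProcs yes

# CPU-slot:       0 1 2 3
# physical core:  0 2 1 3
CPUmap { 0 2 1 3 }
```

With this map the first task on the node would be pinned to core 0, the second to core 2, the third to core 1 and the fourth to core 3.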
supplGrps { true | yes | 1 | false | no | 0 }
This parameter must be set to true if processes spawned by ParaStation MPI should belong to all groups defined for this user. Otherwise, they will only belong to the primary group.
This parameter can be set during runtime via the set supplementaryGroups directive within the ParaStation MPI administration and management tool psiadmin(1).
rdpMaxRetrans number
Set the maximum number of retransmissions within the RDP facility. If more than this number of retransmissions would be necessary to deliver a packet to the remote destination, the connection is declared to be down.
See also psiadmin(1).
statusBroadcasts number
Set the maximum number of status broadcasts per round. This is used to limit the number of status-broadcasts per status-iteration. Too many broadcasts might lead to running out of message-buffers within RDP on huge clusters.
If more than this number of broadcasts are triggered during one status-iteration, all further broadcasts within this iteration will be ignored. The corresponding counter is reset upon start of the next status-iteration.
A value of 0 will completely suppress sending of status-broadcasts. In this case information on dead nodes will only be propagated by sending ACTIVENODES messages upon receipt of too many wrong LOAD messages.
Only relevant, if MCast is *not* used.
See also psiadmin(1).
rdpTimeout ms
The timeout of the actual timer registered by RDP in milliseconds. Each time the corresponding timer elapses, handleTimeoutRDP() is called, handling all necessary resend activities. This parameter steers the actual load introduced by RDP. Within the daemon, there is a lower limit for all timeout-timers of 100 msec. Thus, the minimal value here is 100, too.
deadLimit number
Dead-limit of the RDP status module. After this number of consecutively missing RDP-pings the master declares the node to be dead.
Only relevant, if MCast is *not* used.
statusTimeout ms
Timeout of the RDP status module. After this number of milliseconds an RDP-ping is sent to the master daemon. Additionally, the master daemon checks for received ping-messages. Within the daemon, there is a lower limit for all timeout-timers of 100 msec. Thus, the minimal value here is 100, too.
Only relevant, if MCast is *not* used.
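By analogy with DeadInterval, the time until the master declares a silent node dead can be estimated as deadLimit times statusTimeout. This estimate is an inference from the two descriptions above, and the values below are illustrative, not recommended settings or documented defaults.

```
# Keep-alive handling via RDP instead of multicast
UseMCast no

# One RDP-ping every 2000 ms; 10 consecutively missing pings mean
# roughly 10 x 2 s = 20 s until a silent node is declared dead.
statusTimeout 2000
deadLimit 10
```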
rdpClosedTimeout ms
The closed timeout within the RDP facility in milliseconds. If an RDP-connection is closed, all messages from the corresponding partner are ignored during this timeout. Thus, reconnection is avoided during this period. This helps handling packets still on the wire on connection close.
rdpResendTimeout ms
The resend timeout within the RDP facility in milliseconds. If a pending message is available and not yet acknowledged, this is the timeout after which the message is retransmitted to the remote host.
rdpMaxACKPend number
The maximum number of pending ACKs within the RDP facility. If this number of packets is received from a remote node consecutively without any retransmission, an explicit ACK is sent. Otherwise the ACK is sent piggyback within the next regular packet to this node or as soon as a retransmission occurred.
If set to 1, each RDP packet received is acknowledged by an explicit ACK.
pinProcs { true | yes | 1 | false | no | 0 }
Enable pinning of processes to distinct processor cores.
bindMem { true | yes | 1 | false | no | 0 }
Enable binding of processes to distinct memory nodes on NUMA systems.
CPUmap { list of ids }
Define map assigning logical process-slots to physical processor cores.
allowUserMap { true | yes | 1 | false | no | 0 }
Enable users to influence local mapping of processes by providing a __PSI_CPUMAP environment variable.
startupScript script
This script is called during startup of the ParaStation daemon. It is either an absolute path or relative to InstallDir. Parsing the configuration file will fail if the script is not found.
Depending on its return value, the daemon continues startup without any action (0), some output of the script is written to the daemon's log (-1) or the daemon stops immediately (-2).
As a default no script is defined and, thus, nothing is called.
maxStatTry num
Maximum number of tries to stat() an executable before spawning new processes. Increasing this number might help on overloaded NFS-servers.
RDPstatistics { true | yes | 1 | false | no | 0 }
Flag the RDP statistics. If set to 1, statistics on total number of messages sent and mean-time to ACK are determined per connection.
nodeUpScript script
This script is called by the currently elected master of the ParaStation daemons every time a node becomes active from ParaStation's point of view. I.e. the node's daemon connects to the master daemon for the first time after being down.
This might be used to pass this type of information into a batch-system or some monitoring facility like the GridMonitor.
While calling the script two additional arguments are passed within the environment:
The hostname of the node that appeared. The actual value is the IP address implicitly defined in the Nodes-section of this file resolved to the official name of the host using gethostbyaddr().
The ParaStation MPI ID of the node that appeared.
As a default no script is defined and, thus, nothing is called.
nodeDownScript script
This script is called by the currently elected master of the ParaStation daemons every time a node goes down from ParaStation's point of view. I.e. the node's daemon disconnects from the master daemon after being connected.
This might be used to pass this type of information into a batch-system or some monitoring facility like the GridMonitor.
While calling the script two additional arguments are passed within the environment:
The hostname of the node that went down. The actual value is the IP address implicitly defined in the Nodes-section of this file resolved to the official name of the host using gethostbyaddr().
The ParaStation MPI ID of the node that went down.
As a default no script is defined and, thus, nothing is called.