ParaStation MPI run-time tuning parameters

This section describes the environment variables that can be used to modify the behavior of ParaStation MPI at run-time. It does not cover the environment variables that influence the spawning of processes and the placement of jobs; those are described in Chapter 3 and the ps_environment(7) manual page.

General scaling tuning variables

A couple of variables are available to tune ParaStation MPI for large systems and jobs.

PSP_TCP_BACKLOG

This variable defines the maximum backlog passed to the listen() call. For pscom versions below 5.0.34, it should be set to the number of cores within the system. For newer pscom versions, the default should be sufficient for all current systems.

The backlog actually used by listen() is additionally capped by the kernel at the value of net.core.somaxconn, which should therefore also be set by the administrator to at least the number of cores within the system. See sysctl(8) for details.
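
For example, on a hypothetical 4096-core system running an older pscom version, the backlog and the kernel limit might be raised as follows; the value 4096 is purely illustrative and should be replaced by the actual core count:

    PSP_TCP_BACKLOG=4096
    sysctl -w net.core.somaxconn=4096

The first line belongs in the environment of the MPI job, the second is executed once by the administrator (as root) on each node.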

PMI_BARRIER_TMOUT, PMI_BARRIER_ROUND

These variables define the initial and total timeout for the PMI barrier during job startup. See ps_environment(5) for details.

The default values of both variables should be sufficient, even for very large systems. If corresponding warnings appear during job startup, the values may be increased to work around network problems.
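
As an illustration, both timeouts could be raised via the job environment. The values below are placeholders only; the exact units and defaults are documented in ps_environment(5):

    PMI_BARRIER_TMOUT=120
    PMI_BARRIER_ROUND=60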

InfiniBand specific tuning variables

Since reliable InfiniBand connections require a significant amount of memory per connection, this communication path does not scale well with the number of connections. The memory footprint on each node is impacted significantly because every MPI process in the job has to store N connection contexts in memory and allocate N sets of send/receive buffers, where N is the total number of MPI ranks of that user job.

A better understanding of the memory footprint issue can be gained from a real-life example. Consider a system in which a single user is permitted to submit a single MPI job spanning 512 compute nodes, each with 24 GB of available memory. If each of those compute nodes has 8 cores (two quad-core sockets), this allows 4096 single-threaded MPI tasks within a single user job.

By default, each InfiniBand connection in ParaStation MPI consumes 0.55 MB of memory for context information and send/receive queue buffering. As such, each core (or MPI rank) would need to reserve 4096 * 0.55 MB of memory, i.e. roughly 2.2 GB occupied just by the MPI layer per core. However, since only 3 GB of memory is available per core, this memory footprint is unacceptable. The next section details environment variables that can alleviate this problem.
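
The arithmetic behind these figures, using the numbers of the example above:

    connections per rank:       ~4096 (one per remote rank)
    memory per connection:       0.55 MB
    memory per rank:             4096 * 0.55 MB ≈ 2.2 GB
    memory available per core:   24 GB / 8 cores = 3 GB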

MPI tuning for connected service types - addressing scalability problems with InfiniBand

By default, ParaStation MPI uses the Reliable Connected (RC) InfiniBand service and assumes that the majority of jobs include some all-to-all operations. Therefore, all connections are established inside MPI_Init() to be prepared for this type of communication. Unfortunately, this has the scalability drawback described in the previous section.

However, informed users who have an understanding of their application's communication requirements can influence the connection behavior of the InfiniBand fabric to achieve scalability improvements. In particular, the memory footprint left by the implementation of connected service types can be significantly reduced using parameters detailed below.

PSP_ONDEMAND=[0 | 1]

The command line argument mpiexec -ondemand has the same effect as setting PSP_ONDEMAND=1.

Users who have examined their application code and have determined that it doesn't use all-to-all communication patterns can use "on-demand" connections. Using this setting, connections are only established when they are required by the application. If two ranks don't communicate with each other directly, then no connection will be generated. This mechanism will save host memory by not reserving QP buffers between ranks that never communicate but are nonetheless part of the same job.
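
For example, on-demand connections can be enabled either via the environment or on the command line; the process count and program name below are placeholders:

    PSP_ONDEMAND=1 mpiexec -np 4096 ./my_app
    mpiexec -ondemand -np 4096 ./my_app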

Note

If your application is known to use all-to-all communication patterns, the PSP_ONDEMAND variable will NOT reduce memory usage and should not be used!

Note

Incorrect use of this flag can result in application failure due to out-of-memory errors. This may arise when queue pairs for late all-to-all communication are established after the application itself has already consumed much of the host memory. Users are therefore advised to use this environment variable with caution.

If your application uses only the following communication patterns:

  • Nearest neighbor communication

  • Scatter/Gather, All-Reduce and Alltoall collectives based on binary trees

then it is appropriate to set PSP_ONDEMAND=1, since the connections established on demand are limited to those actually needed. A minimal nearest-neighbor example is sketched below.
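
The following sketch shows a pure nearest-neighbor pattern for which on-demand connections are safe. It is generic MPI code, not ParaStation-specific; with PSP_ONDEMAND=1 each rank establishes connections only to its two neighbors instead of to all other ranks:

    /* Sketch: 1-D nearest-neighbor halo exchange.  Only adjacent ranks
     * ever communicate, so on-demand connections keep the per-rank
     * connection count (and QP memory) constant, independent of job size. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Non-periodic 1-D Cartesian topology over all ranks. */
        int dims[1] = { size }, periods[1] = { 0 };
        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &cart);

        /* Ranks at the ends get MPI_PROC_NULL as neighbor. */
        int left, right;
        MPI_Cart_shift(cart, 0, 1, &left, &right);

        double send = (double)rank, from_left = -1.0, from_right = -1.0;

        /* Exchange a single value with both neighbors. */
        MPI_Sendrecv(&send, 1, MPI_DOUBLE, right, 0,
                     &from_left, 1, MPI_DOUBLE, left, 0,
                     cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&send, 1, MPI_DOUBLE, left, 1,
                     &from_right, 1, MPI_DOUBLE, right, 1,
                     cart, MPI_STATUS_IGNORE);

        printf("rank %d: left=%g right=%g\n", rank, from_left, from_right);

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }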

In applications where all-to-all communication is required, there are still some tuning parameters available to improve scaling behavior. Again, these parameters focus on the reduction of the memory footprint caused by excessive numbers of private send/receive message buffers.

PSP_OPENIB_SENDQ_SIZE=num / PSP_OPENIB_RECVQ_SIZE=num

By default, each send and receive queue (one of each per connection) allocates sixteen 16 kB buffers. The size of a buffer itself is not tunable (fixed at 16 kB), but the number of buffers per connection (for the send and receive queue, respectively) is a tunable parameter.

Examples of setting the number of send/receive buffers per connection are shown below:

PSP_OPENIB_SENDQ_SIZE=3

PSP_OPENIB_RECVQ_SIZE=3

Using the above parameters, the size of the send/receive queue buffers is reduced from 0.55 MB/connection (when both values are set to the default of 16) to 0.14 MB/connection. Reducing either of the above parameters to a value of less than 3 will cause deadlocks.
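
The savings can be estimated from the buffer counts alone; the per-connection context overhead accounts for the remainder of the quoted figures:

    default:  2 queues * 16 buffers * 16 kB = 512 kB ≈ 0.5 MB per connection
    reduced:  2 queues *  3 buffers * 16 kB =  96 kB ≈ 0.1 MB per connection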

Note

Users should also be aware that reducing the number of buffers will affect the message issue rate and the overall message throughput of their application. Users are encouraged to test various send/receive queue sizes in order to determine, by experiment, which values produce the best all-round performance.