fi_psm2 - The PSM2 Fabric Provider
The psm2 provider runs over the PSM 2.x interface that is
supported by the Intel Omni-Path Fabric. PSM 2.x has all the PSM 1.x
features plus a set of new functions with enhanced capabilities. Since PSM
1.x and PSM 2.x are not ABI compatible, the psm2 provider only works
with PSM 2.x and doesn't support Intel TrueScale Fabric.
The psm2 provider doesn't support all the features defined
in the libfabric API. Here are some of the limitations:
- Endpoint types
- Only the non-connection-based endpoint types FI_DGRAM and FI_RDM are
supported.
- Endpoint capabilities
- Endpoints can support any combination of data transfer capabilities
FI_TAGGED, FI_MSG, FI_ATOMICS, and FI_RMA.
These capabilities can be further refined by FI_SEND,
FI_RECV, FI_READ, FI_WRITE, FI_REMOTE_READ,
and FI_REMOTE_WRITE to limit the direction of operations.
FI_MULTI_RECV is supported for the non-tagged message queue
only.
Scalable endpoints are supported if the underlying PSM2 library
supports multiple endpoints. This condition must be satisfied both when the
provider is built and when the provider is used. See the Scalable
endpoints section for more information.
Other supported capabilities include FI_TRIGGER,
FI_REMOTE_CQ_DATA, FI_RMA_EVENT, FI_SOURCE, and
FI_SOURCE_ERR. Furthermore, FI_NAMED_RX_CTX is supported when
scalable endpoints are enabled.
- Modes
- FI_CONTEXT is required for the FI_TAGGED and FI_MSG
capabilities. That means any request belonging to these two categories
that generates a completion must pass as the operation context a valid
pointer to type struct fi_context, and the space referenced by the
pointer must remain untouched until the request has completed. If neither
FI_TAGGED nor FI_MSG is requested, the FI_CONTEXT
mode is not required.
- Progress
- The psm2 provider requires manual progress. The application is
expected to call the fi_cq_read or fi_cntr_read function from
time to time when no other libfabric function is called, to ensure progress
is made in a timely manner. The provider also supports auto progress mode.
However, performance can be significantly impacted if the application
relies purely on the provider to make auto progress.
- Scalable endpoints
- Scalable endpoint support depends on the multi-EP feature of the
PSM2 library. If the PSM2 library supports this feature, the
availability is further controlled by an environment variable
PSM2_MULTI_EP. The psm2 provider automatically sets this
variable to 1 if it is not set. The feature can be disabled explicitly by
setting PSM2_MULTI_EP to 0.
When creating a scalable endpoint, the exact number of contexts
requested should be set in the "fi_info" structure passed to the
fi_scalable_ep function. This number should be set in
"fi_info->ep_attr->tx_ctx_cnt" or
"fi_info->ep_attr->rx_ctx_cnt" or both; if both are set, the greater value is
used. The psm2 provider allocates all requested contexts up front when
the scalable endpoint is created. The same context is used for both Tx and
Rx.
For optimal performance, it is advised to avoid having multiple
threads accessing the same context, either directly by posting
send/recv/read/write requests, or indirectly by polling the associated completion
queues or counters.
Using the scalable endpoint as a whole in communication functions
is not supported. Instead, an individual tx context or rx context of the
scalable endpoint should be used. Similarly, using the address of the
scalable endpoint as the source address or destination address doesn't
collectively address all the tx/rx contexts. Instead, it addresses only the
first tx/rx context.
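The "set to 1 if it is not set" behavior described above for PSM2_MULTI_EP
matches what POSIX setenv does when its overwrite argument is 0. The
following is a minimal sketch of that pattern, not the provider's actual code.
The helper names set_env_default and multi_ep_enabled are hypothetical, and
treating any non-zero integer value as "enabled" is an assumption.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>

/* Set NAME to VALUE only when NAME is not already in the environment.
 * Returns 0 on success and -1 on failure, as setenv does. */
int set_env_default(const char *name, const char *value)
{
	assert(name != NULL && value != NULL);
	return setenv(name, value, 0 /* do not overwrite an existing value */);
}

/* Hypothetical helper: returns 1 when PSM2_MULTI_EP is set to a non-zero
 * integer, 0 when it is unset or set to 0 (assumed interpretation). */
int multi_ep_enabled(void)
{
	const char *v = getenv("PSM2_MULTI_EP");

	return v != NULL && atoi(v) != 0;
}
```

An application that wants to disable the feature would set PSM2_MULTI_EP to 0
itself before the provider is initialized; a later call to
set_env_default("PSM2_MULTI_EP", "1") then leaves that value untouched.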
- Shared Tx contexts
- For a shared Tx context to actually save PSM contexts, the endpoints
bound to the shared Tx context need to be Tx only.
The reason is that Rx capability always requires a PSM context, which can
also be automatically used for Tx. As a result, allocating a shared Tx
context for Rx capable endpoints actually consumes one extra context
instead of saving any.
- Unsupported features
- These features are unsupported: connection management, passive endpoint,
and shared receive context.
The psm2 provider checks for the following environment
variables:
- FI_PSM2_UUID
- PSM requires that each job has a unique ID (UUID). All the processes in
the same job need to use the same UUID in order to be able to talk to each
other. The PSM reference manual advises keeping the UUID unique to each job.
In practice, it generally works fine to reuse a UUID as long as (1) no two
jobs with the same UUID are running at the same time; and (2) previous
jobs with the same UUID have exited normally. If you run into
"resource busy" or "connection failure" issues for no
known reason, it is advisable to manually set the UUID to a value
different from the default.
The default UUID is 00FF00FF-0000-0000-0000-00FF0F0F00FF.
- FI_PSM2_NAME_SERVER
- The psm2 provider has a simple built-in name server that can be
used to resolve an IP address or host name into a transport address needed
by the fi_av_insert call. The main purpose of this name server is
to allow simple client-server type applications (such as those in
fabtests) to be written purely with libfabric, without using any
out-of-band communication mechanism. For such applications, the server
would run first to allow endpoints to be created and registered with the name
server, and then the client would call fi_getinfo with the
node parameter set to the IP address or host name of the server.
The resulting fi_info structure would have the transport address of
the endpoint created by the server in the dest_addr field.
Optionally the service parameter can be used in addition to
node. Notice that the service number is interpreted by the
provider and is not a TCP/IP port number.
The name server is on by default. It can be turned off by setting
the variable to 0. This may save a small amount of resources since a separate
thread is created when the name server is on.
The provider detects OpenMPI and MPICH runs and changes the
default setting to off.
- FI_PSM2_TAGGED_RMA
- The RMA functions are implemented on top of the PSM Active Message
functions. The Active Message functions have a limit on the size of data
that can be transferred in a single message. Large transfers can be divided
into small chunks and pipelined. However, the bandwidth achieved this way
is sub-optimal.
The psm2 provider uses the PSM tag-matching message queue
functions to achieve higher bandwidth for large RMA transfers. It takes advantage
of the extra tag bits available in PSM2 to separate the RMA traffic from the
regular tagged message queue.
The option is on by default. To turn it off set the variable to
0.
- FI_PSM2_DELAY
- Time (seconds) to sleep before closing PSM endpoints. This is a workaround
for a bug in some versions of the PSM library.
The default setting is 0.
- FI_PSM2_TIMEOUT
- Timeout (seconds) for gracefully closing PSM endpoints. A forced close
is issued if the timeout expires.
The default setting is 5.
- FI_PSM2_CONN_TIMEOUT
- Timeout (seconds) for establishing a connection between two PSM
endpoints.
The default setting is 5.
- FI_PSM2_PROG_INTERVAL
- When auto progress is enabled (requested via the hints passed to fi_getinfo),
a progress thread is created to make progress calls from time to time.
This option sets the interval (microseconds) between progress calls.
The default setting is 1 if affinity is set, or 1000 if not. See
FI_PSM2_PROG_AFFINITY.
- FI_PSM2_PROG_AFFINITY
- When set, specifies the set of CPU cores that the progress thread is bound
to. The format is
<start>[:<end>[:<stride>]][,<start>[:<end>[:<stride>]]]*,
where each triplet <start>:<end>:<stride>
defines a block of core_ids. Both <start> and <end> can
be either the core_id (when >=0) or
core_id - num_cores (when <0).
By default affinity is not set.
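The affinity format above can be illustrated with a small parser. This is a
sketch under stated assumptions, not the provider's implementation: the
function name expand_affinity is hypothetical, a missing <end> is assumed to
default to <start>, a missing or non-positive <stride> is assumed to default
to 1, and core ids outside the valid range are skipped.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Expand an affinity string into core ids.
 * Writes at most max_out ids to out[] and returns how many were written.
 * Assumed defaults: <end> = <start>, <stride> = 1.
 */
int expand_affinity(const char *spec, int num_cores, int *out, int max_out)
{
	int count = 0;
	const char *triplet = spec;

	assert(spec != NULL && out != NULL && num_cores > 0);

	while (triplet != NULL && *triplet != '\0') {
		int start, end, stride;
		int n = sscanf(triplet, "%d:%d:%d", &start, &end, &stride);

		if (n >= 1) {
			if (n < 2)
				end = start;
			if (n < 3 || stride < 1)
				stride = 1;
			/* Negative values mean core_id - num_cores. */
			if (start < 0)
				start += num_cores;
			if (end < 0)
				end += num_cores;

			for (int i = start; i <= end; i += stride) {
				if (i >= num_cores)
					break;
				if (i < 0)
					continue;
				if (count < max_out)
					out[count++] = i;
			}
		}

		/* Advance to the next comma-separated triplet, if any. */
		triplet = strchr(triplet, ',');
		if (triplet != NULL)
			triplet++;
	}
	return count;
}
```

With 16 cores, the string "0:3:2,8,-2:-1" selects cores 0 and 2 from the
first triplet, core 8 from the second, and cores 14 and 15 from the third.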
- FI_PSM2_INJECT_SIZE
- Maximum message size allowed for fi_inject and fi_tinject calls. This is
an experimental feature that allows some applications to override the default
inject size limit. When the inject size is larger than the default
value, some inject calls might block.
The default setting is 64.
- FI_PSM2_LOCK_LEVEL
- When set, dictates the level of locking used by the provider. Level 2
means all locks are enabled. Level 1 disables some locks and is suitable
for runs that limit the access to each PSM2 context to a single thread.
Level 0 disables all locks and thus is only suitable for single-threaded
runs.
To use level 0 or level 1, wait objects and auto progress mode
cannot be used because they introduce internal threads that may break the
conditions needed for these levels.
The default setting is 2.
- FI_PSM2_LAZY_CONN
- There are two strategies for when to establish connections between the PSM2
endpoints that OFI endpoints are built on top of. In eager connection
mode, connections are established when addresses are inserted into the
address vector. In lazy connection mode, connections are established when
addresses are used the first time in communication. Eager connection mode
has slightly lower critical path overhead but lazy connection mode scales
better.
This option controls how the two connection modes are used. When
set to 1, lazy connection mode is always used. When set to 0, eager
connection mode is used when required conditions are all met and lazy
connection mode is used otherwise. The conditions for eager connection mode
are: (1) multiple endpoint (and scalable endpoint) support is disabled by
explicitly setting PSM2_MULTI_EP=0; and (2) the address vector type is
FI_AV_MAP.
The default setting is 0.
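The selection rule described above can be summarized as a predicate. This is
a sketch only: the enum av_type_sketch stands in for the libfabric address
vector types, and the function name uses_lazy_conn is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the libfabric address vector types (hypothetical enum). */
enum av_type_sketch { AV_MAP_SKETCH, AV_TABLE_SKETCH };

/*
 * Returns true when lazy connection mode would be used.
 * lazy_conn_setting: value of FI_PSM2_LAZY_CONN (0 or 1).
 * multi_ep_disabled: true when PSM2_MULTI_EP=0 was set explicitly.
 */
bool uses_lazy_conn(int lazy_conn_setting, bool multi_ep_disabled,
		    enum av_type_sketch av_type)
{
	assert(lazy_conn_setting == 0 || lazy_conn_setting == 1);

	if (lazy_conn_setting == 1)
		return true;
	/* Eager mode needs both conditions; otherwise fall back to lazy. */
	return !(multi_ep_disabled && av_type == AV_MAP_SKETCH);
}
```

Note that with the default setting of 0 and PSM2_MULTI_EP left unset, the
first condition is not met, so lazy connection mode is what results.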
- FI_PSM2_DISCONNECT
- The provider has a mechanism to automatically send disconnection
notifications to all connected peers before the local endpoint is closed.
In response, the peers call psm2_ep_disconnect to clean up the
connection state at their side. This allows the same PSM2 epid to be used by
different dynamically started processes (clients) to communicate with the
same peer (server). This mechanism, however, introduces extra overhead in
the finalization phase. For applications that never reuse epids within the
same session, such overhead is unnecessary.
This option controls whether the automatic disconnection
notification mechanism should be enabled. For the client-server applications
mentioned above, the client side should set this option to 1, but the server
should set it to 0.
The default setting is 0.
- FI_PSM2_TAG_LAYOUT
- Selects how the 96 bits of the PSM2 tag are organized. Currently three choices
are available: tag60 means 32-4-60 partitioning for CQ data,
internal protocol flags, and application tag. tag64 means 4-28-64
partitioning for internal protocol flags, CQ data, and application tag.
auto means to choose either tag60 or tag64 based on
the hints passed to fi_getinfo -- tag60 is used if remote CQ data
support is requested explicitly, either by passing a non-zero value via
hints->domain_attr->cq_data_size or by including
FI_REMOTE_CQ_DATA in hints->caps, otherwise tag64
is used. If tag64 is the result of automatic selection,
fi_getinfo also returns a second instance of the provider with
tag60 layout.
The default setting is auto.
Notice that if the provider is compiled with the macro
PSMX2_TAG_LAYOUT defined to 1 (meaning tag60) or 2 (meaning
tag64), the choice is fixed at compile time and this runtime option
is disabled.
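To make the two partitionings concrete, the sketch below packs the three
fields into 96 bits, modelled as a 32-bit high word and a 64-bit low word.
The field widths come from the description above; the struct tag96, the
function names, and the choice to place the fields from most significant to
least significant bit in the listed order are assumptions made for
illustration and may not match the provider's internal representation.

```c
#include <assert.h>
#include <stdint.h>

/* 96 tag bits modelled as a 32-bit high word and a 64-bit low word. */
struct tag96 {
	uint32_t hi;
	uint64_t lo;
};

#define TAG60_MASK ((UINT64_C(1) << 60) - 1)
#define CQ28_MASK  ((UINT32_C(1) << 28) - 1)

/* tag60: 32 bits CQ data | 4 bits flags | 60 bits application tag. */
struct tag96 pack_tag60(uint32_t cq_data, unsigned flags, uint64_t app_tag)
{
	struct tag96 t;

	assert(flags <= 0xF);
	t.hi = cq_data;
	t.lo = ((uint64_t)flags << 60) | (app_tag & TAG60_MASK);
	return t;
}

/* tag64: 4 bits flags | 28 bits CQ data | 64 bits application tag. */
struct tag96 pack_tag64(unsigned flags, uint32_t cq_data, uint64_t app_tag)
{
	struct tag96 t;

	assert(flags <= 0xF);
	t.hi = ((uint32_t)flags << 28) | (cq_data & CQ28_MASK);
	t.lo = app_tag;
	return t;
}
```

The trade-off is visible in the masks: tag60 keeps the full 32 bits of CQ
data but truncates the application tag to 60 bits, while tag64 keeps the full
64-bit application tag but leaves only 28 bits for CQ data.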