fi_msg(3) | #VERSION# | fi_msg(3) |
fi_msg - Message data transfer operations
fi_send / fi_sendv / fi_sendmsg / fi_inject / fi_senddata : Initiate an operation to send a message
#include <rdma/fi_endpoint.h>

ssize_t fi_recv(struct fid_ep *ep, void *buf, size_t len,
    void *desc, fi_addr_t src_addr, void *context);

ssize_t fi_recvv(struct fid_ep *ep, const struct iovec *iov, void **desc,
    size_t count, fi_addr_t src_addr, void *context);

ssize_t fi_recvmsg(struct fid_ep *ep, const struct fi_msg *msg,
    uint64_t flags);

ssize_t fi_send(struct fid_ep *ep, const void *buf, size_t len,
    void *desc, fi_addr_t dest_addr, void *context);

ssize_t fi_sendv(struct fid_ep *ep, const struct iovec *iov,
    void **desc, size_t count, fi_addr_t dest_addr, void *context);

ssize_t fi_sendmsg(struct fid_ep *ep, const struct fi_msg *msg,
    uint64_t flags);

ssize_t fi_inject(struct fid_ep *ep, const void *buf, size_t len,
    fi_addr_t dest_addr);

ssize_t fi_senddata(struct fid_ep *ep, const void *buf, size_t len,
    void *desc, uint64_t data, fi_addr_t dest_addr, void *context);

ssize_t fi_injectdata(struct fid_ep *ep, const void *buf, size_t len,
    uint64_t data, fi_addr_t dest_addr);
The send functions (fi_send, fi_sendv, fi_sendmsg, fi_inject, and fi_senddata) transmit a message from one endpoint to another. The main difference between the send functions is the number and type of parameters that they accept as input. Otherwise, they perform the same general function. Messages sent using fi_msg operations are received by a remote endpoint into a buffer posted to receive such messages.
The receive functions – fi_recv, fi_recvv, fi_recvmsg – post a data buffer to an endpoint to receive inbound messages. Similar to the send operations, receive operations operate asynchronously. Users should not touch the posted data buffer(s) until the receive operation has completed.
An endpoint must be enabled before an application can post send or receive operations to it. For connected endpoints, receive buffers may be posted prior to connect or accept being called on the endpoint. This ensures that buffers are available to receive incoming data immediately after the connection has been established.
Completed message operations are reported to the user through one or more event collectors associated with the endpoint. Users provide a context with each operation, which is returned to the user as part of the completion event. See fi_cq for completion event details.
The call fi_send transfers the data contained in the user-specified data buffer to a remote endpoint, with message boundaries being maintained.
The fi_sendv call adds support for a scatter-gather list to fi_send. It transfers the set of data buffers referenced by the iov parameter to a remote endpoint as a single message.
The fi_sendmsg call supports data transfers over both connected and connectionless endpoints, with the ability to control the send operation per call through the use of flags. The fi_sendmsg function takes a struct fi_msg as input.
struct fi_msg {
    const struct iovec *msg_iov;   /* scatter-gather array */
    void               **desc;     /* local request descriptors */
    size_t             iov_count;  /* # elements in iov */
    fi_addr_t          addr;       /* optional endpoint address */
    void               *context;   /* user-defined context */
    uint64_t           data;       /* optional message data */
};
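As an illustration of how the scatter-gather fields fit together, the sketch below describes a header and a payload as one two-segment message. So that it compiles without libfabric installed, it uses a local stand-in, struct sketch_msg, that copies the field layout of struct fi_msg shown above, with uint64_t assumed in place of fi_addr_t. All sketch_ names are hypothetical. In a real application the populated struct fi_msg would be passed to fi_sendmsg together with the desired flags.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>   /* struct iovec */

/* Hypothetical stand-in mirroring the struct fi_msg fields listed above. */
struct sketch_msg {
    const struct iovec *msg_iov;    /* scatter-gather array */
    void               **desc;      /* local request descriptors */
    size_t             iov_count;   /* # elements in iov */
    uint64_t           addr;        /* assumed stand-in for fi_addr_t */
    void               *context;    /* user-defined context */
    uint64_t           data;        /* optional message data */
};

/* Describe a header and a payload as one message made of two segments.
 * The caller owns iov[] and must keep it valid while msg is in use. */
static void sketch_fill_msg(struct sketch_msg *msg, struct iovec iov[2],
                            void *hdr, size_t hdr_len,
                            void *payload, size_t payload_len,
                            uint64_t dest, void *context)
{
    iov[0].iov_base = hdr;
    iov[0].iov_len  = hdr_len;
    iov[1].iov_base = payload;
    iov[1].iov_len  = payload_len;

    msg->msg_iov   = iov;
    msg->desc      = NULL;     /* assumes no local descriptors are required */
    msg->iov_count = 2;
    msg->addr      = dest;
    msg->context   = context;  /* handed back with the completion */
    msg->data      = 0;        /* no remote CQ data in this sketch */
}

/* Total number of bytes the descriptor covers across all segments. */
static size_t sketch_msg_len(const struct sketch_msg *msg)
{
    size_t total = 0;
    for (size_t i = 0; i < msg->iov_count; i++)
        total += msg->msg_iov[i].iov_len;
    return total;
}
```

The two segments are transferred as a single message, so the receiver sees one message whose length is the sum of the segment lengths.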
The send inject call is an optimized version of fi_send with the following characteristics. The data buffer is available for reuse immediately on return from the call, and no CQ entry will be written if the transfer completes successfully.
Conceptually, this means that the fi_inject function behaves as if the FI_INJECT transfer flag were set, selective completions are enabled, and the FI_COMPLETION flag is not specified. Note that the CQ entry will be suppressed even if the default behavior of the endpoint is to write CQ entries for all successful completions. See the flags discussion below for more details. The requested message size that can be used with fi_inject is limited by inject_size.
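The two properties above (no CQ entry on success, and a size limit of inject_size) suggest a simple rule for choosing between fi_inject and fi_send. The following is a minimal sketch of that rule only; sketch_pick_send and its enum are hypothetical names, and inject_size is assumed to have been read from the endpoint's attributes beforehand.

```c
#include <stdbool.h>
#include <stddef.h>

enum sketch_send_path { SKETCH_USE_INJECT, SKETCH_USE_SEND };

/* Pick a send call for a message of len bytes.
 * need_completion: the application wants a CQ entry for this transfer,
 *                  which the inject call does not produce on success.
 * inject_size:     largest message the inject call accepts (assumed to
 *                  come from the endpoint attributes). */
static enum sketch_send_path
sketch_pick_send(size_t len, size_t inject_size, bool need_completion)
{
    if (need_completion)
        return SKETCH_USE_SEND;
    if (len > inject_size)
        return SKETCH_USE_SEND;
    return SKETCH_USE_INJECT;
}
```

For example, with an inject_size of 128, a 64-byte message that needs no completion would take the inject path, while a 256-byte message would fall back to fi_send.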
The send data call is similar to fi_send, but allows for the sending of remote CQ data (see FI_REMOTE_CQ_DATA flag) as part of the transfer.
The inject data call is similar to fi_inject, but allows for the sending of remote CQ data (see FI_REMOTE_CQ_DATA flag) as part of the transfer.
The fi_recv call posts a data buffer to the receive queue of the corresponding endpoint. Posted receives are searched in the order in which they were posted in order to match sends. Message boundaries are maintained. The order in which the receives complete is dependent on the endpoint type and protocol. For connectionless endpoints, the src_addr parameter can be used to indicate that a buffer should be posted to receive incoming data from a specific remote endpoint.
The fi_recvv call adds support for a scatter-gather list to fi_recv. It posts the set of data buffers referenced by the iov parameter to receive incoming data.
The fi_recvmsg call supports posting buffers over both connected and connectionless endpoints, with the ability to control the receive operation per call through the use of flags. The fi_recvmsg function takes a struct fi_msg as input.
The fi_recvmsg and fi_sendmsg calls allow the user to specify flags which can change the default message handling of the endpoint. Flags specified with fi_recvmsg / fi_sendmsg override most flags previously configured with the endpoint, except where noted (see fi_endpoint.3). The following flags are usable with fi_recvmsg and/or fi_sendmsg.
FI_MULTI_RECV : The buffer will be released by the provider when the available buffer space falls below the specified minimum (see FI_OPT_MIN_MULTI_RECV). Note that an entry to the associated receive completion queue will always be generated when the buffer has been consumed, even if other receive completions have been suppressed (i.e. the Rx context has been configured for FI_SELECTIVE_COMPLETION). See the FI_MULTI_RECV completion flag in fi_cq(3).
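The release rule for a multi-receive buffer can be modeled with a few lines of bookkeeping. This is only a sketch of the rule as stated above, not of any provider's implementation; struct sketch_multi_recv and sketch_multi_recv_consume are hypothetical, and min_free stands in for the value configured through FI_OPT_MIN_MULTI_RECV.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one multi-receive buffer. */
struct sketch_multi_recv {
    size_t size;      /* total bytes posted */
    size_t used;      /* bytes consumed by received messages so far */
    size_t min_free;  /* stand-in for the FI_OPT_MIN_MULTI_RECV value */
};

/* Account for one received message of len bytes.
 * Returns true once the remaining space has dropped below min_free,
 * the point at which the buffer is released back to the application.
 * A message longer than the remaining space is clamped; how a provider
 * treats that case is not modeled here. */
static bool sketch_multi_recv_consume(struct sketch_multi_recv *mr, size_t len)
{
    size_t remaining = mr->size - mr->used;

    if (len > remaining)
        len = remaining;
    mr->used += len;
    return (mr->size - mr->used) < mr->min_free;
}
```

With a 4096-byte buffer and a 1024-byte minimum, messages of 1000 and 2000 bytes leave 1096 bytes free and the buffer stays posted; a further 100-byte message leaves 996 bytes, which is below the minimum, so the buffer is released.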
FI_FENCE : The ordering of operations starting at the posting of the fenced operation (inclusive) to the posting of a subsequent fenced operation (exclusive) is controlled by the endpoint’s ordering semantics.
Buffered receives indicate that the networking layer allocates and manages the data buffers used to receive network data transfers. As a result, received messages must be copied from the network buffers into application buffers for processing. However, applications can avoid this copy if they are able to process the message in place (directly from the networking buffers).
Handling buffered receives differs based on the size of the message being sent. In general, smaller messages are passed directly to the application for processing. However, for large messages, an application will only receive the start of the message and must claim the rest. The details for how small messages are reported and large messages may be claimed are described below.
When a provider receives a message, it will write an entry to the completion queue associated with the receiving endpoint. For discussion purposes, the completion queue is assumed to be configured for FI_CQ_FORMAT_DATA. Since buffered receives are not associated with application posted buffers, the CQ entry op_context will point to a struct fi_recv_context.
struct fi_recv_context {
    struct fid_ep *ep;
    void          *context;
};
The `ep' field will point to the receiving endpoint or Rx context, and `context' will be NULL. The CQ entry’s `buf' will point to a provider managed buffer where the start of the received message is located, and `len' will be set to the total size of the message.
The maximum sized message that a provider can buffer is limited by FI_OPT_BUFFERED_LIMIT. This threshold can be obtained and may be adjusted by the application using the fi_getopt and fi_setopt calls, respectively. Any adjustments must be made prior to enabling the endpoint. The CQ entry `buf' will point to a buffer of received data. If the sent message is larger than the buffered amount, the CQ entry `flags' will have the FI_MORE bit set. When the FI_MORE bit is set, `buf' will reference at least FI_OPT_BUFFERED_MIN bytes of data (see fi_endpoint.3 for more info).
After being notified that a buffered receive has arrived, applications must either claim or discard the message. Typically, small messages are processed and discarded, while large messages are claimed. However, an application is free to claim or discard any message regardless of message size.
To claim a message, an application must post a receive operation with the FI_CLAIM flag set. The struct fi_recv_context returned as part of the notification must be provided as the receive operation’s context. The struct fi_recv_context contains a `context' field. Applications may modify this field prior to claiming the message. When the claim operation completes, a standard receive completion entry will be generated on the completion queue. The `context' of the associated CQ entry will be set to the `context' value passed in through the fi_recv_context structure, and the CQ entry flags will have the FI_CLAIM bit set.
Buffered receives that are not claimed must be discarded by the application when it is done processing the CQ entry data. To discard a message, an application must post a receive operation with the FI_DISCARD flag set. The struct fi_recv_context returned as part of the notification must be provided as the receive operation’s context. When the FI_DISCARD flag is set for a receive operation, the receive input buffer(s) and length parameters are ignored.
IMPORTANT: Buffered receives must be claimed or discarded in a timely manner. Failure to do so may result in increased memory usage for network buffering or communication stalls. Once a buffered receive has been claimed or discarded, the original CQ entry `buf' or struct fi_recv_context data may no longer be accessed by the application.
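The claim-or-discard choice described above can be sketched as a small decision function. This is one possible policy under stated assumptions, not a required one: SKETCH_MORE is a placeholder bit standing in for FI_MORE (it is not the real flag value), in_place_limit is an application-chosen threshold, and all sketch_ names are hypothetical. In a real application the result would drive a receive posted with FI_CLAIM or FI_DISCARD, using the struct fi_recv_context from the notification as the operation's context.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit standing in for FI_MORE; NOT the real flag value. */
#define SKETCH_MORE (1ULL << 0)

enum sketch_action { SKETCH_DISCARD, SKETCH_CLAIM };

/* Decide what to do with a buffered-receive notification.
 * cq_flags and cq_len mirror the CQ entry's `flags' and `len' fields.
 * in_place_limit is the largest message the application is willing to
 * process directly from the provider-managed buffer. */
static enum sketch_action
sketch_on_buffered_recv(uint64_t cq_flags, size_t cq_len, size_t in_place_limit)
{
    if (cq_flags & SKETCH_MORE)
        return SKETCH_CLAIM;    /* buf holds only the start of the message */
    if (cq_len > in_place_limit)
        return SKETCH_CLAIM;    /* application prefers its own buffer */
    return SKETCH_DISCARD;      /* process in place, then discard */
}
```

Whichever action is chosen, it should be posted promptly, since the provider's buffer stays in use until the message is claimed or discarded.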
The use of the FI_CLAIM and FI_DISCARD operation flags is also described with respect to tagged message transfers in fi_tagged.3. Buffered receives of tagged messages will include the message tag as part of the CQ entry, if available.
The handling of buffered receives follows all message ordering restrictions assigned to an endpoint. For example, completions may indicate the order in which received messages arrived at the receiver based on the endpoint attributes.
Variable length messages, or simply variable messages, are transfers where the size of the message is unknown to the receiver prior to the message being sent; the recipient does not know how much data to expect before the message arrives. They are most commonly used when the size of message transfers varies greatly, with very large messages interspersed with much smaller messages, making receive side message buffering difficult to manage. Variable messages are not subject to max message length restrictions (i.e. struct fi_ep_attr::max_msg_size limits), and may be up to the maximum value of size_t (e.g. SIZE_MAX) in length.
Variable length message support requests that the provider allocate and manage the network message buffers. As a result, the application requirements and provider behavior are identical to those defined for supporting the FI_BUFFERED_RECV mode bit. See the Buffered Receive section above for details. The main difference is that buffered receives are limited by the fi_ep_attr::max_msg_size threshold, whereas variable length messages are not.
Support for variable messages is indicated through the FI_VARIABLE_MSG capability bit.
If an endpoint has been configured with FI_MSG_PREFIX, the application must include buffer space of size msg_prefix_size, as specified by the endpoint attributes. The prefix buffer must occur at the start of the data referenced by the buf parameter, or be referenced by the first IO vector. Message prefix space cannot be split between multiple IO vectors. The size of the prefix buffer should be included as part of the total buffer length.
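The buffer layout this implies can be sketched as follows: the prefix space sits at the start of a single contiguous buffer, application data follows it, and the length handed to the send or receive call covers both. The struct and function names are hypothetical, and prefix_size is assumed to hold the msg_prefix_size value from the endpoint attributes.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical description of a buffer laid out for FI_MSG_PREFIX. */
struct sketch_prefixed_buf {
    void   *buf;      /* start of the buffer, beginning with prefix space */
    size_t len;       /* total length to pass to the call, prefix included */
    void   *payload;  /* where application data begins */
};

/* Allocate prefix_size bytes of zeroed prefix space followed by a copy
 * of data. Returns 0 on success, -1 on allocation failure. The caller
 * releases out->buf with free(). */
static int sketch_prefixed_buf_init(struct sketch_prefixed_buf *out,
                                    size_t prefix_size,
                                    const void *data, size_t data_len)
{
    unsigned char *p = calloc(1, prefix_size + data_len);

    if (p == NULL)
        return -1;
    memcpy(p + prefix_size, data, data_len);

    out->buf     = p;
    out->len     = prefix_size + data_len;
    out->payload = p + prefix_size;
    return 0;
}
```

Because the prefix and payload share one allocation, the prefix is never split across IO vectors; out->buf and out->len are the values that would be given as the buf and len parameters.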
Returns 0 on success. On error, a negative value corresponding to fabric errno is returned. Fabric errno values are defined in rdma/fi_errno.h.
See the discussion below for details on handling FI_EAGAIN.
Insufficient internal buffering is often associated with operations that use FI_INJECT. In such cases, additional buffering may become available as posted operations complete.
Full processing queues may be a temporary state related to local processing (for example, a large message is being transferred), or may be the result of flow control. In the latter case, the queues may remain blocked until additional resources are made available at the remote side of the transfer.
In all cases, the operation may be retried after additional resources become available. It is strongly recommended that applications check for transmit and receive completions after receiving FI_EAGAIN as a return value, independent of the operation which failed. This is particularly important in cases where manual progress is employed, as acknowledgements or flow control messages may need to be processed in order to resume execution.
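The retry advice above can be sketched as a loop that checks for completions between attempts. The sketch is written against two callbacks so that it compiles without libfabric: in a real application, post would wrap a call such as fi_send and progress would read the transmit and receive completion queues. EAGAIN from <errno.h> is used on the assumption that FI_EAGAIN has the same value, and every sketch_ name, including the fake endpoint used for demonstration, is hypothetical.

```c
#include <errno.h>
#include <sys/types.h>   /* ssize_t */

typedef ssize_t (*sketch_post_fn)(void *arg);      /* posts one operation */
typedef void    (*sketch_progress_fn)(void *arg);  /* checks Tx/Rx completions */

/* Call post() until it stops returning -EAGAIN, running progress() after
 * each failed attempt. Gives up after max_tries attempts and returns the
 * last result (still -EAGAIN if no attempt succeeded). */
static ssize_t sketch_post_with_retry(sketch_post_fn post, void *post_arg,
                                      sketch_progress_fn progress,
                                      void *progress_arg, unsigned max_tries)
{
    ssize_t ret = -EAGAIN;

    for (unsigned i = 0; i < max_tries && ret == -EAGAIN; i++) {
        ret = post(post_arg);
        if (ret == -EAGAIN)
            progress(progress_arg);
    }
    return ret;
}

/* Fake endpoint for demonstration: fails a fixed number of times. */
struct sketch_fake {
    int failures_left;  /* remaining posts that will return -EAGAIN */
    int polls;          /* number of times progress was driven */
};

static ssize_t sketch_fake_post(void *arg)
{
    struct sketch_fake *f = arg;

    if (f->failures_left > 0) {
        f->failures_left--;
        return -EAGAIN;
    }
    return 0;
}

static void sketch_fake_progress(void *arg)
{
    struct sketch_fake *f = arg;

    f->polls++;
}
```

Bounding the number of attempts is a design choice of this sketch; an application may instead retry indefinitely or yield between attempts, as long as completions keep being processed.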
OpenFabrics.
2022-12-11 | Libfabric Programmer’s Manual |