fi_endpoint(3) | #VERSION# | fi_endpoint(3) |
fi_endpoint - Fabric endpoint operations
#include <rdma/fabric.h>

#include <rdma/fi_endpoint.h>

int fi_endpoint(struct fid_domain *domain, struct fi_info *info,
    struct fid_ep **ep, void *context);

int fi_endpoint2(struct fid_domain *domain, struct fi_info *info,
    struct fid_ep **ep, uint64_t flags, void *context);

int fi_scalable_ep(struct fid_domain *domain, struct fi_info *info,
    struct fid_ep **sep, void *context);

int fi_passive_ep(struct fid_fabric *fabric, struct fi_info *info,
    struct fid_pep **pep, void *context);

int fi_tx_context(struct fid_ep *sep, int index,
    struct fi_tx_attr *attr, struct fid_ep **tx_ep,
    void *context);

int fi_rx_context(struct fid_ep *sep, int index,
    struct fi_rx_attr *attr, struct fid_ep **rx_ep,
    void *context);

int fi_stx_context(struct fid_domain *domain,
    struct fi_tx_attr *attr, struct fid_stx **stx,
    void *context);

int fi_srx_context(struct fid_domain *domain,
    struct fi_rx_attr *attr, struct fid_ep **rx_ep,
    void *context);

int fi_close(struct fid *ep);

int fi_ep_bind(struct fid_ep *ep, struct fid *fid, uint64_t flags);

int fi_scalable_ep_bind(struct fid_ep *sep, struct fid *fid, uint64_t flags);

int fi_pep_bind(struct fid_pep *pep, struct fid *fid, uint64_t flags);

int fi_enable(struct fid_ep *ep);

int fi_cancel(struct fid_ep *ep, void *context);

int fi_ep_alias(struct fid_ep *ep, struct fid_ep **alias_ep, uint64_t flags);

int fi_control(struct fid *ep, int command, void *arg);

int fi_getopt(struct fid *ep, int level, int optname,
    void *optval, size_t *optlen);

int fi_setopt(struct fid *ep, int level, int optname,
    const void *optval, size_t optlen);

uint32_t fi_tc_dscp_set(uint8_t dscp);

uint8_t fi_tc_dscp_get(uint32_t tclass);

DEPRECATED ssize_t fi_rx_size_left(struct fid_ep *ep);

DEPRECATED ssize_t fi_tx_size_left(struct fid_ep *ep);
Endpoints are transport level communication portals. There are two types of endpoints: active and passive. Passive endpoints belong to a fabric domain and are most often used to listen for incoming connection requests. However, a passive endpoint may be used to reserve a fabric address that can be granted to an active endpoint. Active endpoints belong to access domains and can perform data transfers.
Active endpoints may be connection-oriented or connectionless, and may provide data reliability. The data transfer interfaces – messages (fi_msg), tagged messages (fi_tagged), RMA (fi_rma), and atomics (fi_atomic) – are associated with active endpoints. In basic configurations, an active endpoint has transmit and receive queues. In general, operations that generate traffic on the fabric are posted to the transmit queue. This includes all RMA and atomic operations, along with sent messages and sent tagged messages. Operations that post buffers for receiving incoming data are submitted to the receive queue.
Active endpoints are created in the disabled state. They must transition into an enabled state before accepting data transfer operations, including posting of receive buffers. The fi_enable call is used to transition an active endpoint into an enabled state. The fi_connect and fi_accept calls will also transition an endpoint into the enabled state, if it is not already enabled.
In order to transition an endpoint into an enabled state, it must be bound to one or more fabric resources. An endpoint that will generate asynchronous completions, either through data transfer operations or communication establishment events, must be bound to the appropriate completion queues or event queues, respectively, before being enabled. Additionally, endpoints that use manual progress must be associated with relevant completion queues or event queues in order to drive progress. For endpoints that are only used as the target of RMA or atomic operations, this means binding the endpoint to a completion queue associated with receive processing. Connectionless endpoints must be bound to an address vector.
Once an endpoint has been activated, it may be associated with an address vector. Receive buffers may be posted to it and calls may be made to connection establishment routines. Connectionless endpoints may also perform data transfers.
The behavior of an endpoint may be adjusted by setting its control data and protocol options. This allows the underlying provider to redirect function calls to implementations optimized to meet the desired application behavior.
If an endpoint experiences a critical error, it will transition back into a disabled state. Critical errors are reported through the event queue associated with the EP. In certain cases, a disabled endpoint may be re-enabled. The ability to transition back into an enabled state is provider specific and depends on the type of error that the endpoint experienced. When an endpoint is disabled as a result of a critical error, all pending operations are discarded.
fi_endpoint allocates a new active endpoint. fi_passive_ep allocates a new passive endpoint. fi_scalable_ep allocates a scalable endpoint. The properties and behavior of the endpoint are defined based on the provided struct fi_info. See fi_getinfo for additional details on fi_info. fi_info flags that control the operation of an endpoint are defined below. See section SCALABLE ENDPOINTS.
If an active endpoint is allocated in order to accept a connection request, the fi_info parameter must be the same as the fi_info structure provided with the connection request (FI_CONNREQ) event.
An active endpoint may acquire the properties of a passive endpoint by setting the fi_info handle field to the passive endpoint fabric descriptor. This is useful for applications that need to reserve the fabric address of an endpoint prior to knowing if the endpoint will be used on the active or passive side of a connection. For example, this feature is useful for simulating socket semantics. Once an active endpoint acquires the properties of a passive endpoint, the passive endpoint is no longer bound to any fabric resources and must no longer be used. The user is expected to close the passive endpoint after opening the active endpoint in order to free up any lingering resources that had been used.
fi_endpoint2 is similar to fi_endpoint, but accepts an extra flags parameter. It is mainly used for opening endpoints that use the peer transfer feature. See fi_peer(3).
fi_close closes an endpoint and releases all resources associated with it.
When closing a scalable endpoint, there must be no open transmit or receive contexts associated with the scalable endpoint. If resources are still associated with the scalable endpoint when attempting to close, the call will return -FI_EBUSY.
Outstanding operations posted to the endpoint when fi_close is called will be discarded. Discarded operations will silently be dropped, with no completions reported. Additionally, a provider may discard previously completed operations from the associated completion queue(s). The behavior to discard completed operations is provider specific.
fi_ep_bind is used to associate an endpoint with other allocated resources, such as completion queues, counters, address vectors, event queues, shared contexts, and memory regions. The types of objects that must be bound with an endpoint depend on the endpoint type and its configuration.
Passive endpoints must be bound with an EQ that supports connection management events. Connectionless endpoints must be bound to a single address vector. If an endpoint is using a shared transmit and/or receive context, the shared contexts must be bound to the endpoint. CQs, counters, AV, and shared contexts must be bound to endpoints before they are enabled either explicitly or implicitly.
An endpoint must be bound with CQs capable of reporting completions for any asynchronous operation initiated on the endpoint. For example, if the endpoint supports any outbound transfers (sends, RMA, atomics, etc.), then it must be bound to a completion queue that can report transmit completions. This is true even if the endpoint is configured to suppress successful completions, in order that operations that complete in error may be reported to the user.
An active endpoint may direct asynchronous completions to different CQs, based on the type of operation. This is specified using fi_ep_bind flags. The following flags may be OR’ed together when binding an endpoint to a completion domain CQ.
When FI_SELECTIVE_COMPLETION is set, the user must determine when a request that does NOT have FI_COMPLETION set has completed indirectly, usually based on the completion of a subsequent operation or by using completion counters. Use of this flag may improve performance by allowing the provider to avoid writing a CQ completion entry for every operation.
See Notes section below for additional information on how this flag interacts with the FI_CONTEXT and FI_CONTEXT2 mode bits.
An endpoint may optionally be bound to a completion counter. Associating an endpoint with a counter is in addition to binding the EP with a CQ. When binding an endpoint to a counter, the following flags may be specified.
An endpoint may only be bound to a single CQ or counter for a given type of operation. For example, an EP may not bind to two counters both using FI_WRITE. Furthermore, providers may limit CQ and counter bindings to endpoints of the same endpoint type (DGRAM, MSG, RDM, etc.).
fi_scalable_ep_bind is used to associate a scalable endpoint with an address vector. See section on SCALABLE ENDPOINTS. A scalable endpoint has a single transport level address and can support multiple transmit and receive contexts. The transmit and receive contexts share the transport-level address. Address vectors that are bound to scalable endpoints are implicitly bound to any transmit or receive contexts created using the scalable endpoint.
fi_enable transitions the endpoint into an enabled state. An endpoint must be enabled before it may be used to perform data transfers. Enabling an endpoint typically results in hardware resources being assigned to it. Endpoints making use of completion queues, counters, event queues, and/or address vectors must be bound to them before being enabled.
Calling connect or accept on an endpoint will implicitly enable an endpoint if it has not already been enabled.
fi_enable may also be used to re-enable an endpoint that has been disabled as a result of experiencing a critical error. Applications should check the return value from fi_enable to see if a disabled endpoint has successfully been re-enabled.
fi_cancel attempts to cancel an outstanding asynchronous operation. Canceling an operation causes the fabric provider to search for the operation and, if it is still pending, complete it as having been canceled. An error queue entry will be available in the associated error queue with error code FI_ECANCELED. On the other hand, if the operation completed before the call to fi_cancel, then the completion status of that operation will be available in the associated completion queue. No specific entry related to fi_cancel itself will be posted.
Cancel uses the context parameter associated with an operation to identify the request to cancel. Operations posted without a valid context parameter – either no context parameter is specified or the context value was ignored by the provider – cannot be canceled. If multiple outstanding operations match the context parameter, only one will be canceled. In this case, the operation which is canceled is provider specific. The cancel operation is asynchronous, but will complete within a bounded period of time.
fi_ep_alias creates an alias to the specified endpoint. Conceptually, an endpoint alias provides an alternate software path from the application to the underlying provider hardware. An alias EP differs from its parent endpoint only by its default data transfer flags. For example, an alias EP may be configured to use a different completion mode. By default, an alias EP inherits the same data transfer flags as the parent endpoint. An application can use fi_control to modify the alias EP operational flags.
When allocating an alias, an application may configure either the transmit or receive operational flags. This avoids needing a separate call to fi_control to set those flags. The flags passed to fi_ep_alias must include FI_TRANSMIT or FI_RECV (not both) with other operational flags OR’ed in. This will override the transmit or receive flags, respectively, for operations posted through the alias endpoint. All allocated aliases must be closed for the underlying endpoint to be released.
The control operation is used to adjust the default behavior of an endpoint. It allows the underlying provider to redirect function calls to implementations optimized to meet the desired application behavior. As a result, calls to fi_control must be serialized against all other calls to an endpoint.
The base operation of an endpoint is selected during creation using struct fi_info. The following control commands and arguments may be assigned to an endpoint.
Endpoint protocol operations may be retrieved using fi_getopt or set using fi_setopt. Applications specify the level at which a desired option exists, identify the option, and provide input/output buffers to get or set the option. fi_setopt provides an application a way to adjust low-level protocol and implementation specific details of an endpoint.
The following option levels and option names and parameters are defined.
FI_OPT_ENDPOINT
fi_getopt() will return the currently configured threshold, or the provider’s default threshold if one has not been set by the application. fi_setopt() allows an application to configure the threshold. If the provider cannot support the requested threshold, it will fail the fi_setopt() call with FI_EMSGSIZE. Calling fi_setopt() with the threshold set to SIZE_MAX will set the threshold to the maximum supported by the provider. fi_getopt() can then be used to retrieve the set size.
In most cases, the sending and receiving endpoints must be configured to use the same threshold value, and the threshold must be set prior to enabling the endpoint.
The user provides a filled out struct fi_trigger_xpu on input. The iface and device fields should reference an HMEM domain. If the provider does not support XPU triggered operations from the given device, fi_getopt() will return -FI_EOPNOTSUPP. On input, var should reference an array of struct fi_trigger_var data structures, with count set to the size of the referenced array. If count is 0, the var field will be ignored, and the provider will return the number of fi_trigger_var structures needed. If count is > 0, the provider will set count to the needed value, and for each fi_trigger_var available, set the datatype and count of the variable used for the trigger.
fi_tc_dscp_set converts a DSCP defined value into a libfabric traffic class value. It should be used to assign a DSCP value when setting the tclass field in either domain or endpoint attributes.
fi_tc_dscp_get returns the DSCP value associated with the tclass field for the domain or endpoint attributes.
This function has been deprecated and will be removed in a future version of the library. It may not be supported by all providers.
The fi_rx_size_left call returns a lower bound on the number of receive operations that may be posted to the given endpoint without that operation returning -FI_EAGAIN. Depending on the specific details of the subsequently posted receive operations (e.g., number of iov entries, which receive function is called, etc.), it may be possible to post more receive operations than originally indicated by fi_rx_size_left.
This function has been deprecated and will be removed in a future version of the library. It may not be supported by all providers.
The fi_tx_size_left call returns a lower bound on the number of transmit operations that may be posted to the given endpoint without that operation returning -FI_EAGAIN. Depending on the specific details of the subsequently posted transmit operations (e.g., number of iov entries, which transmit function is called, etc.), it may be possible to post more transmit operations than originally indicated by fi_tx_size_left.
The fi_ep_attr structure defines the set of attributes associated with an endpoint. Endpoint attributes may be further refined using the transmit and receive context attributes as shown below.
struct fi_ep_attr {
    enum fi_ep_type type;
    uint32_t        protocol;
    uint32_t        protocol_version;
    size_t          max_msg_size;
    size_t          msg_prefix_size;
    size_t          max_order_raw_size;
    size_t          max_order_war_size;
    size_t          max_order_waw_size;
    uint64_t        mem_tag_format;
    size_t          tx_ctx_cnt;
    size_t          rx_ctx_cnt;
    size_t          auth_key_size;
    uint8_t         *auth_key;
};
If specified, indicates the type of fabric interface communication desired. Supported types are:
Specifies the low-level end to end protocol employed by the provider. A matching protocol must be used by communicating endpoints to ensure interoperability. The following protocol values are defined. Provider specific protocols are also allowed. Provider specific protocols will be indicated by having the upper bit of the protocol value set to one.
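The upper-bit rule above can be sketched as a one-line check on the 32-bit protocol field. The helper name protocol_is_provider_specific is a hypothetical name chosen for illustration and is not part of libfabric; only the bit position comes from the text above.

```c
#include <stdint.h>

/* Hypothetical helper (not a libfabric call): returns 1 when the upper
 * bit (bit 31) of a 32-bit protocol value is set, which the text above
 * reserves for provider specific protocols, and 0 otherwise. */
int protocol_is_provider_specific(uint32_t protocol)
{
    return (int)((protocol >> 31) & 1u);
}
```

An application might use such a check to decide whether a protocol value reported by a provider can be compared against the predefined protocol values at all.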
Identifies which version of the protocol is employed by the provider. The protocol version allows providers to extend an existing protocol in a backward compatible manner, for example by adding support for additional features or functionality. Providers that support different versions of the same protocol should inter-operate, but only when using the capabilities defined for the lesser version.
Defines the maximum size for an application data transfer as a single operation.
Specifies the size of any required message prefix buffer space. This field will be 0 unless the FI_MSG_PREFIX mode is enabled. If msg_prefix_size is > 0, the specified value will be a multiple of 8 bytes.
The maximum ordered size specifies the delivery order of transport data into target memory for RMA and atomic operations. Data ordering is separate from, but dependent on, message ordering (defined below). Data ordering is unspecified where message order is not defined.
Data ordering refers to the access of the same target memory by subsequent operations. When back to back RMA read or write operations access the same registered memory location, data ordering indicates whether the second operation reads or writes the target memory after the first operation has completed. For example, will an RMA read that follows an RMA write read back the data that was written? Similarly, will an RMA write that follows an RMA read update the target buffer after the read has transferred the original data? Data ordering answers these questions, even in the presence of errors, such as the need to resend data because of lost or corrupted network traffic.
RMA ordering applies between two operations, and not within a single data transfer. Therefore, ordering is defined per byte-addressable memory location. I.e. ordering specifies whether location X is accessed by the second operation after the first operation. Nothing is implied about the completion of the first operation before the second operation is initiated. For example, if the first operation updates locations X and Y, but the second operation only accesses location X, there are no guarantees defined relative to location Y and the second operation.
In order to support large data transfers being broken into multiple packets and sent using multiple paths through the fabric, data ordering may be limited to transfers of a specific size or less. Providers specify when data ordering is maintained through the following values. Note that even if data ordering is not maintained, message ordering may be.
An order size value of 0 indicates that ordering is not guaranteed. A value of -1 guarantees ordering for any data size.
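As a minimal sketch of how an application might interpret one of the max_order_*_size values, the helper below applies the rules stated above: 0 means no guarantee, -1 (which is SIZE_MAX when stored in a size_t) means any size, and any other value is treated as an upper bound on the transfer size. The function name data_order_guaranteed and its parameters are assumptions made for illustration, not libfabric calls.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if data ordering is guaranteed for a
 * transfer of transfer_len bytes, given one of the max_order_raw_size,
 * max_order_war_size, or max_order_waw_size values. */
int data_order_guaranteed(size_t max_order_size, size_t transfer_len)
{
    if (max_order_size == 0)
        return 0;               /* ordering is not guaranteed */
    if (max_order_size == SIZE_MAX)
        return 1;               /* -1 as size_t: ordered for any size */
    return transfer_len <= max_order_size;
}
```

A caller that needs ordering for a larger transfer than the reported limit would have to add its own synchronization, for example by waiting for the first operation to complete.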
The memory tag format is a bit array used to convey the number of tagged bits supported by a provider. Additionally, it may be used to divide the bit array into separate fields. The mem_tag_format optionally begins with a series of bits set to 0, to signify bits which are ignored by the provider. Following the initial prefix of ignored bits, the array will consist of alternating groups of bits set to all 1’s or all 0’s. Each group of bits corresponds to a tagged field. The implication of defining a tagged field is that when a mask is applied to the tagged bit array, all bits belonging to a single field will either be set to 1 or 0, collectively.
For example, a mem_tag_format of 0x30FF indicates support for 14 tagged bits, separated into 3 fields. The first field consists of 2-bits, the second field 4-bits, and the final field 8-bits. Valid masks for such a tagged field would be a bitwise OR’ing of zero or more of the following values: 0x3000, 0x0F00, and 0x00FF. The provider may not validate the mask provided by the application for performance reasons.
By identifying fields within a tag, a provider may be able to optimize their search routines. An application which requests tag fields must provide tag masks that either set all mask bits corresponding to a field to all 0 or all 1. When negotiating tag fields, an application can request a specific number of fields of a given size. A provider must return a tag format that supports the requested number of fields, with each field being at least the size requested, or fail the request. A provider may increase the size of the fields. When reporting completions (see FI_CQ_FORMAT_TAGGED), it is not guaranteed that the provider would clear out any unsupported tag bits in the tag field of the completion entry.
It is recommended that field sizes be ordered from smallest to largest. A generic, unstructured tag and mask can be achieved by requesting a bit array consisting of alternating 1’s and 0’s.
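The field layout described above can be decoded with a few bit operations. The sketch below skips the leading 0 bits, treats each following run of identical bits as one field, and checks that a mask covers each field entirely or not at all. All function names are hypothetical and chosen for illustration; libfabric does not provide them, and a provider may not perform this validation itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Split a mem_tag_format value into per-field masks, most significant
 * field first. masks[] must have room for 64 entries. Returns the
 * number of fields found. */
size_t tag_format_fields(uint64_t format, uint64_t masks[64])
{
    size_t nfields = 0;
    int bit = 63;

    /* Skip the leading run of 0 bits, which the provider ignores. */
    while (bit >= 0 && !((format >> bit) & 1))
        bit--;

    /* Each following run of identical bits is one tagged field. */
    while (bit >= 0) {
        uint64_t value = (format >> bit) & 1;
        uint64_t mask = 0;

        while (bit >= 0 && ((format >> bit) & 1) == value) {
            mask |= UINT64_C(1) << bit;
            bit--;
        }
        masks[nfields++] = mask;
    }
    return nfields;
}

/* Number of tagged bits: everything below the leading run of 0 bits. */
unsigned tag_format_bits(uint64_t format)
{
    unsigned bits = 0;

    while (format) {
        bits++;
        format >>= 1;
    }
    return bits;
}

/* A mask is valid when, for every field, its bits are all 0 or all 1. */
int tag_mask_is_valid(uint64_t format, uint64_t mask)
{
    uint64_t fields[64];
    size_t i, n = tag_format_fields(format, fields);

    for (i = 0; i < n; i++) {
        uint64_t part = mask & fields[i];

        if (part != 0 && part != fields[i])
            return 0;
    }
    return 1;
}
```

For the 0x30FF example, tag_format_fields reports the three masks 0x3000, 0x0F00, and 0x00FF, and tag_format_bits reports 14. A mask such as 0x1000 is rejected because it selects only part of the first field.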
Number of transmit contexts to associate with the endpoint. If not specified (0), 1 context will be assigned if the endpoint supports outbound transfers. Transmit contexts are independent transmit queues that may be separately configured. Each transmit context may be bound to a separate CQ, and no ordering is defined between contexts. Additionally, no synchronization is needed when accessing contexts in parallel.
If the count is set to the value FI_SHARED_CONTEXT, the endpoint will be configured to use a shared transmit context, if supported by the provider. Providers that do not support shared transmit contexts will fail the request.
See the scalable endpoint and shared contexts sections for additional details.
Number of receive contexts to associate with the endpoint. If not specified, 1 context will be assigned if the endpoint supports inbound transfers. Receive contexts are independent processing queues that may be separately configured. Each receive context may be bound to a separate CQ, and no ordering is defined between contexts. Additionally, no synchronization is needed when accessing contexts in parallel.
If the count is set to the value FI_SHARED_CONTEXT, the endpoint will be configured to use a shared receive context, if supported by the provider. Providers that do not support shared receive contexts will fail the request.
See the scalable endpoint and shared contexts sections for additional details.
The length of the authorization key in bytes. This field will be 0 if authorization keys are not available or used. This field is ignored unless the fabric is opened with API version 1.5 or greater.
If supported by the fabric, an authorization key (a.k.a. job key) to associate with the endpoint. An authorization key is used to limit communication between endpoints. Only peer endpoints that are programmed to use the same authorization key may communicate. Authorization keys are often used to implement job keys, to ensure that processes running in different jobs do not accidentally cross traffic. The domain authorization key will be used if auth_key_size is set to 0. This field is ignored unless the fabric is opened with API version 1.5 or greater.
Attributes specific to the transmit capabilities of an endpoint are specified using struct fi_tx_attr.
struct fi_tx_attr {
    uint64_t  caps;
    uint64_t  mode;
    uint64_t  op_flags;
    uint64_t  msg_order;
    uint64_t  comp_order;
    size_t    inject_size;
    size_t    size;
    size_t    iov_limit;
    size_t    rma_iov_limit;
    uint32_t  tclass;
};
The requested capabilities of the context. The capabilities must be a subset of those requested of the associated endpoint. See the CAPABILITIES section of fi_getinfo(3) for capability details. If the caps field is 0 on input to fi_getinfo(3), the applicable capability bits from the fi_info structure will be used.
The following capabilities apply to the transmit attributes: FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_SEND, FI_HMEM, FI_TRIGGER, FI_FENCE, FI_MULTICAST, FI_RMA_PMEM, FI_NAMED_RX_CTX, FI_COLLECTIVE, and FI_XPU.
Many applications will be able to ignore this field and rely solely on the fi_info::caps field. Use of this field provides fine grained control over the transmit capabilities associated with an endpoint. It is useful when handling scalable endpoints, with multiple transmit contexts, for example, and allows configuring a specific transmit context with fewer capabilities than that supported by the endpoint or other transmit contexts.
The operational mode bits of the context. The mode bits will be a subset of those associated with the endpoint. See the MODE section of fi_getinfo(3) for details. A mode value of 0 will be ignored on input to fi_getinfo(3), with the mode value of the fi_info structure used instead. On return from fi_getinfo(3), the mode will be set only to those constraints specific to transmit operations.
Flags that control the behavior of operations submitted against the context. Applicable flags are listed in the Operation Flags section.
Message ordering refers to the order in which transport layer headers (as viewed by the application) are identified and processed. Relaxed message order enables data transfers to be sent and received out of order, which may improve performance by utilizing multiple paths through the fabric from the initiating endpoint to a target endpoint. Message order applies only between a single source and destination endpoint pair. Ordering between different target endpoints is not defined.
Message order is determined using a set of ordering bits. Each set bit indicates that ordering is maintained between data transfers of the specified type. Message order is defined for [read | write | send] operations submitted by an application after [read | write | send] operations.
Message ordering only applies to the end to end transmission of transport headers. Message ordering is necessary for, but does not guarantee, the order in which message data is sent or received by the transport layer. Message ordering requires matching ordering semantics on the receiving side of a data transfer operation in order to guarantee that ordering is met.
Completion ordering refers to the order in which completed requests are written into the completion queue. Completion ordering is similar to message order. Relaxed completion order may enable faster reporting of completed transfers, allow acknowledgments to be sent over different fabric paths, and support more sophisticated retry mechanisms. This can result in lower-latency completions, particularly when using connectionless endpoints. Strict completion ordering may require that providers queue completed operations or limit available optimizations.
For transmit requests, completion ordering depends on the endpoint communication type. For unreliable communication, completion ordering applies to all data transfer requests submitted to an endpoint. For reliable communication, completion ordering only applies to requests that target a single destination endpoint. Completion ordering of requests that target different endpoints over a reliable transport is not defined.
Applications should specify the completion ordering that they support or require. Providers should return the completion order that they actually provide, with the constraint that the returned ordering is stricter than that specified by the application. Supported completion order values are:
The requested inject operation size (see the FI_INJECT flag) that the context will support. This is the maximum size data transfer that can be associated with an inject operation (such as fi_inject) or may be used with the FI_INJECT data transfer flag.
The size of the transmit context. The mapping of the size value to resources is provider specific, but it is directly related to the number of command entries allocated for the endpoint. A smaller size value consumes fewer hardware and software resources, while a larger size allows queuing more transmit requests.
While the size attribute guides the size of the underlying endpoint transmit queue, there is not necessarily a one-to-one mapping between a transmit operation and a queue entry. A single transmit operation may consume multiple queue entries; for example, one per scatter-gather entry. Additionally, the size field is intended to guide the allocation of the endpoint’s transmit context. Specifically, for connectionless endpoints, there may be lower-level queues used to track communication on a per peer basis. The sizes of any lower-level queues may be significantly smaller than the endpoint’s transmit size, in order to reduce resource utilization.
This is the maximum number of IO vectors (scatter-gather elements) that a single posted operation may reference.
This is the maximum number of RMA IO vectors (scatter-gather elements) that an RMA or atomic operation may reference. The rma_iov_limit corresponds to the rma_iov_count values in RMA and atomic operations. See struct fi_msg_rma and struct fi_msg_atomic in fi_rma.3 and fi_atomic.3, for additional details. This limit applies to both the number of RMA IO vectors that may be specified when initiating an operation from the local endpoint, as well as the maximum number of IO vectors that may be carried in a single request from a remote endpoint.
Traffic classes can be a differentiated services code point (DSCP) value, one of the following defined labels, or a provider-specific definition. If tclass is unset or set to FI_TC_UNSPEC, the endpoint will use the default traffic class associated with the domain.
Attributes specific to the receive capabilities of an endpoint are specified using struct fi_rx_attr.
struct fi_rx_attr {
    uint64_t  caps;
    uint64_t  mode;
    uint64_t  op_flags;
    uint64_t  msg_order;
    uint64_t  comp_order;
    size_t    total_buffered_recv;
    size_t    size;
    size_t    iov_limit;
};
The requested capabilities of the context. The capabilities must be a subset of those requested of the associated endpoint. See the CAPABILITIES section of fi_getinfo(3) for capability details. If the caps field is 0 on input to fi_getinfo(3), the applicable capability bits from the fi_info structure will be used.
The following capabilities apply to the receive attributes: FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_RECV, FI_HMEM, FI_TRIGGER, FI_RMA_PMEM, FI_DIRECTED_RECV, FI_VARIABLE_MSG, FI_MULTI_RECV, FI_SOURCE, FI_RMA_EVENT, FI_SOURCE_ERR, FI_COLLECTIVE, and FI_XPU.
Many applications will be able to ignore this field and rely solely on the fi_info::caps field. Use of this field provides fine grained control over the receive capabilities associated with an endpoint. It is useful when handling scalable endpoints, with multiple receive contexts, for example, and allows configuring a specific receive context with fewer capabilities than that supported by the endpoint or other receive contexts.
The operational mode bits of the context. The mode bits will be a subset of those associated with the endpoint. See the MODE section of fi_getinfo(3) for details. A mode value of 0 will be ignored on input to fi_getinfo(3), with the mode value of the fi_info structure used instead. On return from fi_getinfo(3), the mode will be set only to those constraints specific to receive operations.
Flags that control the behavior of operations submitted against the context. Applicable flags are listed in the Operation Flags section.
For a description of message ordering, see the msg_order field in the Transmit Context Attribute section. Receive context message ordering defines the order in which received transport message headers are processed when received by an endpoint. When ordering is set, it indicates that message headers will be processed in order, based on how the transmit side has identified the messages. Typically, this means that messages will be handled in order based on a message level sequence number.
The following ordering flags, as defined for transmit ordering, also apply to the processing of received operations: FI_ORDER_NONE, FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAR, FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAR, FI_ORDER_SAW, FI_ORDER_SAS, FI_ORDER_RMA_RAR, FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAR, FI_ORDER_RMA_WAW, FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAR, and FI_ORDER_ATOMIC_WAW.
For a description of completion ordering, see the comp_order field in the Transmit Context Attribute section.
This field is supported for backwards compatibility purposes. It is a hint to the provider of the total available space that may be needed to buffer messages that are received for which there is no matching receive operation. The provider may adjust or ignore this value. The allocation of internal network buffering among received messages is provider specific. For instance, a provider may limit the size of messages which can be buffered or the amount of buffering allocated to a single message.
If receive side buffering is disabled (total_buffered_recv = 0) and a message is received by an endpoint, then the behavior depends on whether resource management has been enabled (that is, whether FI_RM_ENABLED has been set). See the Resource Management section of fi_domain.3 for further clarification. It is recommended that applications enable resource management if they anticipate receiving unexpected messages, rather than modifying this value.
The size of the receive context. The mapping of the size value to resources is provider specific, but it is directly related to the number of command entries allocated for the endpoint. A smaller size value consumes fewer hardware and software resources, while a larger size allows queuing more receive requests.
While the size attribute guides the size of the underlying endpoint receive queue, there is not necessarily a one-to-one mapping between a receive operation and a queue entry. A single receive operation may consume multiple queue entries; for example, one per scatter-gather entry. Additionally, the size field is intended to guide the allocation of the endpoint’s receive context. Specifically, for connectionless endpoints, there may be lower-level queues used to track communication on a per peer basis. To reduce resource utilization, any lower-level queues may be significantly smaller than the endpoint’s receive size.
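To make the relationship between size and queue entries concrete, the following sketch estimates how many receives can be outstanding. It assumes the mapping used as an example above, one queue entry per scatter-gather element. The actual mapping is provider specific, and the helper names are hypothetical.

```c
#include <stddef.h>

/* Assumption for this sketch: one queue entry per scatter-gather
 * element, with a minimum of one entry per receive operation. */
static size_t ex_entries_per_recv(size_t iov_count)
{
	return iov_count ? iov_count : 1;
}

/* Rough count of receives that fit in a context of the given size. */
static size_t ex_postable_recvs(size_t rx_size, size_t iov_count)
{
	return rx_size / ex_entries_per_recv(iov_count);
}
```

Under this assumption, a context with a size of 1024 holds 1024 single-buffer receives but only 256 receives that each reference four IO vectors.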
This is the maximum number of IO vectors (scatter-gather elements) that a single posted operation may reference.
A scalable endpoint is a communication portal that supports multiple transmit and receive contexts. Scalable endpoints are loosely modeled after the networking concept of transmit/receive side scaling, also known as multi-queue. Support for scalable endpoints is domain specific. Scalable endpoints may improve the performance of multi-threaded and parallel applications, by allowing threads to access independent transmit and receive queues. A scalable endpoint has a single transport level address, which can reduce the memory requirements needed to store remote addressing data, versus using standard endpoints. Scalable endpoints cannot be used directly for communication operations, and require the application to explicitly create transmit and receive contexts as described below.
Transmit contexts are independent transmit queues. Ordering and synchronization between contexts are not defined. Conceptually a transmit context behaves similar to a send-only endpoint. A transmit context may be configured with fewer capabilities than the base endpoint and with different attributes (such as ordering requirements and inject size) than other contexts associated with the same scalable endpoint. Each transmit context has its own completion queue. The number of transmit contexts associated with an endpoint is specified during endpoint creation.
The fi_tx_context call is used to retrieve a specific context, identified by an index (see above for details on transmit context attributes). Providers may dynamically allocate contexts when fi_tx_context is called, or may statically create all contexts when fi_endpoint is invoked. By default, a transmit context inherits the properties of its associated endpoint. However, applications may request context specific attributes through the attr parameter. Support for per transmit context attributes is provider specific and not guaranteed. Providers will return the actual attributes assigned to the context through the attr parameter, if provided.
Receive contexts are independent receive queues for receiving incoming data. Ordering and synchronization between contexts are not guaranteed. Conceptually a receive context behaves similar to a receive-only endpoint. A receive context may be configured with fewer capabilities than the base endpoint and with different attributes (such as ordering requirements and inject size) than other contexts associated with the same scalable endpoint. Each receive context has its own completion queue. The number of receive contexts associated with an endpoint is specified during endpoint creation.
Receive contexts are often associated with steering flows that specify which incoming packets targeting a scalable endpoint to process. However, receive contexts may be targeted directly by the initiator, if supported by the underlying protocol. Such contexts are referred to as "named". Support for named contexts must be indicated by setting the FI_NAMED_RX_CTX capability in caps when the corresponding endpoint is created. Support for named receive contexts is coordinated with address vectors. See fi_av(3) and fi_rx_addr(3).
The fi_rx_context call is used to retrieve a specific context, identified by an index (see above for details on receive context attributes). Providers may dynamically allocate contexts when fi_rx_context is called, or may statically create all contexts when fi_endpoint is invoked. By default, a receive context inherits the properties of its associated endpoint. However, applications may request context specific attributes through the attr parameter. Support for per receive context attributes is provider specific and not guaranteed. Providers will return the actual attributes assigned to the context through the attr parameter, if provided.
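Because contexts are identified by index, a multi-threaded application needs some policy for assigning threads to contexts. The sketch below shows one possible policy, a simple modulo mapping. The helper name and the policy are assumptions for illustration and are not part of the interface.

```c
#include <stddef.h>

/* Map a worker thread number onto a context index in [0, ctx_cnt).
 * Returns -1 when the endpoint exposes no contexts of that type. */
static int ex_ctx_index(unsigned worker_id, size_t ctx_cnt)
{
	if (ctx_cnt == 0)
		return -1;
	return (int)(worker_id % ctx_cnt);
}
```

The resulting value would be passed as the index argument of fi_tx_context or fi_rx_context. When there are at least as many contexts as workers, each worker gets a queue of its own; otherwise several workers share an index and must coordinate access according to the threading model in use.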
Shared contexts are transmit and receive contexts explicitly shared among one or more endpoints. A shareable context allows an application to use a single dedicated provider resource among multiple transport addressable endpoints. This can greatly reduce the resources needed to manage communication over multiple endpoints by multiplexing transmit and/or receive processing, with the potential cost of serializing access across multiple endpoints. Support for shareable contexts is domain specific.
Conceptually, shareable transmit contexts are transmit queues that may be accessed by many endpoints. The use of a shared transmit context is mostly opaque to an application. Applications must allocate and bind shared transmit contexts to endpoints, but operations are posted directly to the endpoint. Shared transmit contexts are not associated with completion queues or counters. Completed operations are posted to the CQs bound to the endpoint. An endpoint may only be associated with a single shared transmit context.
Unlike shared transmit contexts, applications interact directly with shared receive contexts. Users post receive buffers directly to a shared receive context, with the buffers usable by any endpoint bound to the shared receive context. Shared receive contexts are not associated with completion queues or counters. Completed receive operations are posted to the CQs bound to the endpoint. An endpoint may only be associated with a single receive context, and all connectionless endpoints associated with a shared receive context must also share the same address vector.
Endpoints associated with a shared transmit context may use dedicated receive contexts, and vice-versa. Or an endpoint may use shared transmit and receive contexts. And there is no requirement that the same group of endpoints sharing a context of one type also share the context of an alternate type. Furthermore, an endpoint may use a shared context of one type, but a scalable set of contexts of the alternate type.
The fi_stx_context call is used to open a shareable transmit context (see above for details on the transmit context attributes). Endpoints associated with a shared transmit context must use a subset of the transmit context’s attributes. Note that this is the reverse of the requirement for transmit contexts for scalable endpoints.
The fi_srx_context call allocates a shareable receive context (see above for details on the receive context attributes). Endpoints associated with a shared receive context must use a subset of the receive context’s attributes. Note that this is the reverse of the requirement for receive contexts for scalable endpoints.
The following feature and description should be considered experimental. Until the experimental tag is removed, the interfaces, semantics, and data structures associated with socket endpoints may change between library versions.
This section applies to endpoints of type FI_EP_SOCK_STREAM and FI_EP_SOCK_DGRAM, commonly referred to as socket endpoints.
Socket endpoints are defined with semantics that allow them to more easily be adopted by developers familiar with the UNIX socket API, or by middleware that exposes the socket API, while still taking advantage of high-performance hardware features.
The key difference between socket endpoints and other active endpoints is that socket endpoints use synchronous data transfers. Buffers passed into send and receive operations revert to the control of the application upon returning from the function call. As a result, no data transfer completions are reported to the application, and socket endpoints are not associated with completion queues or counters.
Socket endpoints support a subset of message operations: fi_send, fi_sendv, fi_sendmsg, fi_recv, fi_recvv, fi_recvmsg, and fi_inject. Because data transfers are synchronous, the return value from send and receive operations indicates the number of bytes transferred on success, or a negative value on error, including -FI_EAGAIN if the endpoint cannot send or receive any data because of full or empty queues, respectively.
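The return value convention described above can be handled with a small classifier. This is a sketch only: EX_EAGAIN stands in for FI_EAGAIN from rdma/fi_errno.h and is assumed here to equal the system EAGAIN value, and the enum and function names are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>  /* ssize_t */

/* Stand-in for FI_EAGAIN; assumed to match the system EAGAIN. */
#define EX_EAGAIN EAGAIN

enum ex_xfer_status { EX_XFER_DONE, EX_XFER_RETRY, EX_XFER_ERROR };

/* Interpret the return value of a synchronous send or receive. */
static enum ex_xfer_status ex_xfer_classify(ssize_t ret)
{
	if (ret >= 0)
		return EX_XFER_DONE;   /* ret is the byte count */
	if (ret == -EX_EAGAIN)
		return EX_XFER_RETRY;  /* queue full or empty; try again later */
	return EX_XFER_ERROR;          /* any other negative fabric errno */
}

/* Bytes transferred, or 0 when the call did not move any data. */
static size_t ex_xfer_bytes(ssize_t ret)
{
	return ret > 0 ? (size_t)ret : 0;
}
```

A caller would typically retry on EX_XFER_RETRY after waiting for the endpoint to become ready, and treat EX_XFER_ERROR as a failure of the transfer.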
Socket endpoints are associated with event queues and address vectors, and process connection management events asynchronously, similar to other endpoints. Unlike UNIX sockets, socket endpoints must still be declared as either active or passive.
Socket endpoints behave like non-blocking sockets. In order to support select and poll semantics, active socket endpoints are associated with a file descriptor that is signaled whenever the endpoint is ready to send and/or receive data. The file descriptor may be retrieved using fi_control.
Operation flags are obtained by OR-ing the following flags together. Operation flags define the default flags applied to an endpoint’s data transfer operations, where a flags parameter is not available. Data transfer operations that take flags as input override the op_flags value of transmit or receive context attributes of an endpoint.
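The precedence between per-call flags and op_flags can be summarized in a few lines. This is a sketch of the rule stated above, with hypothetical flag bits and a hypothetical helper name; it is not library code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; real operation flags come from <rdma/fabric.h>. */
#define EX_FLAG_A (1ULL << 0)
#define EX_FLAG_B (1ULL << 1)

/* Flags applied to one data transfer: a call that accepts a flags
 * parameter uses exactly those flags, and any other call falls back
 * to the op_flags stored in the context attributes. */
static uint64_t ex_effective_flags(bool call_takes_flags,
				   uint64_t call_flags, uint64_t op_flags)
{
	return call_takes_flags ? call_flags : op_flags;
}
```

Note that the per-call flags replace op_flags rather than being OR-ed with them, so a call passing 0 runs with no flags even if op_flags is non-zero.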
Users should call fi_close to release all resources allocated to the fabric endpoint.
Endpoints allocated with the FI_CONTEXT or FI_CONTEXT2 mode bits set must typically provide struct fi_context(2) as their per operation context parameter. (See fi_getinfo.3 for details.) However, when FI_SELECTIVE_COMPLETION is enabled to suppress CQ completion entries, and an operation is initiated without the FI_COMPLETION flag set, then the context parameter is ignored. An application does not need to pass in a valid struct fi_context(2) into such data transfers.
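The conditions under which a valid struct fi_context(2) must be supplied can be written as a single predicate. The function and parameter names below are hypothetical; the logic is a direct restatement of the paragraph above.

```c
#include <stdbool.h>

/* Decide whether an operation must be given a valid struct fi_context(2).
 *   mode_requires_context  - FI_CONTEXT or FI_CONTEXT2 is set for the endpoint
 *   selective_completion   - FI_SELECTIVE_COMPLETION is in effect
 *   op_has_completion_flag - this operation was issued with FI_COMPLETION */
static bool ex_needs_fi_context(bool mode_requires_context,
				bool selective_completion,
				bool op_has_completion_flag)
{
	if (!mode_requires_context)
		return false;
	/* Suppressed completion: the context parameter is ignored. */
	if (selective_completion && !op_has_completion_flag)
		return false;
	return true;
}
```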
Operations that complete in error without a valid operational context will report the endpoint context in any error reporting structures.
Although applications typically associate individual completions with either completion queues or counters, an endpoint can be attached to both a counter and completion queue. When combined with using selective completions, this allows an application to use counters to track successful completions, with a CQ used to report errors. Operations that complete with an error increment the error counter and generate a CQ completion event.
As mentioned in fi_getinfo(3), the ep_attr structure can be used to query providers that support various endpoint attributes. fi_getinfo can return provider info structures that can support the minimal set of requirements (such that the application maintains correctness). However, it can also return provider info structures that exceed application requirements. As an example, consider an application requesting msg_order as FI_ORDER_NONE. The resulting output from fi_getinfo may have all the ordering bits set. The application can reset the ordering bits it does not require before creating the endpoint. The provider is free to implement a stricter ordering than is required by the application.
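Resetting the ordering bits that are not required amounts to masking the returned value with the set the application needs. The sketch below uses hypothetical EX_ORDER_ bits in place of the real FI_ORDER_ definitions, and the helper name is an assumption.

```c
#include <stdint.h>

/* Hypothetical ordering bits; real values come from <rdma/fabric.h>. */
#define EX_ORDER_NONE 0ULL
#define EX_ORDER_SAS  (1ULL << 0)
#define EX_ORDER_RAW  (1ULL << 1)
#define EX_ORDER_WAW  (1ULL << 2)

/* Keep only the ordering bits the application actually requires. */
static uint64_t ex_trim_msg_order(uint64_t returned, uint64_t required)
{
	return returned & required;
}
```

In an application, the same mask would be applied to the msg_order fields of the tx_attr and rx_attr structures returned by fi_getinfo before the fi_info is passed to fi_endpoint.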
Returns 0 on success. On error, a negative value corresponding to fabric errno is returned. For fi_cancel, a return value of 0 indicates that the cancel request was submitted for processing.
Fabric errno values are defined in rdma/fi_errno.h.
fi_getinfo(3), fi_domain(3), fi_cq(3), fi_msg(3), fi_tagged(3), fi_rma(3), fi_peer(3)
OpenFabrics.
2022-12-11 | Libfabric Programmer’s Manual |