fi_cq(3) | #VERSION# | fi_cq(3) |
fi_cq - Completion queue operations
#include <rdma/fi_domain.h>
int fi_cq_open(struct fid_domain *domain, struct fi_cq_attr *attr,
    struct fid_cq **cq, void *context);
int fi_close(struct fid *cq);
int fi_control(struct fid *cq, int command, void *arg);
ssize_t fi_cq_read(struct fid_cq *cq, void *buf, size_t count);
ssize_t fi_cq_readfrom(struct fid_cq *cq, void *buf, size_t count,
    fi_addr_t *src_addr);
ssize_t fi_cq_readerr(struct fid_cq *cq, struct fi_cq_err_entry *buf,
    uint64_t flags);
ssize_t fi_cq_sread(struct fid_cq *cq, void *buf, size_t count,
    const void *cond, int timeout);
ssize_t fi_cq_sreadfrom(struct fid_cq *cq, void *buf, size_t count,
    fi_addr_t *src_addr, const void *cond, int timeout);
int fi_cq_signal(struct fid_cq *cq);
const char * fi_cq_strerror(struct fid_cq *cq, int prov_errno,
    const void *err_data, char *buf, size_t len);
Completion queues are used to report events associated with data transfers. They are associated with message sends and receives, RMA, atomic, tagged messages, and triggered events. Reported events are usually associated with a fabric endpoint, but may also refer to memory regions used as the target of an RMA or atomic operation.
fi_cq_open allocates a new completion queue. Unlike event queues, completion queues are associated with a resource domain and may be offloaded entirely in provider hardware.
The properties and behavior of a completion queue are defined by struct fi_cq_attr.
struct fi_cq_attr {
    size_t size; /* # entries for CQ */
    uint64_t flags; /* operation flags */
    enum fi_cq_format format; /* completion format */
    enum fi_wait_obj wait_obj; /* requested wait object */
    int signaling_vector; /* interrupt affinity */
    enum fi_cq_wait_cond wait_cond; /* wait condition format */
    struct fid_wait *wait_set; /* optional wait set */
};
struct fi_cq_entry {
    void *op_context; /* operation context */
};
struct fi_cq_msg_entry {
    void *op_context; /* operation context */
    uint64_t flags; /* completion flags */
    size_t len; /* size of received data */
};
struct fi_cq_data_entry {
    void *op_context; /* operation context */
    uint64_t flags; /* completion flags */
    size_t len; /* size of received data */
    void *buf; /* receive data buffer */
    uint64_t data; /* completion data */
};
struct fi_cq_tagged_entry {
    void *op_context; /* operation context */
    uint64_t flags; /* completion flags */
    size_t len; /* size of received data */
    void *buf; /* receive data buffer */
    uint64_t data; /* completion data */
    uint64_t tag; /* received tag */
};
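The entry formats listed above are laid out so that each one extends the previous one, with op_context always first. The sketch below illustrates one consequence of that layout: code can walk a buffer of entries knowing only the per-entry size of the selected format. The sk_-prefixed structs, the sk_format enum, and both helper functions are local stand-ins introduced for this example; they are not the definitions from the libfabric headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copies of the entry layouts shown above (hypothetical sk_ names). */
struct sk_cq_entry        { void *op_context; };
struct sk_cq_msg_entry    { void *op_context; uint64_t flags; size_t len; };
struct sk_cq_data_entry   { void *op_context; uint64_t flags; size_t len;
                            void *buf; uint64_t data; };
struct sk_cq_tagged_entry { void *op_context; uint64_t flags; size_t len;
                            void *buf; uint64_t data; uint64_t tag; };

/* Hypothetical format selector used only by this sketch. */
enum sk_format { SK_CONTEXT, SK_MSG, SK_DATA, SK_TAGGED };

/* Size of one entry for the given format; 0 for an unknown format. */
static size_t sk_entry_size(enum sk_format f)
{
    switch (f) {
    case SK_CONTEXT: return sizeof(struct sk_cq_entry);
    case SK_MSG:     return sizeof(struct sk_cq_msg_entry);
    case SK_DATA:    return sizeof(struct sk_cq_data_entry);
    case SK_TAGGED:  return sizeof(struct sk_cq_tagged_entry);
    }
    return 0;
}

/* Return op_context of entry i in a buffer holding n entries of format f.
 * op_context sits at offset 0 in every format, so it is copied out from
 * the start of the entry without needing the full entry type. */
static void *sk_context_at(const void *buf, enum sk_format f,
                           size_t n, size_t i)
{
    void *ctx;

    assert(i < n);
    memcpy(&ctx, (const char *)buf + i * sk_entry_size(f), sizeof(ctx));
    return ctx;
}
```

An application that only needs the operation context can therefore share one completion handler across formats, provided it steps through the buffer using the entry size that matches the format chosen when the CQ was opened.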
A wait condition should be treated as an optimization. Providers are not required to meet the requirements of the condition before signaling the wait object. Applications should not rely on the condition necessarily being true when a blocking read call returns.
If wait_cond is set to FI_CQ_COND_NONE, then no additional conditions are applied to the signaling of the CQ wait object, and the insertion of any new entry will trigger the wait condition. If wait_cond is set to FI_CQ_COND_THRESHOLD, then the cond field is interpreted as a size_t threshold value. The threshold indicates the number of entries that are to be queued at the CQ before the wait is satisfied.
This field is ignored if wait_obj is set to FI_WAIT_NONE.
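The two wait conditions described above can be modeled as a simple predicate on the number of queued entries. This is only a sketch of the intended semantic for a provider that chooses to honor the condition: the sk_wait_cond enum and sk_wait_satisfied function are hypothetical names, the threshold is passed as a plain size_t, and how a real application hands the threshold to the provider through the cond argument is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for FI_CQ_COND_NONE and FI_CQ_COND_THRESHOLD. */
enum sk_wait_cond { SK_COND_NONE, SK_COND_THRESHOLD };

/* True when a provider honoring the condition would signal the wait
 * object, given how many entries are currently queued on the CQ. */
static bool sk_wait_satisfied(enum sk_wait_cond cond, size_t threshold,
                              size_t queued)
{
    switch (cond) {
    case SK_COND_THRESHOLD:
        /* Assumption of this sketch: a threshold of 0 is not meaningful. */
        assert(threshold > 0);
        return queued >= threshold;
    case SK_COND_NONE:
    default:
        /* Any newly inserted entry satisfies the wait. */
        return queued > 0;
    }
}
```

Because providers are not required to meet the condition before signaling, an application should still check how many entries a blocking read actually returned rather than assume the threshold was reached.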
The fi_close call releases all resources associated with a completion queue. Any completions which remain on the CQ when it is closed are lost.
When closing the CQ, there must be no opened endpoints, transmit contexts, or receive contexts associated with the CQ. If resources are still associated with the CQ when attempting to close, the call will return -FI_EBUSY.
The fi_control call is used to access provider or implementation specific details of the completion queue. Access to the CQ should be serialized across all calls when fi_control is invoked, as it may redirect the implementation of CQ operations. The following control commands are usable with a CQ.
The fi_cq_read operation performs a non-blocking read of completion data from the CQ. The format of the completion event is determined using the fi_cq_format option that was specified when the CQ was opened. Multiple completions may be retrieved from a CQ in a single call. The maximum number of entries to return is limited to the specified count parameter, with the number of entries successfully read from the CQ returned by the call. (See return values section below.) A count value of 0 may be used to drive progress on associated endpoints when manual progress is enabled.
CQs are optimized to report operations which have completed successfully. Operations which fail are reported `out of band'. Such operations are retrieved using the fi_cq_readerr function. When an operation that has completed with an unexpected error is encountered, it is placed into a temporary error queue. Attempting to read from a CQ while an item is in the error queue results in fi_cq_read failing with a return code of -FI_EAVAIL. Applications may use this return code to determine when to call fi_cq_readerr.
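The read and read-error interplay described above can be sketched with an in-memory stand-in for the CQ. Everything prefixed sk_ or SK_ (the mock queue, its two read functions, and the numeric error values) is hypothetical and exists only so the control flow can run without a provider; a real application would call fi_cq_read and fi_cq_readerr and compare the results against -FI_EAGAIN and -FI_EAVAIL.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>   /* ssize_t */

/* Arbitrary values for this sketch; real codes live in rdma/fi_errno.h. */
enum { SK_EAGAIN = 1, SK_EAVAIL = 2 };

struct sk_event { void *op_context; int err; };   /* err == 0: success */

struct sk_cq {
    const struct sk_event *ev;   /* queued events, oldest first */
    size_t n;                    /* number of events in ev */
    size_t head;                 /* index of the next event to report */
};

/* Mock non-blocking read: returns successful completions only. */
static ssize_t sk_cq_read(struct sk_cq *cq, void **ctx, size_t count)
{
    size_t got = 0;

    assert(cq != NULL && ctx != NULL);
    if (cq->head == cq->n)
        return -SK_EAGAIN;             /* nothing to read */
    if (cq->ev[cq->head].err)
        return -SK_EAVAIL;             /* error entry is next in line */
    while (got < count && cq->head < cq->n && !cq->ev[cq->head].err)
        ctx[got++] = cq->ev[cq->head++].op_context;
    return (ssize_t)got;
}

/* Mock error read: returns 1 and consumes the pending error entry. */
static ssize_t sk_cq_readerr(struct sk_cq *cq, struct sk_event *out)
{
    if (cq->head == cq->n || !cq->ev[cq->head].err)
        return -SK_EAGAIN;
    *out = cq->ev[cq->head++];
    return 1;
}

struct sk_totals { size_t ok, failed; };

/* Drain the queue, switching to the error path whenever read says so. */
static struct sk_totals sk_drain(struct sk_cq *cq)
{
    struct sk_totals t = { 0, 0 };
    void *ctx[4];

    for (;;) {
        ssize_t ret = sk_cq_read(cq, ctx, 4);

        if (ret > 0) {
            t.ok += (size_t)ret;
        } else if (ret == -SK_EAVAIL) {
            struct sk_event e;

            if (sk_cq_readerr(cq, &e) == 1)
                t.failed++;
        } else {
            break;                     /* -SK_EAGAIN: queue is empty */
        }
    }
    return t;
}
```

The point of the pattern is that the error path is entered only when the normal read reports it, so the common case of successful completions stays on the fast path.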
The fi_cq_readfrom call behaves identically to fi_cq_read, with the exception that it allows the CQ to return source address information to the user for any received data. Source address data is only available for those endpoints configured with FI_SOURCE capability. If fi_cq_readfrom is called on an endpoint for which source addressing data is not available, the source address will be set to FI_ADDR_NOTAVAIL. The number of input src_addr entries must be the same as the count parameter.
Returned source addressing data is converted from the native address used by the underlying fabric into an fi_addr_t, which may be used in transmit operations. Under most circumstances, returning fi_addr_t requires that the source address already have been inserted into the address vector associated with the receiving endpoint. This is true for address vectors of type FI_AV_TABLE. In select providers when FI_AV_MAP is used, source addresses may be converted algorithmically into a usable fi_addr_t, even though the source address has not been inserted into the address vector. This is permitted by the API, as it allows the provider to avoid address look-up as part of receive message processing. In no case do providers insert addresses into an AV separate from an application calling fi_av_insert or similar call.
For endpoints allocated using the FI_SOURCE_ERR capability, if the source address cannot be converted into a valid fi_addr_t value, fi_cq_readfrom will return -FI_EAVAIL, even if the data were received successfully. The completion will then be reported through fi_cq_readerr with the err field set to FI_EADDRNOTAVAIL (err holds a positive error code). See fi_cq_readerr for details.
If FI_SOURCE is specified without FI_SOURCE_ERR, source addresses which cannot be mapped to a usable fi_addr_t will be reported as FI_ADDR_NOTAVAIL.
The fi_cq_sread and fi_cq_sreadfrom calls are the blocking equivalents of fi_cq_read and fi_cq_readfrom. They behave like the non-blocking calls, except that they do not return until a completion has been read from the CQ, an error occurs, or the timeout expires.
Threads blocking in this function will return to the caller if they are signaled by some external source. This is true even if the timeout has not occurred or was specified as infinite.
It is invalid for applications to call these functions if the CQ has been configured with a wait object of FI_WAIT_NONE or FI_WAIT_SET.
The read error function, fi_cq_readerr, retrieves information regarding any asynchronous operation which has completed with an unexpected error. fi_cq_readerr is a non-blocking call, returning immediately whether an error completion was found or not.
Error information is reported to the user through struct fi_cq_err_entry. The format of this structure is defined below.
struct fi_cq_err_entry {
    void *op_context; /* operation context */
    uint64_t flags; /* completion flags */
    size_t len; /* size of received data */
    void *buf; /* receive data buffer */
    uint64_t data; /* completion data */
    uint64_t tag; /* message tag */
    size_t olen; /* overflow length */
    int err; /* positive error code */
    int prov_errno; /* provider error code */
    void *err_data; /* error data */
    size_t err_data_size; /* size of err_data */
};
The general reason for the error is provided through the err field. Provider specific error information may also be available through the prov_errno and err_data fields. Users may call fi_cq_strerror to convert provider specific error information into a printable string for debugging purposes. See field details below for more information on the use of err_data and err_data_size.
Note that error completions are generated for all operations, including those for which a completion was not requested (e.g. an endpoint is configured with FI_SELECTIVE_COMPLETION, but the request did not have the FI_COMPLETION flag set). In such cases, providers will return as much information about the failure as the underlying software and hardware make available; other fields will be set to NULL or 0. This includes the op_context value, which may not have been provided or was ignored on input as part of the transfer.
Notable completion error codes are given below.
The fi_cq_signal call will unblock any thread waiting in fi_cq_sread or fi_cq_sreadfrom. This may be used to wake up a thread that is blocked waiting to read a completion operation. The fi_cq_signal operation is only available if the CQ was configured with a wait object.
The CQ entry data structures share many of the same fields. The meanings of these fields are the same for all CQ entry structure formats.
For completion events that are not associated with a posted operation, this field will be set to NULL. This includes completions generated at the target in response to RMA write operations that carry CQ data (FI_REMOTE_WRITE | FI_REMOTE_CQ_DATA flags set), when the FI_RX_CQ_DATA mode bit is not required.
For compatibility purposes, the behavior of the err_data and err_data_size fields may be modified from that listed above. If err_data_size is 0 on input, or the fabric was opened with release < 1.5, then any buffer referenced by err_data will be ignored on input. In this situation, on output err_data will be set to a data buffer owned by the provider. The contents of the buffer will remain valid until a subsequent read call against the CQ. Applications must serialize access to the CQ when processing errors to ensure that the buffer referenced by err_data does not change.
Completion flags provide additional details regarding the completed operation. The following completion flags are defined.
Applications can distinguish between these two cases by examining the completion entry flags field. If additional flags, such as FI_RECV, are set, the completion is associated with a received message. In this case, the buf field will reference the location where the received message was placed into the multi-recv buffer. Other fields in the completion entry will be determined based on the received message. If other flag bits are zero, the provider is reporting that the multi-recv buffer has been released, and the completion entry is not associated with a received message.
Libfabric defines several completion `levels', identified using operational flags. Each flag indicates the soonest that a completion event may be generated by a provider, and the assumptions that an application may make upon processing a completion. The operational flags are defined below, along with an example of how a provider might implement the semantic. Note that only meeting the semantic is required of the provider and not the implementation. Providers may implement stronger completion semantics than necessary for a given operation, but only the behavior defined by the completion level is guaranteed.
To help understand the conceptual differences in completion levels, consider mailing a letter. Placing the letter into the local mailbox for pick-up is similar to `inject complete'. Having the letter picked up and dropped off at the destination mailbox is equivalent to `transmit complete'. The `delivery complete' semantic is a stronger guarantee, with a person at the destination signing for the letter. However, the person who signed for the letter is not necessarily the intended recipient. The `match complete' option is similar to delivery complete, but requires the intended recipient to sign for the letter.
The `commit complete' level has different semantics than the previously mentioned levels. Commit complete would be closer to the letter arriving at the destination and being placed into a fireproof safe.
The operational flags for the described completion levels are defined below.
Example: A provider may generate this completion event after copying the source buffer into a network buffer, either in host memory or on the NIC. An inject completion does not indicate that the data has been transmitted onto the network, and a local error could occur after the completion event has been generated that could prevent it from being transmitted.
Inject complete allows for the fastest completion reporting (and, hence, buffer reuse), but provides the weakest guarantees against network errors.
Note: This flag is used to control when a completion entry is inserted into a completion queue. It does not apply to operations that do not generate a completion queue entry, such as the fi_inject operation, and is not subject to the inject_size message limit restriction.
For reliable endpoints:
Indicates that a completion should be generated when the operation has been delivered to the peer endpoint. A completion guarantees that the operation is no longer dependent on the fabric or local resources. The state of the operation at the peer endpoint is not defined.
Example: A provider may generate a transmit complete event upon receiving an ack from the peer endpoint. The state of the message at the peer is unknown and may be buffered in the target NIC at the time the ack has been generated.
For unreliable endpoints:
Indicates that a completion should be generated when the operation has been delivered to the fabric. A completion guarantees that the operation is no longer dependent on local resources. The state of the operation within the fabric is not defined.
Delivery complete indicates that the message has been processed by the peer. If an application buffer was ready to receive the results of the message when it arrived, then delivery complete indicates that the data was placed into the application’s buffer.
This completion mode applies only to reliable endpoints. For operations that return data to the initiator, such as RMA read or atomic-fetch, the source endpoint is also considered a destination endpoint. This is the default completion mode for such operations.
This completion mode applies only to operations that target persistent memory regions over reliable endpoints. This completion mode is experimental.
Note that a completion generated for an operation posted prior to the fenced operation only guarantees that the completion level that was originally requested has been met. It is the completion of the fenced operation that guarantees that the additional semantics have been met.
The above completion semantics are defined with respect to the initiator of the operation. The different semantics describe when the initiator may re-use a data buffer, and guarantee what state a transfer must reach before a completion is generated. This allows applications to determine appropriate error handling in case of communication failures.
The completion semantic at the target is used to determine when data at the target is visible to the peer application. Visibility indicates that a memory read to the same address that was the target of a data transfer will return the results of the transfer. The target of a transfer can be identified by the initiator, as may be the case for RMA and atomic operations, or determined by the target, for example by providing a matching receive buffer. Global visibility indicates that the results are available regardless of where the memory read originates. For example, the read could come from a process running on a host CPU, from a subsequent data transfer over the fabric, or from a peer device such as a GPU.
In terms of completion semantics, visibility usually indicates that the transfer meets the FI_DELIVERY_COMPLETE requirements from the perspective of the target. The target completion semantic may be, but is not necessarily, linked with the completion semantic specified by the initiator of the transfer.
Often, target processes do not explicitly state a desired completion semantic and instead rely on the default semantic. The default behavior is based on several factors, including:
Broadly, target completion semantics are grouped based on whether or not the transfer generates a completion event at the target. This includes writing a CQ entry or updating a completion counter. In common use cases, transfers that use a message interface (FI_MSG or FI_TAGGED) typically generate target events, while transfers involving an RMA interface (FI_RMA or FI_ATOMIC) often do not. There are exceptions to both these cases, depending on endpoint to CQ and counter bindings and operational flags. For example, RMA writes that carry remote CQ data will generate a completion event at the target, and are frequently used to convey visibility to the target application. The general guidelines for target side semantics are described below, followed by exceptions that modify that behavior.
By default, completions generated at the target indicate that the transferred data is immediately available to be read from the target buffer. That is, the target sees FI_DELIVERY_COMPLETE (or better) semantics, even if the initiator requested lower semantics. For applications using only data buffers allocated from host memory, this is often sufficient.
For operations that do not generate a completion event at the target, the visibility of the data at the target may need to be inferred based on subsequent operations that do generate target completions. Absent a target completion, when a completion of an operation is written at the initiator, the visibility semantic of the operation at the target aligns with the initiator completion semantic. For instance, if an RMA operation completes at the initiator as either FI_INJECT_COMPLETE or FI_TRANSMIT_COMPLETE, the data visibility at the target is not guaranteed.
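The default rules in the two paragraphs above reduce to a small decision function, sketched here for transfers into ordinary host memory. The sk_level enum and sk_target_data_visible function are hypothetical names for this example; the sketch deliberately omits the match complete and commit complete levels and does not account for device or persistent memory.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the initiator-side completion levels. */
enum sk_level {
    SK_INJECT_COMPLETE,
    SK_TRANSMIT_COMPLETE,
    SK_DELIVERY_COMPLETE
};

/* Can the target assume the transferred data is visible?
 *   initiator_level:   level at which the initiator's completion was written
 *   target_completion: whether the transfer generated a target completion */
static bool sk_target_data_visible(enum sk_level initiator_level,
                                   bool target_completion)
{
    assert(initiator_level <= SK_DELIVERY_COMPLETE);

    /* A target completion defaults to delivery complete (or better),
     * even if the initiator asked for a lower level. */
    if (target_completion)
        return true;

    /* Without a target completion, visibility follows the initiator's
     * level: inject and transmit complete give no guarantee. */
    return initiator_level == SK_DELIVERY_COMPLETE;
}
```

For example, an RMA write that completes at the initiator with transmit complete and produces no target completion yields false here, which is the case where the target has to infer visibility from a later operation.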
One or more of the following mechanisms can be used by the target process to guarantee that the results of a data transfer that did not generate a completion at the target are now visible. This list is not inclusive of all options, but defines common uses. In the descriptions below, the first transfer does not result in a completion event at the target, but is eventually followed by a transfer which does.
The above semantics apply for transfers targeting traditional host memory buffers. However, the behavior may differ when device memory and/or persistent memory is involved (FI_HMEM and FI_PMEM capability bits). When heterogeneous memory is involved, the concept of memory domains comes into play. Memory domains identify the physical separation of memory, which may or may not be accessible through the same virtual address space. See the fi_mr(3) man page for further details on memory domains.
Completion ordering and data visibility are only well-defined for transfers that target the same memory domain. Applications need to be aware of ordering and visibility differences when transfers target different memory domains. Additionally, applications need to be concerned with the memory domain to which completions themselves are written, and whether it differs from the memory domain targeted by a transfer. In some situations, either the provider or application may need to call device specific APIs to synchronize or flush device memory caches in order to achieve the desired data visibility.
When heterogeneous memory is in use, the default target completion semantic for transfers that generate a completion at the target is still FI_DELIVERY_COMPLETE. However, applications should be aware that providers may incur a performance cost to meet this requirement.
For example, a target process may be using a GPU to accelerate computations. A memory region mapping to memory on the GPU may be exposed to peers as either an RMA target or posted locally as a receive buffer. In this case, the application is concerned with two memory domains – system and GPU memory. Completions are written to system memory.
Continuing the example, a peer process sends a tagged message. That message is matched with the receive buffer located in GPU memory. The NIC copies the data from the network into the receive buffer and writes an entry into the completion queue. Note that both memory domains were accessed as part of this transfer. The message data was directed to the GPU memory, but the completion went to host memory. Because separate memory domains may not be synchronized with each other, it is possible for the host CPU to see and process the completion entry before the transfer to the GPU memory is visible to either the host CPU or even software running on the GPU. From the perspective of the provider, visibility of the completion does not imply visibility of data written to the GPU’s memory domain.
The default completion semantic at the target application for message operations is FI_DELIVERY_COMPLETE. An anticipated provider implementation in this situation is for the provider software running on the host CPU to intercept the CQ entry, detect that the data landed in heterogeneous memory, and perform the necessary device synchronization or flush operation before reporting the completion up to the application. This ensures that the data is visible to CPU and GPU software prior to the application processing the completion.
In addition to the cost of provider software intercepting completions and checking if a transfer targeted heterogeneous memory, device synchronization itself may impact performance. As a result, applications can request a lower completion semantic when posting receives. That indicates to the provider that the application will be responsible for handling any device specific flush operations that might be needed. See fi_msg(3) FLAGS.
For data transfers that do not generate a completion at the target, such as RMA or atomics, it is the responsibility of the application to ensure that all target buffers meet the necessary visibility requirements of the application. The previously mentioned bulleted methods for notifying the target that the data is visible may not be sufficient, as the provider software at the target could lack the context needed to ensure visibility. This implies that the application may need to call device synchronization/flush APIs directly.
For example, a peer application could perform several RMA writes that target GPU memory buffers. If the provider offloads RMA operations into the NIC, the provider software at the target will be unaware that the RMA operations have occurred. If the peer sends a message to the target application that indicates that the RMA operations are done, the application must ensure that the RMA data is visible to the host CPU or GPU prior to executing code that accesses the data. The target completion of having received the sent message is not sufficient, even if send-after-write ordering is supported.
Most target heterogeneous memory completion semantics map to FI_TRANSMIT_COMPLETE or FI_DELIVERY_COMPLETE. Persistent memory (FI_PMEM capability), however, is often used with FI_COMMIT_COMPLETE semantics. Heterogeneous completion concepts still apply.
For transfers flagged by the initiator with FI_COMMIT_COMPLETE, a completion at the target indicates that the results are visible and durable. For transfers targeting persistent memory, but using a different completion semantic at the initiator, the visibility at the target is similar to that described above. Durability is only associated with transfers marked with FI_COMMIT_COMPLETE.
For transfers targeting persistent memory that request FI_DELIVERY_COMPLETE, a completion, at either the initiator or target, indicates that the data is visible. Visibility at the target can be conveyed using one of the mechanisms described above: generating a target completion, sending a message from the initiator, etc. Similarly, if the initiator requested FI_TRANSMIT_COMPLETE, then additional steps are needed to ensure visibility at the target. For example, the transfer can generate a completion at the target, which would indicate visibility, but not durability. The initiator can also follow the transfer with another operation that forces visibility, such as using FI_FENCE in conjunction with FI_DELIVERY_COMPLETE.
A completion queue must be bound to at least one enabled endpoint before any operation such as fi_cq_read, fi_cq_readfrom, fi_cq_sread, fi_cq_sreadfrom etc. can be called on it.
Completion flags may be suppressed if the FI_NOTIFY_FLAGS_ONLY mode bit has been set. When enabled, only the following flags are guaranteed to be set in completion data when they are valid: FI_REMOTE_READ and FI_REMOTE_WRITE (when FI_RMA_EVENT capability bit has been set), FI_REMOTE_CQ_DATA, and FI_MULTI_RECV.
If a completion queue has been overrun, it will be placed into an `overrun' state. Read operations will continue to return any valid, non-corrupted completions, if available. After all valid completions have been retrieved, any attempt to read the CQ will result in it returning an FI_EOVERRUN error event. Overrun completion queues are considered fatal and may not be used to report additional completions once the overrun occurs.
fi_cq_open / fi_cq_signal : Returns 0 on success. On error, returns a negative fabric errno.
fi_cq_read / fi_cq_readfrom : On success, returns the number of completions retrieved from the completion queue. On error, returns a negative fabric errno, with these two errors explicitly identified: If no completions are available to read from the CQ, returns -FI_EAGAIN. If the topmost completion is for a failed transfer (an error entry), returns -FI_EAVAIL.
fi_cq_sread / fi_cq_sreadfrom : On success, returns the number of completions retrieved from the completion queue. On error, returns a negative fabric errno, with these two errors explicitly identified: If the timeout expires or the calling thread is signaled and no data is available to be read from the completion queue, returns -FI_EAGAIN. If the topmost completion is for a failed transfer (an error entry), returns -FI_EAVAIL.
fi_cq_readerr : On success, returns the positive value 1 (number of error entries returned). On error, returns a negative fabric errno, with this error explicitly identified: If no error completions are available to read from the CQ, returns -FI_EAGAIN.
fi_cq_strerror : Returns a character string interpretation of the provider specific error returned with a completion.
Fabric errno values are defined in rdma/fi_errno.h.
fi_getinfo(3), fi_endpoint(3), fi_domain(3), fi_eq(3), fi_cntr(3), fi_poll(3)
OpenFabrics.
2022-12-11 | Libfabric Programmer’s Manual |