librpma(7) | RPMA Programmer's Manual | librpma(7) |
librpma - remote persistent memory access library
#include <librpma.h>
cc ... -lrpma
librpma is a C library to simplify accessing persistent memory (PMem) on remote hosts over Remote Direct Memory Access (RDMA).
The librpma library provides two schemes of operation: Remote Memory Access and Messaging. Both are available over a connection established between two peers, and both can use PMem as well as DRAM to build efficient and scalable Remote Persistent Memory Access (RPMA) applications.
The librpma library implements four basic API calls dedicated to accessing remote memory:
All the above functions use the attribute flags to set the completion notification indicator:
All of these operations are considered finished when the respective completion is generated.
Direct Write to PMem is a feature of a platform and its configuration that allows an RDMA-capable network interface to write data to the platform's PMem persistently. It may be unavailable, for example because of caching mechanisms on the data path. When Direct Write to PMem is unavailable, operating as if it were available may corrupt data on PMem, which is why Direct Write to PMem is not enabled by default.
On current Intel platforms, the only thing required to enable Direct Write to PMem is to turn off Intel Direct Data I/O (DDIO). Depending on the platform, DDIO can be turned off either globally for the whole platform or for a specific PCIe Root Port. For details, please see the manual of your platform.
When your platform allows Direct Write to PMem, you have to declare this in your peer's configuration. The peer's configuration then has to be transferred to every peer that wants to execute rpma_flush() with RPMA_FLUSH_TYPE_PERSISTENT against the platform's PMem, and applied to the connection object which safeguards access to PMem.
For details on how to use these APIs please see https://github.com/pmem/rpma/tree/main/examples/05-flush-to-persistent.
A client is the active side of the process of establishing a connection. The role of a peer while establishing a connection does not determine the direction of the data flow (neither via Remote Memory Access nor via Messaging). After the connection is established, both peers have the same capabilities.
The client, in order to establish a connection, has to perform the following steps:
After establishing the connection both peers can perform Remote Memory Access and/or Messaging over the connection.
The client, in order to close a connection, has to perform the following steps:
A server is the passive side of the process of establishing a connection. Note that after establishing the connection both peers have the same capabilities.
The server, in order to establish a connection, has to perform the following steps:
After establishing the connection both peers can perform Remote Memory Access and/or Messaging over the connection.
The server, in order to close a connection, has to perform the following steps:
When no more incoming connections are expected, the server can stop waiting for them:
Every piece of memory (either volatile or persistent) must be registered and its usage must be specified in order to be used in Remote Memory Access or Messaging. This can be done using the following memory management librpma functions:
A description of the registered memory region sometimes has to be transferred over the network to the other side of the connection. To do that, a network-transferable description of the provided memory region (called a 'descriptor') has to be created using rpma_mr_get_descriptor(). On the other side of the connection, the received descriptor should be decoded using rpma_mr_remote_from_descriptor(), which creates a remote memory region structure that allows for Remote Memory Access.
The librpma messaging API allows transferring messages (buffers of arbitrary data) between the peers. Transferring messages requires preparing buffers (memory regions) on the remote side to receive the sent data. The received data are written to those dedicated buffers, so the sender does not need a corresponding remote memory region object to send a message. The memory buffers used for messaging have to be registered using rpma_mr_reg() before calling rpma_send() or rpma_recv().
The librpma library implements the following messaging API:
All of these operations are considered finished when the respective completion is generated.
RDMA operations generate completions that notify a user that the respective operation has been completed.
The following operations are available in librpma:
All operations generate a completion on error. The operations posted with the RPMA_F_COMPLETION_ALWAYS flag also generate a completion on success. Completion codes are reused from the libibverbs library, where the IBV_WC_SUCCESS status indicates the successful completion of an operation. Completions are collected in the completion queue (CQ) (see the QUEUES, PERFORMANCE AND RESOURCE USE section for more details on queues).
The librpma library implements the following API for handling completions:
A peer is an abstraction representing an RDMA-capable device. All other RPMA objects have to be created in the context of a peer. A peer allows one to:
First, in order to create a peer, a user has to obtain an RDMA device context for the given IPv4/IPv6 address using rpma_utils_get_ibv_context(). Then a new peer object can be created using rpma_peer_new() and deleted using rpma_peer_delete().
By default, all endpoints and connections operate in the synchronous mode where:
are blocking calls. You can make those API calls non-blocking by modifying the respective file descriptors:
When you have a file descriptor, you can make it non-blocking using fcntl(2) as follows:
int flags = fcntl(fd, F_GETFL);
fcntl(fd, F_SETFL, flags | O_NONBLOCK);
Such a change makes the respective API call non-blocking automatically.
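The fcntl(2) sequence above can be wrapped in a helper that also checks for errors. The following is a minimal sketch, not part of librpma: set_nonblocking() and demo_nonblocking_pipe() are hypothetical names, and an ordinary pipe stands in for a file descriptor obtained from the library.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Switch a file descriptor to non-blocking mode.
 * Returns 0 on success or -1 on failure (errno is set by fcntl(2)).
 * Hypothetical helper, not a librpma function.
 */
int
set_nonblocking(int fd)
{
	assert(fd >= 0);

	/* read the current file status flags */
	int flags = fcntl(fd, F_GETFL);
	if (flags == -1)
		return -1;

	/* add O_NONBLOCK while keeping all the other flags intact */
	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
		return -1;

	return 0;
}

/*
 * Demonstration on a pipe that stands in for a librpma descriptor:
 * reading from an empty non-blocking pipe fails with EAGAIN instead of
 * blocking. Returns 1 if that is what happened, 0 otherwise.
 */
int
demo_nonblocking_pipe(void)
{
	int fds[2];
	char byte;
	int ok = 0;

	if (pipe(fds) != 0)
		return 0;

	if (set_nonblocking(fds[0]) == 0) {
		ssize_t n = read(fds[0], &byte, 1);
		ok = (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));
	}

	close(fds[0]);
	close(fds[1]);
	return ok;
}
```

Checking the result of F_GETFL before using it avoids passing -1 as a flag set to F_SETFL when the descriptor is invalid.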
The provided file descriptors can also be used for scalable I/O handling like epoll(7).
Please see the example showing how to make use of RPMA file descriptors: https://github.com/pmem/rpma/tree/main/examples/06-multiple-connections
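As a rough illustration of the epoll(7) approach, the sketch below waits for a single descriptor to become readable. It is Linux-specific and uses only the epoll API; wait_readable() and demo_epoll_pipe() are hypothetical names, and a pipe again stands in for a descriptor obtained from librpma. A real application would typically keep one epoll instance open and register many descriptors with it instead of creating one per call.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Wait until fd becomes readable or timeout_ms elapses.
 * Returns 1 if fd is readable, 0 on timeout, -1 on error.
 * Hypothetical helper, not a librpma function.
 */
int
wait_readable(int fd, int timeout_ms)
{
	assert(fd >= 0);

	int epfd = epoll_create1(0);
	if (epfd == -1)
		return -1;

	/* register interest in "data available to read" on fd */
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
		close(epfd);
		return -1;
	}

	struct epoll_event out;
	int n = epoll_wait(epfd, &out, 1, timeout_ms);
	close(epfd);

	if (n < 0)
		return -1;
	return (n == 1 && (out.events & EPOLLIN)) ? 1 : 0;
}

/*
 * Demonstration on a pipe standing in for a librpma descriptor.
 * Returns 1 when the empty pipe times out and the pipe with pending
 * data is reported readable, 0 otherwise.
 */
int
demo_epoll_pipe(void)
{
	int fds[2];

	if (pipe(fds) != 0)
		return 0;

	/* nothing written yet: a zero timeout returns immediately */
	int before = wait_readable(fds[0], 0);

	/* one byte makes the read end readable */
	int wrote = (write(fds[1], "x", 1) == 1);
	int after = wait_readable(fds[0], 1000);

	close(fds[0]);
	close(fds[1]);
	return before == 0 && wrote && after == 1;
}
```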
Remote Memory Access operations, Messaging operations and their Completions consume space in queues allocated in an RDMA-capable network interface (RNIC) hardware for each of the connections. You must be aware of the existence of these queues:
You must assume SQ and RQ entries occupy space in their respective queue until:
You must also be aware that an RNIC has limited resources, so it cannot hold very long queues for many simultaneously existing connections. If the queues do not all fit into the RNIC's resources, it will start using the platform's memory for this purpose. In this case, the performance will be degraded because of inevitable cache misses.
Because the length of the queues has such a profound impact on the performance of an RPMA application, you can configure the length of each of the queues separately for each of the connections:
When the connection configuration object is ready it has to be used for either rpma_conn_req_new() or rpma_ep_next_conn_req() for the settings to take effect.
The analysis of thread safety of the librpma library is described in detail in the THREAD_SAFETY.md file:
https://github.com/pmem/rpma/blob/main/THREAD_SAFETY.md
On-Demand Paging (ODP) is a technique that simplifies the memory registration process (for example, applications no longer need to pin down the underlying physical pages of the address space and track the validity of the mappings). On-Demand Paging is available if both the hardware and the kernel support it. The detailed description of ODP can be found here:
https://community.mellanox.com/s/article/understanding-on-demand-paging--odp-x
The state of ODP support can be checked using the rpma_utils_ibv_context_is_odp_capable() function, which queries the RDMA device context's capabilities and checks whether it supports On-Demand Paging.
The librpma library uses ODP automatically if it is supported. ODP support is required to register a PMem memory region mapped from File System DAX (FSDAX).
If a librpma function may fail, it returns a negative error code. Checking if the returned value is non-negative is the only programmatically available way to verify if the API call succeeded. The exact meaning of all error codes is described in the manual of each function.
The librpma library implements the logging API which may give additional information in case of an error and during normal operation as well, according to the current logging threshold levels.
The function that will handle all generated log messages can be set using rpma_log_set_function(). The logging function can be either the default logging function (built into the library) or a user-defined, thread-safe function. The default logging function can write messages to syslog(3) and stderr(3). The logging threshold level can be set or retrieved using rpma_log_set_threshold() or rpma_log_get_threshold() respectively.
An example of using the logging functions is available at: https://github.com/pmem/rpma/tree/main/examples/log
See https://github.com/pmem/rpma/tree/main/examples for examples of using the librpma API.
librpma is built on top of the libibverbs and librdmacm APIs.
API calls marked as deprecated should be avoided, because they will be removed in a new major release.
NOTE: API calls deprecated in a 0.X release will usually be removed in the 0.(X+1) release.
https://pmem.io/rpma/
10 January 2023 | RPMA |