fabric(7) | #VERSION# | fabric(7) |
fabric - Fabric Interface Library
#include <rdma/fabric.h>
Libfabric is a high-performance fabric software library designed to provide low-latency interfaces to fabric hardware.
Libfabric provides “process direct I/O” to application software communicating across fabric software and hardware. Process direct I/O, historically referred to as RDMA, allows an application to directly access network resources without operating system intervention. Data transfers can occur directly to and from application memory.
There are two components to the libfabric software:
The fabric interfaces are designed such that they are cohesive and not simply a union of disjoint interfaces. The interfaces are logically divided into two groups: control interfaces and communication operations. The control interfaces are a common set of operations that provide access to local communication resources, such as address vectors and event queues. The communication operations expose particular models of communication and fabric functionality, such as message queues, remote memory access, and atomic operations. Communication operations are associated with fabric endpoints.
Applications will typically use the control interfaces to discover local capabilities and allocate necessary resources. They will then allocate and configure a communication endpoint to send and receive data, or perform other types of data transfers, with remote endpoints.
The control interface APIs provide applications with access to network resources. This involves listing all available interfaces, obtaining the capabilities of those interfaces, and opening a provider.
fi_getinfo returns a list of fi_info structures. Each structure references a single fabric provider, indicating the interfaces that the provider supports, along with a named set of resources. A fabric provider may include multiple fi_info structures in the returned list.
Fabric endpoints are associated with multiple data transfer interfaces. Each interface set is designed to support a specific style of communication, with an endpoint allowing the different interfaces to be used in conjunction. The following data transfer interfaces are defined by libfabric.
Logging can be controlled using the FI_LOG_LEVEL, FI_LOG_PROV, and FI_LOG_SUBSYS environment variables.
Example: To enable logging from the psm and sockets providers: FI_LOG_PROV="psm,sockets"
Example: To enable logging from providers other than psm: FI_LOG_PROV="^psm"
The libfabric build scripts will install all providers that are supported by the installation system. Providers that are missing build prerequisites will be disabled. Installed providers will dynamically check for necessary hardware on library initialization and respond appropriately to application queries.
Users can enable or disable available providers through build configuration options. See “configure --help” for details. In general, a specific provider can be controlled using the configure option “--enable-<provider>”. For example, “--enable-udp” (or “--enable-udp=yes”) will add the udp provider to the build. To disable the provider, “--enable-udp=no” can be used.
Providers can also be enabled or disabled at run time using the FI_PROVIDER environment variable. The FI_PROVIDER variable is set to a comma-separated list of providers to include. If the list begins with the “^” symbol, the list is negated: the listed providers are excluded and all others are included.
Example: To enable only the udp and tcp providers, set: FI_PROVIDER="udp,tcp"
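FI_PROVIDER and FI_LOG_PROV share the same list syntax: comma-separated provider names, optionally negated by a leading “^”. The sketch below shows one way to interpret that syntax. hyp_filter_allows is a hypothetical helper written for illustration, not a libfabric function, and the library’s own matching logic may differ in its details.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (NOT a libfabric API): returns true if provider
 * `prov` passes a filter written in the FI_PROVIDER / FI_LOG_PROV syntax,
 * e.g. "udp,tcp" (only these) or "^psm" (everything except psm). */
static bool hyp_filter_allows(const char *filter, const char *prov)
{
    /* An unset or empty filter allows every provider. */
    if (filter == NULL || *filter == '\0')
        return true;

    /* A leading '^' negates the list: listed providers are excluded. */
    bool negated = (filter[0] == '^');
    const char *p = negated ? filter + 1 : filter;
    size_t prov_len = strlen(prov);
    bool listed = false;

    while (*p != '\0') {
        /* Locate the end of the current comma-separated token. */
        const char *end = strchr(p, ',');
        size_t tok_len = end ? (size_t)(end - p) : strlen(p);

        /* Require an exact name match, not just a matching prefix. */
        if (tok_len == prov_len && strncmp(p, prov, tok_len) == 0) {
            listed = true;
            break;
        }
        if (end == NULL)
            break;
        p = end + 1;
    }

    return negated ? !listed : listed;
}
```

With FI_PROVIDER="udp,tcp", only udp and tcp pass this check; with FI_LOG_PROV="^psm", every provider except psm passes.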
The fi_info utility, which is included as part of the libfabric package, can be used to retrieve information about which providers are available in the system. Additionally, it can retrieve a list of all environment variables that may be used to configure libfabric and each provider. See fi_info(1) for more details.
Core features of libfabric and its providers may be configured by an administrator through environment variables. Man pages usually describe only the most commonly used variables, such as those mentioned above. However, libfabric also defines interfaces for publishing and obtaining environment variables. These are intended primarily for providers, but they also allow applications and users to obtain the full list of variables that may be set, along with a brief description of each.
A full list of available variables may be obtained by running the fi_info application with the -e or --env command line option.
Because libfabric is designed to provide applications direct access to fabric hardware, there are limits on how libfabric resources may be used in conjunction with system calls. These limitations are especially relevant to developers who are familiar with programming to the sockets interface. Although limits are provider-specific, the following restrictions apply to many providers and should be adhered to by applications desiring portability across providers.
In some cases, calls to cudaMemcpy within libfabric may result in a deadlock. This typically occurs when a CUDA kernel blocks until a cudaMemcpy on the host completes. To avoid this deadlock, cudaMemcpy may be disabled by setting FI_HMEM_CUDA_ENABLE_XFER=0. If this environment variable is set and libfabric would otherwise call cudaMemcpy, a warning is emitted and no copy occurs. Note that not all providers support this option.
Another mechanism that can be used to avoid deadlock is NVIDIA’s gdrcopy. Using gdrcopy requires an external library and kernel module available at https://github.com/NVIDIA/gdrcopy. Libfabric must be configured with gdrcopy support using the --with-gdrcopy option, and be run with FI_HMEM_CUDA_USE_GDRCOPY=1. This may be used in conjunction with the above option to provide a method for copying to/from CUDA device memory when cudaMemcpy cannot be used. Again, this may not be supported by all providers.
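An application that cannot rely on its users exporting these variables could set them itself before making any libfabric call. The following is a minimal sketch, assuming the variables are read when libfabric initializes (that is, on the first call such as fi_getinfo). hyp_set_cuda_copy_env is a hypothetical name, and whether a given provider honors either variable still depends on the provider.

```c
#define _POSIX_C_SOURCE 200112L /* expose setenv() under strict compilers */
#include <stdlib.h>

/* Hypothetical helper (NOT a libfabric API). Call it before the first
 * libfabric call, assuming the library reads these variables at
 * initialization. Returns 0 on success, -1 if setenv() fails. */
static int hyp_set_cuda_copy_env(int disable_cudamemcpy, int use_gdrcopy)
{
    /* Equivalent to exporting FI_HMEM_CUDA_ENABLE_XFER=0 in the shell. */
    if (disable_cudamemcpy &&
        setenv("FI_HMEM_CUDA_ENABLE_XFER", "0", 1) != 0)
        return -1;

    /* Equivalent to exporting FI_HMEM_CUDA_USE_GDRCOPY=1; this only has
     * an effect if libfabric was configured with --with-gdrcopy. */
    if (use_gdrcopy &&
        setenv("FI_HMEM_CUDA_USE_GDRCOPY", "1", 1) != 0)
        return -1;

    return 0;
}
```

Setting the variables from inside the process keeps the configuration next to the code that depends on it, at the cost of overriding any value the user exported.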
libfabric releases maintain compatibility with older releases, so that compiled applications continue to work as-is, and previously written applications compile against newer versions of the library without source code changes. The entries below describe the ABI updates that have occurred and the libfabric release in which each took effect.
Note that most functions called by applications are actually static inline functions, which reference function pointers in order to call directly into providers. As a result, libfabric exports only a handful of functions directly. ABI changes are limited to those functions, most notably the fi_getinfo call and its returned attribute structures.
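The static inline dispatch pattern can be sketched with hypothetical, heavily simplified types. In libfabric itself, calls such as fi_send take additional parameters (descriptor, destination address, context) and operate on the library’s own structures; the names hyp_ep, hyp_ops_msg, hyp_send, and mock_prov_send below are invented for illustration only.

```c
#include <stddef.h>
#include <sys/types.h> /* ssize_t */

/* Hypothetical, simplified types modeled on the pattern described above;
 * these are NOT libfabric's real structure definitions. */
struct hyp_ep;

/* Table of data transfer operations filled in by a provider. */
struct hyp_ops_msg {
    ssize_t (*send)(struct hyp_ep *ep, const void *buf, size_t len);
};

/* Endpoint object handed to the application; it carries the provider's
 * operations table. */
struct hyp_ep {
    struct hyp_ops_msg *msg;
    size_t bytes_sent; /* state private to the mock provider below */
};

/* Application-facing call: a static inline wrapper compiled into the
 * application itself. It jumps straight into the provider through a
 * function pointer, so no exported library symbol is involved. */
static inline ssize_t hyp_send(struct hyp_ep *ep, const void *buf, size_t len)
{
    return ep->msg->send(ep, buf, len);
}

/* Mock "provider" implementation of send: records the length only. */
static ssize_t mock_prov_send(struct hyp_ep *ep, const void *buf, size_t len)
{
    (void)buf; /* a real provider would post the buffer to hardware */
    ep->bytes_sent += len;
    return (ssize_t)len;
}

/* Operations table a provider would install when opening an endpoint. */
static struct hyp_ops_msg mock_prov_msg_ops = { .send = mock_prov_send };
```

Because hyp_send is compiled into the application and only follows a pointer supplied at run time, changing a provider’s implementation does not alter any exported library symbol.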
The ABI version is independent from the libfabric release version.
The initial libfabric release (1.0.0) also corresponds to ABI version 1.0. The 1.0 ABI was unchanged for libfabric major.minor versions 1.0, 1.1, 1.2, 1.3, and 1.4.
Starting with libfabric version 1.5, fields were appended to a number of external data structures. These changes included adding fields to the following data structures. The 1.1 ABI was exported by libfabric versions 1.5 and 1.6.
The 1.2 ABI version was exported by libfabric versions 1.7 and 1.8, and expanded the following structure.
The 1.3 ABI version was exported by libfabric versions 1.9, 1.10, and 1.11. Added new fields to the following attributes:
The 1.4 ABI version was exported by libfabric 1.12. Added fi_tostr_r, a thread-safe (re-entrant) version of fi_tostr.
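The release-to-ABI mapping described above for libfabric 1.0 through 1.12 can be summarized as a small lookup. This is a hypothetical helper written for illustration only; libfabric does not provide such a function, and releases from 1.13 onward are outside the scope of this sketch.

```c
/* Hypothetical type and helper for illustration; NOT part of libfabric. */
struct hyp_abi_version {
    int major;
    int minor;
};

/* Looks up the ABI version exported by libfabric release
 * rel_major.rel_minor, following the history above.
 * Returns 0 and fills *out on success, or -1 for releases this
 * sketch does not cover (anything outside 1.0 through 1.12). */
static int hyp_abi_for_release(int rel_major, int rel_minor,
                               struct hyp_abi_version *out)
{
    if (rel_major != 1 || rel_minor < 0 || rel_minor > 12)
        return -1;

    out->major = 1;      /* every ABI version listed here is 1.x */
    if (rel_minor <= 4)
        out->minor = 0;  /* releases 1.0 through 1.4 */
    else if (rel_minor <= 6)
        out->minor = 1;  /* releases 1.5 and 1.6 */
    else if (rel_minor <= 8)
        out->minor = 2;  /* releases 1.7 and 1.8 */
    else if (rel_minor <= 11)
        out->minor = 3;  /* releases 1.9 through 1.11 */
    else
        out->minor = 4;  /* release 1.12 (added fi_tostr_r) */
    return 0;
}
```

For instance, looking up release 1.10 yields ABI 1.3, which illustrates that the ABI version advances independently of the release version.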
ABI version starting with libfabric 1.13. Added new fi_open API call.
ABI version starting with libfabric 1.14. Added fi_log_ready for providers.
fi_info(1), fi_provider(7), fi_getinfo(3), fi_endpoint(3), fi_domain(3), fi_av(3), fi_eq(3), fi_cq(3), fi_cntr(3), fi_mr(3)
OpenFabrics.
2022-12-11 | Libfabric Programmer’s Manual |