The io_uring_setup(2) system call sets up a submission
queue (SQ) and completion queue (CQ) with at least entries entries,
and returns a file descriptor which can be used to perform subsequent
operations on the io_uring instance. The submission and completion queues
are shared between userspace and the kernel, which eliminates the need to
copy data when initiating and completing I/O.
params is used by the application to pass options to the
kernel, and by the kernel to convey information about the ring buffers.
struct io_uring_params {
__u32 sq_entries;
__u32 cq_entries;
__u32 flags;
__u32 sq_thread_cpu;
__u32 sq_thread_idle;
__u32 features;
__u32 wq_fd;
__u32 resv[3];
struct io_sqring_offsets sq_off;
struct io_cqring_offsets cq_off;
};
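Since glibc provides no wrapper for io_uring_setup(2), the call is made
through syscall(2). The following minimal sketch (the ring_setup() helper
name is illustrative, not part of any API) creates a small ring with
default options:
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Illustrative wrapper; glibc provides none. */
static int ring_setup(unsigned entries, struct io_uring_params *p)
{
    return (int) syscall(__NR_io_uring_setup, entries, p);
}

int main(void)
{
    struct io_uring_params p;

    memset(&p, 0, sizeof(p));          /* flags and resv must be zero */
    int ring_fd = ring_setup(8, &p);   /* at least 8 SQ entries */
    if (ring_fd < 0)
        return 1;                      /* errno describes the failure */
    /* p.sq_off and p.cq_off now describe the ring buffer layout */
    close(ring_fd);
    return 0;
}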
The flags, sq_thread_cpu, and sq_thread_idle
fields are used to configure the io_uring instance. flags is a bit
mask of 0 or more of the following values ORed together:
- IORING_SETUP_IOPOLL
- Perform busy-waiting for an I/O completion, as opposed to getting
notifications via an asynchronous IRQ (Interrupt Request). The file system
(if any) and block device must support polling in order for this to work.
Busy-waiting provides lower latency, but may consume more CPU resources
than interrupt driven I/O. Currently, this feature is usable only on a
file descriptor opened using the O_DIRECT flag. When a read or
write is submitted to a polled context, the application must poll for
completions on the CQ ring by calling io_uring_enter(2). It is
illegal to mix and match polled and non-polled I/O on an io_uring
instance.
- IORING_SETUP_SQPOLL
- When this flag is specified, a kernel thread is created to perform
submission queue polling. An io_uring instance configured in this way
enables an application to issue I/O without ever context switching into
the kernel. By using the submission queue to fill in new submission queue
entries and watching for completions on the completion queue, the
application can submit and reap I/Os without doing a single system call.
If the kernel thread is idle for more than
sq_thread_idle milliseconds, it will set the
IORING_SQ_NEED_WAKEUP bit in the flags field of the
struct io_sq_ring. When this happens, the application must call
io_uring_enter(2) to wake the kernel thread. If I/O is kept busy,
the kernel thread will never sleep. An application making use of this
feature will need to guard the io_uring_enter(2) call with the
following code sequence:
/*
* Ensure that the wakeup flag is read after the tail pointer
* has been written. It's important to use memory load acquire
* semantics for the flags read, as otherwise the application
* and the kernel might not agree on the consistency of the
* wakeup flag.
*/
unsigned flags = atomic_load_acquire(sq_ring->flags);
if (flags & IORING_SQ_NEED_WAKEUP)
    io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
where sq_ring is a submission queue ring set up using
the struct io_sqring_offsets described below.
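atomic_load_acquire() is not a standard C function. A minimal sketch of
one possible implementation in terms of C11 atomics might be:
#include <stdatomic.h>

/* Sketch: read *p with acquire ordering; p points into the shared ring */
#define atomic_load_acquire(p) \
    atomic_load_explicit((_Atomic __typeof__(*(p)) *)(p), \
                         memory_order_acquire)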
- Before version 5.11 of the Linux kernel, to successfully use this feature,
the application must register a set of files to be used for IO through
io_uring_register(2) using the IORING_REGISTER_FILES opcode.
Failure to do so will result in submitted IO being errored with
EBADF. The presence of this feature can be detected by the
IORING_FEAT_SQPOLL_NONFIXED feature flag. In version 5.11 and
later, it is no longer necessary to register files to use this feature.
5.11 also allows using this as non-root, if the user has the
CAP_SYS_NICE capability.
- IORING_SETUP_SQ_AFF
- If this flag is specified, then the poll thread will be bound to the cpu
set in the sq_thread_cpu field of the struct
io_uring_params. This flag is only meaningful when
IORING_SETUP_SQPOLL is specified. When the cgroup setting
cpuset.cpus changes (typically in a container environment), the
bound CPU set may change as well.
- IORING_SETUP_CQSIZE
- Create the completion queue with struct io_uring_params.cq_entries
entries. The value must be greater than entries, and may be rounded
up to the next power-of-two.
- IORING_SETUP_CLAMP
- If this flag is specified, and if entries exceeds
IORING_MAX_ENTRIES, then entries will be clamped at
IORING_MAX_ENTRIES. If the flag IORING_SETUP_SQPOLL is set,
and if the value of struct io_uring_params.cq_entries exceeds
IORING_MAX_CQ_ENTRIES, then it will be clamped at
IORING_MAX_CQ_ENTRIES.
- IORING_SETUP_ATTACH_WQ
- This flag should be set in conjunction with struct
io_uring_params.wq_fd being set to an existing io_uring ring file
descriptor. When set, the io_uring instance being created will share the
asynchronous worker thread backend of the specified io_uring ring, rather
than create a new separate thread pool.
- IORING_SETUP_R_DISABLED
- If this flag is specified, the io_uring ring starts in a disabled state.
In this state, restrictions can be registered, but submissions are not
allowed. See io_uring_register(2) for details on how to enable the
ring. Available since 5.10.
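As a sketch, such a ring might later be enabled with the
IORING_REGISTER_ENABLE_RINGS opcode (invoked here through a raw
syscall(2), as no glibc wrapper exists):
/* Sketch: enable a ring created with IORING_SETUP_R_DISABLED */
syscall(__NR_io_uring_register, ring_fd,
        IORING_REGISTER_ENABLE_RINGS, NULL, 0);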
- IORING_SETUP_SUBMIT_ALL
- Normally io_uring stops submitting a batch of requests if one of these
requests results in an error. This can cause fewer requests than expected
to be submitted, if a request ends in error while being submitted. If the
ring is created with this flag, io_uring_enter(2) will continue
submitting requests even if it encounters an error submitting a request.
CQEs are still posted for errored requests regardless of whether or not
this flag is set at ring creation time; the only difference is whether
the submit sequence is halted or continued when an error is observed.
Available since 5.18.
- IORING_SETUP_COOP_TASKRUN
- By default, io_uring will interrupt a task running in userspace when a
completion event comes in. This is to ensure that completions run in a
timely manner. For many use cases this is overkill and can reduce
performance, due to the inter-processor interrupt used to do it, the
kernel/user transition, the needless interruption of the task's userspace
activities, and reduced batching if completions come in at a rapid rate.
Most applications don't need the forceful interruption, as the events are
processed at any kernel/user transition. The exception is
setups where the application uses multiple threads operating on the same
ring, where the application waiting on completions isn't the one that
submitted them. For most other use cases, setting this flag will improve
performance. Available since 5.19.
- IORING_SETUP_TASKRUN_FLAG
- Used in conjunction with IORING_SETUP_COOP_TASKRUN, this provides a
flag, IORING_SQ_TASKRUN, which is set in the SQ ring flags
whenever completions are pending that should be processed. liburing will
check for this flag even when doing io_uring_peek_cqe(3) and enter
the kernel to process them, and applications can do the same. This makes
IORING_SETUP_TASKRUN_FLAG safe to use even when applications rely
on a peek style operation on the CQ ring to see if anything might be
pending to reap. Available since 5.19.
- IORING_SETUP_SQE128
- If set, io_uring will use 128-byte SQEs rather than the normal 64-byte
sized variant. This is a requirement for using certain request types; as
of 5.19, only the IORING_OP_URING_CMD command for NVMe passthrough
needs this. Available since 5.19.
- IORING_SETUP_CQE32
- If set, io_uring will use 32-byte CQEs rather than the normal 16-byte
sized variant. This is a requirement for using certain request types; as
of 5.19, only the IORING_OP_URING_CMD command for NVMe passthrough
needs this. Available since 5.19.
- IORING_SETUP_SINGLE_ISSUER
- A hint to the kernel that only a single task (or thread) will submit
requests, which is used for internal optimisations. The submission task is
either the task that created the ring, or if
IORING_SETUP_R_DISABLED is specified then it is the task that
enables the ring through io_uring_register(2). The kernel
enforces this rule, failing requests with -EEXIST if the
restriction is violated. Note that when IORING_SETUP_SQPOLL is set,
the polling task is considered to be doing all submissions on behalf of
userspace, so the rule is always satisfied regardless of how many
userspace tasks call io_uring_enter(2). Available since 6.0.
- IORING_SETUP_DEFER_TASKRUN
- By default, io_uring will process all outstanding work at the end of any
system call or thread interrupt. This can delay the application from
making other progress. Setting this flag will hint to io_uring that it
should defer work until an io_uring_enter(2) call with the
IORING_ENTER_GETEVENTS flag set. This allows the application to
request work to run just before it wants to process completions. This flag
requires the IORING_SETUP_SINGLE_ISSUER flag to be set, and also
enforces that io_uring_enter(2) is called from the same
thread that submitted requests. Note that if this flag is set then it is
the application's responsibility to periodically trigger work (for example
via any of the CQE waiting functions) or else completions may not be
delivered. Available since 6.1.
If no flags are specified, the io_uring instance is set up for
interrupt driven I/O. I/O may be submitted using io_uring_enter(2)
and can be reaped by polling the completion queue.
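As an illustrative sketch (reusing the hypothetical ring_setup() wrapper
shown earlier, with arbitrary values), an SQPOLL ring bound to one CPU
could be requested as follows:
struct io_uring_params p;

memset(&p, 0, sizeof(p));
p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
p.sq_thread_cpu = 1;          /* bind the poll thread to CPU 1 */
p.sq_thread_idle = 2000;      /* poll thread may sleep after 2000 ms idle */

int ring_fd = ring_setup(64, &p);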
The resv array must be initialized to zero.
features is filled in by the kernel, and specifies
various features supported by the current kernel version:
- IORING_FEAT_SINGLE_MMAP
- If this flag is set, the SQ and CQ rings can be mapped with a single
mmap(2) call. The SQEs must still be allocated separately. This
brings the necessary mmap(2) calls down from three to two.
Available since kernel 5.4.
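A sketch of the single-mapping approach, sizing the region to cover
whichever ring ends later (variable names are illustrative):
size_t sring_sz = p.sq_off.array + p.sq_entries * sizeof(__u32);
size_t cring_sz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);
size_t ring_sz = sring_sz > cring_sz ? sring_sz : cring_sz;

void *sq_ptr = mmap(0, ring_sz, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_POPULATE, ring_fd,
                    IORING_OFF_SQ_RING);
void *cq_ptr = sq_ptr;        /* the same mapping covers the CQ ring */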
- IORING_FEAT_NODROP
- If this flag is set, io_uring supports almost never dropping completion
events. If a completion event occurs and the CQ ring is full, the kernel
stores the event internally until such a time that the CQ ring has room
for more entries. If this overflow condition is entered, attempting to
submit more IO will fail with the -EBUSY error value, if it can't
flush the overflown events to the CQ ring. If this happens, the
application must reap events from the CQ ring and attempt the submit
again. If the kernel has no free memory to store the event internally,
this will be visible as an increase in the overflow value on the CQ ring.
Available since kernel 5.5. Additionally, io_uring_enter(2) will
return -EBADR the next time it would otherwise sleep waiting for
completions (since kernel 5.19).
- IORING_FEAT_SUBMIT_STABLE
- If this flag is set, applications can be certain that any data for async
offload has been consumed when the kernel has consumed the SQE. Available
since kernel 5.5.
- IORING_FEAT_RW_CUR_POS
- If this flag is set, applications can specify offset == -1 with
IORING_OP_{READV,WRITEV}, IORING_OP_{READ,WRITE}_FIXED,
and IORING_OP_{READ,WRITE} to mean the current file position, which
behaves like preadv2(2) and pwritev2(2) with offset
== -1. It'll use (and update) the current file position. This
obviously comes with the caveat that if the application has multiple reads
or writes in flight, then the end result will not be as expected. This is
similar to threads sharing a file descriptor and doing IO using the
current file position. Available since kernel 5.6.
- IORING_FEAT_CUR_PERSONALITY
- If this flag is set, then io_uring guarantees that both sync and async
execution of a request assumes the credentials of the task that called
io_uring_enter(2) to queue the requests. If this flag isn't set,
then requests are issued with the credentials of the task that originally
registered the io_uring. If only one task is using a ring, then this flag
doesn't matter as the credentials will always be the same. Note that this
is the default behavior; tasks can still register different personalities
through io_uring_register(2) with
IORING_REGISTER_PERSONALITY and specify the personality to use in
the sqe. Available since kernel 5.6.
- IORING_FEAT_FAST_POLL
- If this flag is set, then io_uring supports using an internal poll
mechanism to drive data/space readiness. This means that requests that
cannot read or write data to a file no longer need to be punted to an
async thread for handling, instead they will begin operation when the file
is ready. This is similar to doing poll + read/write in userspace, but
eliminates the need to do so. If this flag is set, requests waiting on
space/data consume a lot less resources doing so as they are not blocking
a thread. Available since kernel 5.7.
- IORING_FEAT_POLL_32BITS
- If this flag is set, the IORING_OP_POLL_ADD command accepts the
full 32-bit range of epoll-based flags, most notably EPOLLEXCLUSIVE,
which allows exclusive (single-waiter wakeup) behavior. Available since
kernel 5.9.
- IORING_FEAT_SQPOLL_NONFIXED
- If this flag is set, the IORING_SETUP_SQPOLL feature no longer
requires the use of fixed files. Any normal file descriptor can be used
for IO commands without needing registration. Available since kernel
5.11.
- IORING_FEAT_ENTER_EXT_ARG
- If this flag is set, then the io_uring_enter(2) system call
supports passing in an extended argument instead of just the
sigset_t of earlier kernels. This extended argument is of type
struct io_uring_getevents_arg and allows the caller to pass in both
a sigset_t and a timeout argument for waiting on events. The struct
layout is as follows:
struct io_uring_getevents_arg {
__u64 sigmask;
__u32 sigmask_sz;
__u32 pad;
__u64 ts;
};
and a pointer to this struct must be passed in if
IORING_ENTER_EXT_ARG is set in the flags for the enter system call.
Available since kernel 5.11.
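As a sketch, waiting for one completion with a one second timeout might
look as follows (raw syscall(2) invocation; struct __kernel_timespec
comes from linux/time_types.h and _NSIG from signal.h):
struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
struct io_uring_getevents_arg arg = {
    .sigmask    = 0,                        /* no signal mask */
    .sigmask_sz = _NSIG / 8,
    .ts         = (__u64)(uintptr_t) &ts,   /* timeout */
};

syscall(__NR_io_uring_enter, ring_fd, 0, 1,
        IORING_ENTER_GETEVENTS | IORING_ENTER_EXT_ARG,
        &arg, sizeof(arg));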
- IORING_FEAT_NATIVE_WORKERS
- If this flag is set, io_uring is using native workers for its async
helpers. Previous kernels used kernel threads that assumed the identity of
the original io_uring owning task, but later kernels will actively create
what looks more like regular process threads instead. Available since
kernel 5.12.
- IORING_FEAT_RSRC_TAGS
- If this flag is set, then io_uring supports a variety of features related
to fixed files and buffers. In particular, it indicates that registered
buffers can be updated in-place, whereas before the full set would have to
be unregistered first. Available since kernel 5.13.
- IORING_FEAT_CQE_SKIP
- If this flag is set, then io_uring supports setting
IOSQE_CQE_SKIP_SUCCESS in the submitted SQE, indicating that no CQE
should be generated for this SQE if it executes normally. If an error
happens processing the SQE, a CQE with the appropriate error value will
still be generated. Available since kernel 5.17.
- IORING_FEAT_LINKED_FILE
- If this flag is set, then io_uring supports sane assignment of files for
SQEs that have dependencies. For example, if a chain of SQEs are submitted
with IOSQE_IO_LINK, then kernels without this flag will prepare the
file for each link upfront. If a previous link opens a file with a known
index, eg if direct descriptors are used with open or accept, then file
assignment needs to happen post execution of that SQE. If this flag is
set, then the kernel will defer file assignment until execution of a given
request is started. Available since kernel 5.17.
The rest of the fields in the struct io_uring_params are
filled in by the kernel, and provide the information necessary to memory map
the submission queue, completion queue, and the array of submission queue
entries. sq_entries specifies the number of submission queue entries
allocated. sq_off describes the offsets of various ring buffer
fields:
struct io_sqring_offsets {
__u32 head;
__u32 tail;
__u32 ring_mask;
__u32 ring_entries;
__u32 flags;
__u32 dropped;
__u32 array;
__u32 resv[3];
};
Taken together, sq_entries and sq_off provide all of
the information necessary for accessing the submission queue ring buffer and
the submission queue entry array. The submission queue can be mapped with a
call like:
ptr = mmap(0, sq_off.array + sq_entries * sizeof(__u32),
PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
ring_fd, IORING_OFF_SQ_RING);
where sq_off is the io_sqring_offsets structure, and
ring_fd is the file descriptor returned from
io_uring_setup(2). The addition of sq_off.array to the length
of the region accounts for the fact that the ring is located at the end of
the data structure. As an example, the ring buffer head pointer can be accessed
by adding sq_off.head to the address returned from
mmap(2):
head = ptr + sq_off.head;
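The remaining submission queue ring fields can be resolved from the
mapping in the same way; the sring_* names below are illustrative only:
unsigned *sring_tail  = (unsigned *)((char *) ptr + sq_off.tail);
unsigned *sring_mask  = (unsigned *)((char *) ptr + sq_off.ring_mask);
unsigned *sring_flags = (unsigned *)((char *) ptr + sq_off.flags);
unsigned *sring_array = (unsigned *)((char *) ptr + sq_off.array);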
The flags field is used by the kernel to communicate state
information to the application. Currently, it is used to inform the
application when a call to io_uring_enter(2) is necessary. See the
documentation for the IORING_SETUP_SQPOLL flag above. The
dropped member is incremented for each invalid submission queue entry
encountered in the ring buffer.
The head and tail track the ring buffer state. The tail is
incremented by the application when submitting new I/O, and the head is
incremented by the kernel when the I/O has been successfully submitted.
Determining the index of the head or tail into the ring is accomplished by
applying a mask:
index = tail & ring_mask;
The array of submission queue entries is mapped with:
sqes = mmap(0, sq_entries * sizeof(struct io_uring_sqe),
PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
ring_fd, IORING_OFF_SQES);
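Putting the pieces together, the following sketch queues a single NOP
request using the illustrative sring_* pointers from above and the
sqes array just mapped; the release store publishes the SQE contents
before the kernel can observe the new tail (C11 atomics are assumed):
unsigned tail  = *sring_tail;
unsigned index = tail & *sring_mask;
struct io_uring_sqe *sqe = &sqes[index];

memset(sqe, 0, sizeof(*sqe));
sqe->opcode = IORING_OP_NOP;         /* no-op request for illustration */
sring_array[index] = index;          /* map the ring slot to the SQE */

/* Publish the new tail with release semantics */
atomic_store_explicit((_Atomic unsigned *) sring_tail, tail + 1,
                      memory_order_release);

/* Submit one SQE and wait for its completion */
syscall(__NR_io_uring_enter, ring_fd, 1, 1,
        IORING_ENTER_GETEVENTS, NULL, 0);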
The completion queue is described by cq_entries and
cq_off shown here:
struct io_cqring_offsets {
__u32 head;
__u32 tail;
__u32 ring_mask;
__u32 ring_entries;
__u32 overflow;
__u32 cqes;
__u32 flags;
__u32 resv[3];
};
The completion queue is simpler, since the entries are not
separated from the queue itself, and can be mapped with:
ptr = mmap(0, cq_off.cqes + cq_entries * sizeof(struct io_uring_cqe),
PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, ring_fd,
IORING_OFF_CQ_RING);
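Completions can then be reaped by walking the ring between head and tail;
as above, the cring_* names are illustrative and are resolved from
cq_off:
unsigned *cring_head = (unsigned *)((char *) ptr + cq_off.head);
unsigned *cring_tail = (unsigned *)((char *) ptr + cq_off.tail);
unsigned *cring_mask = (unsigned *)((char *) ptr + cq_off.ring_mask);
struct io_uring_cqe *cqes =
        (struct io_uring_cqe *)((char *) ptr + cq_off.cqes);

unsigned head = *cring_head;

/* Acquire-load the tail so CQE contents are visible before use */
while (head != atomic_load_explicit((_Atomic unsigned *) cring_tail,
                                    memory_order_acquire)) {
    struct io_uring_cqe *cqe = &cqes[head & *cring_mask];
    /* cqe->res holds the result, cqe->user_data identifies the request */
    head++;
}

/* Publish the new head so the kernel can reuse consumed entries */
atomic_store_explicit((_Atomic unsigned *) cring_head, head,
                      memory_order_release);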
Closing the file descriptor returned by io_uring_setup(2)
will free all resources associated with the io_uring context.