I’ve been looking at io_uring
for a while now, and I think that it is a fantastic idea.
At minimum, it allows batching system calls for a performance benefit in I/O-heavy
applications. When used properly, it provides the basis for practical yet efficient
single-threaded concurrent software. However, I don’t believe that it’s discussed
(excitedly) enough, so I’m going to discuss it (excitedly).
An Introduction
In order for a program to do anything beyond arithmetic, it needs to interact with the outside world. On typical systems, this is achieved by interacting with the operating system: specifically, the program requests the OS kernel to perform some task on its behalf, such as writing to a file or reading input from a user. To do so, it executes a system call, wherein the CPU securely transitions into the operating system’s space, from which it can do actually achieve these tasks. During this transition, the program can communicate the task to perform; once the OS is ready, it will execute the task. Note that multiple tasks cannot be communicated at once; but historically, this was not a problem, since most tasks didn’t take that long anyways (tasks which were known to take very long often accumulated ad-hoc systems by which work can be done in the background, and progress reported back to the program; an example of this is non-blocking socket I/O).
io_uring
depends upon the same system call interface, but changes how tasks are
communicated to the OS. It uses two queues: the submission queue, where the program
places tasks to execute, and the completion queue, where the kernel communicates the
results of executing submitted tasks. Typically, the program fills up the submission
queue with as many tasks as it has, then hands them over to the OS with a single system
call. These queues are already registered with the OS and so it can read from them
directly. The OS can execute the submitted tasks in any order, and as each one is
completed, add the task’s result back to the completion queue. There are lots of
additional features and minute details that I’m ignoring here, but this is a basic
conceptual description of io_uring
(the “ring” in the name comes from the fact that
these queues are implemented as ring buffers).
At this point (Linux 5.15, to be exact), a large number of system calls can be executed
via io_uring
(it doesn’t support everything, because the way that tasks are
represented between system calls and io_uring
are different, and because not all tasks
can benefit from io_uring
). It’s quite feasible to rely upon it to achieve almost all
of the work a program does.
There are a large number of benefits to using io_uring
. There’s significantly less
overhead (aside from a little additional setup at the beginning of the program), as the
CPU takes a long time to perform each system call (I don’t have exact figures, but
“thousands of cycles” seems to be the ballpark). It allows programs to function more
efficiently in general: they don’t have to wait for each external task to finish before
continuing execution (although multithreading does make this a non-issue, it is its own
can of worms that developers would reasonably not want to deal with). And, as I’ll
discuss in a second, it fits in very nicely with concurrency.
Concurrency
Concurrency and parallelism both solve the same problem: that multiple independent tasks (computational or otherwise) need to happen simultaneously. Parallelism allows them to literally occur at the same time, whereas concurrency breaks them up into smaller units of work, then interleaves them, so that all of them are happening at once, even if only one of them is executing at any particular moment in time. It’s interesting to note that almost all running systems have more threads (parallel tasks) than they have CPUs, so the OS resorts to interleaving them anyways. Parallelism also comes with additional overhead (in the form of thread-safe communication primitives like atomics and mutexes, which are far slower than direct memory access) that concurrency does not require.
Using concurrency, a program can request a time-consuming task to run in the background, and can continue performing other work, even when limited to a single thread of execution. In most cases, the task will be performed whenever other parts of the program are busy waiting upon some event to occur (which is extremely common), and so can happen in the “background” from the perspective of those parts. If the task’s caller requires its result, then it will simply transition into waiting for the task to finish; whenever the task does finish, the caller will be automatically resumed. Until then, other parts of the program will execute.
io_uring
works extremely well with concurrency: in particular, with a concurrent
event loop. Essentially, a program using io_uring
this way performs I/O tasks by
simply adding to the io_uring
submission queue, and occasionally calling the
io_uring
system call so that the kernel knows about all the new tasks. It will
periodically check the completion queue, match completions to the I/O tasks they
represent, and mark those tasks as finished, so that any callers waiting upon them are
resumed. This is extremely efficient because the program will continue doing as much
work as it possibly can until it is waiting upon (any number of) I/O tasks, at which
point it will simply suspend execution and let the OS do its thing. This is as
theoretically-perfect a model as it gets.
Fin
I think that io_uring
is awesome because it makes it practical for everyday programs
to be much more efficient. It should help facilitate the growing movement for
concurrency (see Go, Zig, Rust for programming languages in this area) and it is a
better way for things to work.
I’m probably going to write more than a few programs using io_uring
, across a variety
of languages. I’m planning to start by rewriting a coreutils program using io_uring
in C.
Links
It’s hard to find good reference for io_uring
; there is the liburing
C wrapper
library, which provides a nicer API, a few good man pages, and a PDF; I assume there is
some good documentation within the Linux kernel source code itself; and there will be a
few other useful blog posts and the like. I think that the man pages (particularly
io_uring_setup
and io_uring_enter
) are the most important, as they basically provide
enough information to write liburing
from scratch. It would also be a good idea to
find example programs using io_uring
.