io_uring is awesome

by tdmm on

I’ve been looking at io_uring for a while now, and I think that it is a fantastic idea. At minimum, it allows batching system calls for a performance benefit in I/O-heavy applications. When used properly, it provides the basis for practical yet efficient single-threaded concurrent software. However, I don’t believe that it’s discussed (excitedly) enough, so I’m going to discuss it (excitedly).

An Introduction

In order for a program to do anything beyond arithmetic, it needs to interact with the outside world. On typical systems, this is achieved by interacting with the operating system: specifically, the program requests the OS kernel to perform some task on its behalf, such as writing to a file or reading input from a user. To do so, it executes a system call, wherein the CPU securely transitions into the operating system’s space, from which it can do actually achieve these tasks. During this transition, the program can communicate the task to perform; once the OS is ready, it will execute the task. Note that multiple tasks cannot be communicated at once; but historically, this was not a problem, since most tasks didn’t take that long anyways (tasks which were known to take very long often accumulated ad-hoc systems by which work can be done in the background, and progress reported back to the program; an example of this is non-blocking socket I/O).

io_uring depends upon the same system call interface, but changes how tasks are communicated to the OS. It uses two queues: the submission queue, where the program places tasks to execute, and the completion queue, where the kernel communicates the results of executing submitted tasks. Typically, the program fills up the submission queue with as many tasks as it has, then hands them over to the OS with a single system call. These queues are already registered with the OS and so it can read from them directly. The OS can execute the submitted tasks in any order, and as each one is completed, add the task’s result back to the completion queue. There are lots of additional features and minute details that I’m ignoring here, but this is a basic conceptual description of io_uring (the “ring” in the name comes from the fact that these queues are implemented as ring buffers).

At this point (Linux 5.15, to be exact), a large number of system calls can be executed via io_uring (it doesn’t support everything, because the way that tasks are represented between system calls and io_uring are different, and because not all tasks can benefit from io_uring). It’s quite feasible to rely upon it to achieve almost all of the work a program does.

There are a large number of benefits to using io_uring. There’s significantly less overhead (aside from a little additional setup at the beginning of the program), as the CPU takes a long time to perform each system call (I don’t have exact figures, but “thousands of cycles” seems to be the ballpark). It allows programs to function more efficiently in general: they don’t have to wait for each external task to finish before continuing execution (although multithreading does make this a non-issue, it is its own can of worms that developers would reasonably not want to deal with). And, as I’ll discuss in a second, it fits in very nicely with concurrency.

Concurrency

Concurrency and parallelism both solve the same problem: that multiple independent tasks (computational or otherwise) need to happen simultaneously. Parallelism allows them to literally occur at the same time, whereas concurrency breaks them up into smaller units of work, then interleaves them, so that all of them are happening at once, even if only one of them is executing at any particular moment in time. It’s interesting to note that almost all running systems have more threads (parallel tasks) than they have CPUs, so the OS resorts to interleaving them anyways. Parallelism also comes with additional overhead (in the form of thread-safe communication primitives like atomics and mutexes, which are far slower than direct memory access) that concurrency does not require.

Using concurrency, a program can request a time-consuming task to run in the background, and can continue performing other work, even when limited to a single thread of execution. In most cases, the task will be performed whenever other parts of the program are busy waiting upon some event to occur (which is extremely common), and so can happen in the “background” from the perspective of those parts. If the task’s caller requires its result, then it will simply transition into waiting for the task to finish; whenever the task does finish, the caller will be automatically resumed. Until then, other parts of the program will execute.

io_uring works extremely well with concurrency: in particular, with a concurrent event loop. Essentially, a program using io_uring this way performs I/O tasks by simply adding to the io_uring submission queue, and occasionally calling the io_uring system call so that the kernel knows about all the new tasks. It will periodically check the completion queue, match completions to the I/O tasks they represent, and mark those tasks as finished, so that any callers waiting upon them are resumed. This is extremely efficient because the program will continue doing as much work as it possibly can until it is waiting upon (any number of) I/O tasks, at which point it will simply suspend execution and let the OS do its thing. This is as theoretically-perfect a model as it gets.

Fin

I think that io_uring is awesome because it makes it practical for everyday programs to be much more efficient. It should help facilitate the growing movement for concurrency (see Go, Zig, Rust for programming languages in this area) and it is a better way for things to work.

I’m probably going to write more than a few programs using io_uring, across a variety of languages. I’m planning to start by rewriting a coreutils program using io_uring in C.

Links

It’s hard to find good reference for io_uring; there is the liburing C wrapper library, which provides a nicer API, a few good man pages, and a PDF; I assume there is some good documentation within the Linux kernel source code itself; and there will be a few other useful blog posts and the like. I think that the man pages (particularly io_uring_setup and io_uring_enter) are the most important, as they basically provide enough information to write liburing from scratch. It would also be a good idea to find example programs using io_uring.