Original Date Published: March 8, 2022
This blog post covers io_uring, a new Linux kernel system call interface, and how I exploited it for local privilege escalation (LPE).
A breakdown of the topics and questions discussed:
- What is io_uring? Why is it used?
- […] the io_uring codebase for tools to construct exploit primitives.

Like my last post, I had no knowledge of io_uring when starting this project. This blog post will document the journey of tackling an unfamiliar part of the Linux kernel and ending up with a working exploit. My hope is that it will be useful to those interested in binary exploitation or kernel hacking and that it will demystify the process. I also break down the different challenges I faced as an exploit developer and evaluate the practical effect of current exploit mitigations.
Put simply, io_uring is a system call interface for Linux. It was first introduced in upstream Linux kernel version 5.1 in 2019 [1]. It enables an application to initiate system calls that can be performed asynchronously. Initially, io_uring just supported simple I/O system calls like read() and write(), but support for more is growing rapidly. It may eventually support most system calls [5].
The motivation behind io_uring is performance. Although it is still relatively new, its performance has improved quickly over time. Just last month, the creator and lead developer Jens Axboe boasted 13M per-core peak IOPS [2]. There are a few key design elements of io_uring that reduce overhead and boost performance.
With io_uring, system calls can be completed asynchronously. This means an application thread does not have to block while waiting for the kernel to complete the system call. It can simply submit a request for a system call and retrieve the results later; no time is wasted by blocking.
Additionally, batches of system call requests can be submitted all at once. A task that would normally require multiple system calls can be reduced to just one. There is even a new feature that can reduce the number of system calls to zero [7]. This vastly reduces the number of context switches from user space to kernel and back. Each context switch adds overhead, so reducing them improves performance.
In io_uring, the bulk of the communication between the user space application and the kernel is done via shared buffers. This removes a large amount of overhead when performing system calls that transfer data between kernel and user space. For this reason, io_uring can be a zero-copy system [4].
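The shared buffers take the form of ring buffers: one side advances a tail index as it produces entries, and the other side advances a head index as it consumes them. The following is a simplified user space model of that idea, not the kernel's actual data layout. The struct and function names (spsc_ring, ring_push, ring_pop) and the ring size are assumptions made for illustration, and both "sides" live in one process here rather than in memory mapped into user space and the kernel.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u /* illustrative size; must be a power of two */

/* Simplified single-producer, single-consumer ring (a model only). */
struct spsc_ring {
    _Atomic uint32_t head; /* advanced only by the consumer */
    _Atomic uint32_t tail; /* advanced only by the producer */
    uint64_t slots[RING_ENTRIES];
};

/* Producer side: returns false if the ring is full. */
bool ring_push(struct spsc_ring *r, uint64_t value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_ENTRIES)
        return false;

    r->slots[tail & (RING_ENTRIES - 1)] = value;
    /* Publish the slot contents before the new tail becomes visible. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty. */
bool ring_pop(struct spsc_ring *r, uint64_t *value)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head == tail)
        return false;

    *value = r->slots[head & (RING_ENTRIES - 1)];
    /* Release the slot back to the producer. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Because each index has exactly one writer, neither side needs a lock or a system call just to exchange entries; the indices are free-running counters that are masked only when used to address a slot.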
There is also a feature for “fixed” files that can improve performance. Before a read or write operation can occur with a file descriptor, the kernel must take a reference to the file. Because taking the reference is an atomic operation, it adds overhead [6]. With a fixed file, this reference is held open, eliminating the need to take the reference for every operation.
The overhead of blocking, context switches, or copying bytes may not be noticeable for most cases, but in high performance applications it can start to matter [8]. It is also worth noting that system call performance has regressed after workaround patches for Spectre and Meltdown, so reducing system calls can be an important optimization [9].
As noted above, high performance applications can benefit from using io_uring. It can be particularly useful for applications that are server/backend related, where a significant proportion of the application time is spent waiting on I/O.
Initially, I intended to use io_uring by making io_uring system calls directly (similar to what I did for eBPF). This is a pretty arduous endeavor, as io_uring is complex and the user space application is responsible for a lot of the work to get it to function properly. Instead, I did what a real developer would do if they wanted their application to make use of io_uring: use liburing.
liburing is the user space library that provides a simplified API to interface with the io_uring kernel component [10]. It is developed and maintained by the lead developer of io_uring, so it is updated as things change on the kernel side.