Floe Blog

After 30+ years, "Is Linux disk I/O finally fast enough?"

Written by Brini | Mar 26, 2026 2:20:11 PM

Linux has come a long way in getting the best performance out of SSDs. I recently spent a good deal of time benchmarking Linux's io_uring against a driver I wrote 11 years ago. Although Linux has improved, my driver still outperforms it in a number of areas; the gap, however, has narrowed enough that it's realistically no longer worth maintaining my own code.

Still, it's a shame that despite so much work and funding from massive corporations, the world's most popular kernel still can't easily get the best out of SSDs.

Why did I write a driver?

Ten years ago, we were building software to process small blocks of data directly from disk, without a buffer cache, as efficiently as possible. We were using a server with 8 CPU cores (Haswell/Broadwell) and 8 fast NVMe SSDs. At the time, that meant driving millions of IOPS (I/Os per second) in aggregate, for 8KB-32KB disk blocks, on just those 8 cores. The problem is still very relevant today because CPU core counts and drive IOPS have both increased by an order of magnitude.

The Linux I/O subsystem at the time wasn't up to the job. To submit a read request, the process performed a system call (syscall) that crossed from the user process into the kernel, then traversed a complex block layer and driver stack that suffered from lock contention under concurrency. When the SSD completed the read, it raised interrupts, again switching the CPU out of the user process to deliver the I/O. Even accessing the raw device driver using O_DIRECT would cost around 10,000-20,000 CPU cycles per I/O. Jens Axboe, maintainer of the Linux block layer, whom I had worked with for several years prior, noted that one whole CPU core would be fully consumed achieving 800,000 IOPS. Linux's AIO interface was never stable and didn't help much with this.

For my application, which could drive around 2.4 million IOPS, almost 40% of the CPU would be consumed just doing I/O and thus unavailable for processing data; the processing itself also became inefficient due to all the context switching and cache thrashing taking place.

I didn't have the cycles to devote months or years to optimizing the Linux I/O subsystem, so the easiest way to get more performance was to bypass it entirely. I could map the NVMe device into my user process and provide memory-based command submission and completion queues. By pinning all of the memory and walking the page tables, PCI DMA could perform I/O to and from anywhere in my process with no intermediate buffers or data copies required. I polled for I/O completions periodically, or more aggressively when the CPU was idle. Linux was out of the picture, and the cost of an I/O dropped to roughly 300 CPU cycles, an improvement of almost two orders of magnitude.

2026 state-of-the-art

Wind the clock forward to 2026. Linux gained io_uring, which promised faster, asynchronous I/O with fewer interrupts and far lower overhead. It wasn't stable in 2019-2020 when we last looked at it, but what about now: can it deliver? My worry was that although system calls have gotten faster on modern CPUs, and the I/O code paths and contention have been addressed, those gains might be offset by the additional overhead required for security mitigations.

Can I now throw away the 11-year-old driver I hacked together in a week? Let's find out!

Approach

I started by implementing a new driver for my userspace kernel that uses io_uring instead of driving NVMe directly. That way I could run our existing benchmark suites and compare the two drivers. Once I had shaken out a few stability issues in the implementation, I could quickly see that io_uring's performance was not too shabby, though the submit latencies were rather high.

Results and further experiments

These benchmarks were executed on a single-socket AMD "Milan" server with 64 cores and with one or four Samsung PM9A3 SSDs.

Single SSD Bandwidth

As expected, my userspace driver (blue) performs extremely well at small block sizes.

  • We get close to maximizing bandwidth at a 16KB block size.

  • At 32KB and above we can completely saturate the drives.

By comparison, io_uring in its default configuration (red):

  • It only maximizes bandwidth at 128KB blocks and above.

  • At 32KB, it was delivering under half the throughput of my userspace driver.

  • It is also clear that out of the box, io_uring is maxing out at around a queue depth of 16.

Overall, for small block sizes the io_uring performance is still at least 2x slower than the userspace driver at a given queue depth. 

Tweaking the options

This led me to examine the tweaks that could be applied to the io_uring driver to improve performance, as follows.

First off, the inability to push beyond a queue depth of 16 got me looking at IORING_SETUP_SQPOLL mode. This uses a separate kernel thread to poll the submission queue to spot when userspace has added new commands to the queue.
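With liburing, enabling this mode is a small change to ring setup. The fragment below is a sketch: the queue depth and idle timeout are illustrative values, not the ones used in my benchmarks.

```c
/* Sketch: enabling kernel-side submission-queue polling. The iou-sqp
 * kernel thread busy-polls the SQ, so userspace can enqueue SQEs without
 * an io_uring_enter() syscall per submission. Values are illustrative. */
struct io_uring ring;
struct io_uring_params p;
memset(&p, 0, sizeof(p));
p.flags = IORING_SETUP_SQPOLL;
p.sq_thread_idle = 2000;   /* ms of idleness before the poller sleeps */
io_uring_queue_init_params(256, &ring, &p);
```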

We can see that this (orange) removed the queue-depth bottleneck and brought submit latency down to 1-2µs, on par with the userspace driver, but we still cannot max out the drive until we reach 128KB block sizes. However, this isn't all magic; there is a cost, as noted below.

There are a couple more options of interest that can be used for io_uring. First there is the IORING_SETUP_IOPOLL flag which, when used with a driver that supports I/O polling, can further improve throughput. In practice this only gave me marginal performance improvements so I did not plot it on the graph.

I then considered an option that I'd previously dismissed. I was already using register_files() to reduce FD lookup overhead, but there is another option, register_buffers(), that removes the overhead of mapping the buffers on every I/O. I had dismissed it because some parts of the use case need to perform I/O from anywhere in a 1TB+ memory map, making it impractical to pre-register all the buffers that might be needed. This is perhaps the difference between a general-purpose driver and something more specialised and optimal, where additional constraints simplify data copies and the programming model.

Since there are still some use cases where a fixed buffer pool could be used (e.g. I/O in and out of compression buffers), I also tested this option. The results (green) show that it made a big difference, alongside IORING_SETUP_SQPOLL to get the queue depth up and latency down. Adding IORING_SETUP_IOPOLL on top (gray) gives io_uring performance that is essentially as good as the userspace NVMe driver.

Scalability: Four SSDs

So far these numbers look great, but I/O to one SSD is not much, so I wanted to understand how things scale. By running the same tests with the same parameters but using four drives concurrently, one would hope we get linear scaling.

Here we start to see some gaps:

  • The userspace NVMe driver has scaled pretty linearly from 6.8GB/s to 27.2GB/s.

  • The io_uring driver has plateaued at under half the bandwidth.

  • Only once we get up to 128KB blocks does io_uring manage to catch up.

  • Out of the box, io_uring spun up 4 SQPOLL threads in the kernel, but changing things to share a single thread did not help.

  • Needless to say, having four threads polling full-time was eating much more CPU than my userspace driver.

Latency

The typical submit latency of my userspace driver was measured at 1-2µs. For io_uring in default mode, typical submit latency is much higher, measured at 13µs: certainly worse than my driver, but not as bad as I expected.

Things are looking up with IORING_SETUP_SQPOLL mode. Latency dropped to 1-2µs, on par with the userspace driver. However, this isn't all magic, as noted below.

Negative impact of IORING_SETUP_SQPOLL mode

Using IORING_SETUP_SQPOLL dedicates an entire kernel iou-sqp-* thread that spins, burning an additional CPU core, in order to achieve the lower latency and improved performance. On machines with high core counts this overhead may be acceptable, but a large amount of potential compute throughput is still being wasted. On machines with a lower ratio of CPU cores to SSD bandwidth, such as an I/O monster with 8 fast SSDs and 16 vCPUs, it will likely be unacceptable.

Conclusion

If all you are doing is large reads and writes with block sizes of 64K+, then io_uring gives you all you need with out-of-the-box APIs. If, however, you expect to perform a lot of smaller I/Os in the 4K-32K range, then even with io_uring you need to tweak your setup and use more of the advanced options that are supported. If you can manage all your I/O with a fixed buffer pool, then register_buffers() is the way to go. The fact that it did not scale well to multiple SSDs means performance is still being left on the table.

It may be possible to further improve the io_uring results by batching requests etc. but then you start coupling more of your application to the I/O framework to get the desired performance. Sometimes it is nice to know that even just basic I/O submission is fast.

So what does this mean for Floe?

We have two different I/O patterns to be concerned about. The first is larger random I/Os throughout a large memory map; here the io_uring approach gives us reasonable performance, and more if we dedicate some kernel polling threads, but we can't saturate the drives.

The second case demands a lot of small I/Os, but these typically involve decompressing from I/O buffers to elsewhere in memory, so utilising register_buffers() is a feasible way to get the performance we require. It will still mean dedicating extra cores to submission polling, along with further investigation to see whether we can improve things further.

It's sad to let go of code you've used for many years, and that can still hold its own against the intense development that has taken place in the Linux kernel over the last decade, but the gap has now closed enough that the simplicity of an off-the-shelf implementation over a custom one is worth it for our application.

What's your take?

Given the performance you can now get out of io_uring, are there any remaining use cases for userspace drivers, either home-grown or something like Intel's SPDK?

Have you struggled to max out multiple drives? Can more performance be extracted from io_uring without kernel-side submission polling? I wonder whether I'm still missing something here.