[libc][GSoC 2025] Direct I/O from the GPU with io_uring

Description: Modern GPUs are capable of unified addressing with the host. We currently use this to provide I/O support through the RPC interface. However, that requires a dedicated user thread on the CPU to run the server code. We want to explore an alternative way to provide I/O from the GPU using the Linux io_uring interface.

This interface is a ring buffer designed to accelerate syscalls. Crucially, it also provides a polling mode that allows the kernel to drain the ring buffer without the user initiating a system call. We should be able to register the mmap()'d ring memory with the GPU using AMD and NVIDIA API calls. This should let us implement a rudimentary read / write interface on the GPU that behaves like the corresponding syscalls on the CPU, which can then be used to build a full file interface.
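As a rough illustration of the intended flow, here is a sketch (illustrative only; sq_view and enqueue_pwrite are made-up names, and it assumes the ring was created with IORING_SETUP_SQPOLL and that the SQ ring and SQE array have already been mmap()'d somewhere both the CPU and the GPU can address):

```cpp
#include <linux/io_uring.h>
#include <cstdint>
#include <cstring>

// A minimal view of the submission queue, filled in from the offsets that
// io_uring_setup(2) reports and the mmap()'d ring memory.
struct sq_view {
  unsigned *khead;            // kernel-updated head (sq_off.head)
  unsigned *ktail;            // user-updated tail  (sq_off.tail)
  unsigned *kflags;           // sq_off.flags, reports IORING_SQ_NEED_WAKEUP
  unsigned *array;            // index array into the SQE slab (sq_off.array)
  unsigned ring_mask;         // sq_entries - 1
  struct io_uring_sqe *sqes;  // mmap()'d at IORING_OFF_SQES
};

// Queue a pwrite(2)-like request. With IORING_SETUP_SQPOLL the kernel poll
// thread consumes the entry on its own, so the submitter never issues a
// syscall -- which is what would make this usable from the GPU.
bool enqueue_pwrite(sq_view &sq, int fd, const void *buf, unsigned len,
                    __u64 offset, __u64 user_data) {
  unsigned head = __atomic_load_n(sq.khead, __ATOMIC_ACQUIRE);
  unsigned tail = *sq.ktail;
  if (tail - head > sq.ring_mask)
    return false;                        // ring is full

  unsigned idx = tail & sq.ring_mask;
  struct io_uring_sqe *sqe = &sq.sqes[idx];
  std::memset(sqe, 0, sizeof(*sqe));
  sqe->opcode = IORING_OP_WRITE;         // pwrite-style op, available since 5.6
  sqe->fd = fd;
  sqe->addr = (__u64)(uintptr_t)buf;     // buffer must be host-visible memory
  sqe->len = len;
  sqe->off = offset;
  sqe->user_data = user_data;

  sq.array[idx] = idx;
  __atomic_store_n(sq.ktail, tail + 1, __ATOMIC_RELEASE);
  return true;
}
```

The catch is the IORING_SQ_NEED_WAKEUP flag: once the kernel poll thread goes idle, someone still has to call io_uring_enter(2) to wake it up, which the GPU cannot do on its own.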

Expected Results:

  • An implementation of pwrite and pread that runs on the GPU.
  • Support for printf by forwarding snprintf into pwrite (see the sketch after this list).
  • If time permits, exploring GPU file APIs.
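As a sketch of the printf item above (hypothetical: gpu_pwrite is an assumed interface rather than an existing libc symbol, and offset handling for a stream descriptor is glossed over):

```cpp
#include <cstdarg>
#include <cstdio>
#include <sys/types.h>

// Assumed pwrite-style primitive provided by the GPU io_uring path.
extern "C" ssize_t gpu_pwrite(int fd, const void *buf, size_t len, off_t off);

// printf formatted locally with vsnprintf, then forwarded as raw bytes
// through the pwrite-style interface.
int gpu_printf(const char *fmt, ...) {
  char buf[256];                        // fixed scratch; a real version would grow
  va_list ap;
  va_start(ap, fmt);
  int len = vsnprintf(buf, sizeof(buf), fmt, ap);
  va_end(ap);
  if (len < 0)
    return len;
  // fd 1 as stdout; a real implementation would track the file offset.
  return (int)gpu_pwrite(1, buf, (size_t)len, 0);
}
```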

Project Size: Small

Requirements: Basic C & C++ skills + access to a GPU (AMD highly preferred, most likely to work), Linux kernel knowledge, GPU knowledge

Difficulty: Hard

Confirmed Mentors: Joseph Huber, Tian Shilei

5 Likes

I would like to work on this. The project is interesting, and I/O and file management are related to my PhD research.

1 Like

I would like to participate in this project. I’ll be tidying up some gaps in my knowledge so that I’m ready when the mentee application period begins.

I’m unsure whether the implementation will be done with the raw Linux API or with its C library wrapper, liburing.

1 Like

The intention is to go through the raw Linux API, both to keep dependencies low and because we will likely need low-level access.
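Concretely, the entry points are just thin wrappers over the raw syscalls, roughly like this (a sketch; glibc does not export io_uring wrappers, so everything goes through syscall(2)):

```cpp
#include <linux/io_uring.h>
#include <sys/syscall.h>
#include <unistd.h>

// Create an io_uring instance; the kernel fills in the ring offsets in *p.
static int sys_io_uring_setup(unsigned entries, struct io_uring_params *p) {
  return (int)syscall(__NR_io_uring_setup, entries, p);
}

// Submit and/or wait for completions on an existing ring.
static int sys_io_uring_enter(int ring_fd, unsigned to_submit,
                              unsigned min_complete, unsigned flags) {
  return (int)syscall(__NR_io_uring_enter, ring_fd, to_submit, min_complete,
                      flags, nullptr, 0);
}
```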

Hi everyone!

I’m really interested in this project and would love to contribute. I’m a Master’s student in Computer Engineering at TU Delft, with a background in AI acceleration, GPU programming, and system optimization. My research focuses on optimizing LLM inference for resource-constrained hardware, and I have experience working with CUDA, GPU memory management, and low-level optimizations.

I’d love to know what would be a good starting point for getting up to speed. Are there any specific papers, existing implementations, or related issues that would help me prepare?

Also, I noticed that the description mentions AMD GPUs as highly preferred. Would it be possible to experiment with NVIDIA GPUs as well, or is AMD required due to specific hardware support?

Looking forward to contributing and learning more about this project!

It’s preferred because support for unified memory on AMD GPUs is better defined. The unified addressing in the CUDA driver API doesn’t let you directly map existing mmap() pointers, AFAIK.
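For reference, the HIP side looks roughly like this (the calls are real HIP API; whether they can be applied to io_uring's kernel-allocated ring pages is exactly what the project would need to verify):

```cpp
#include <hip/hip_runtime.h>

// Register an existing host mapping (e.g. returned by mmap) so the GPU can
// access it, and fetch the corresponding device-visible pointer.
void *map_for_gpu(void *host_ptr, size_t bytes) {
  if (hipHostRegister(host_ptr, bytes, hipHostRegisterMapped) != hipSuccess)
    return nullptr;
  void *dev_ptr = nullptr;
  // With unified addressing this is typically the same address as host_ptr.
  if (hipHostGetDevicePointer(&dev_ptr, host_ptr, 0) != hipSuccess)
    return nullptr;
  return dev_ptr;
}
```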

If you want to look into this yourself, you could get started by working with the existing io_uring syscalls on the host CPU. Just keep in mind that there are other people interested in this project, so you might want to wait until the final decisions are made.
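A minimal host-only starting point could look like this (a sketch with error handling omitted: create a ring, map the queues, submit a single no-op, and reap its completion):

```cpp
#include <linux/io_uring.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
  struct io_uring_params p;
  std::memset(&p, 0, sizeof(p));
  int fd = (int)syscall(__NR_io_uring_setup, 8, &p);

  // Map the SQ ring, CQ ring, and SQE array as described in io_uring_setup(2).
  size_t sq_sz = p.sq_off.array + p.sq_entries * sizeof(unsigned);
  size_t cq_sz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);
  char *sq = (char *)mmap(nullptr, sq_sz, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQ_RING);
  char *cq = (char *)mmap(nullptr, cq_sz, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_CQ_RING);
  auto *sqes = (struct io_uring_sqe *)mmap(
      nullptr, p.sq_entries * sizeof(struct io_uring_sqe),
      PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQES);

  // Fill one no-op SQE at slot 0 and publish it (fresh ring, so tail == 0).
  std::memset(&sqes[0], 0, sizeof(sqes[0]));
  sqes[0].opcode = IORING_OP_NOP;
  sqes[0].user_data = 42;
  ((unsigned *)(sq + p.sq_off.array))[0] = 0;
  unsigned *tail = (unsigned *)(sq + p.sq_off.tail);
  __atomic_store_n(tail, *tail + 1, __ATOMIC_RELEASE);

  // Submit it and wait for one completion.
  syscall(__NR_io_uring_enter, fd, 1, 1, IORING_ENTER_GETEVENTS, nullptr, 0);
  auto *cqes = (struct io_uring_cqe *)(cq + p.cq_off.cqes);
  std::printf("user_data=%llu res=%d\n",
              (unsigned long long)cqes[0].user_data, cqes[0].res);
  return 0;
}
```

From there the next experiment would be enabling IORING_SETUP_SQPOLL and checking that submissions are consumed without the io_uring_enter call.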

1 Like

Hi everyone, my name is Yuhang, and I am interested in this work. I am a PhD student, and I have conducted research related to SSDs and io_uring. I worked on accelerating on-disk GNN sampling using io_uring, and I see great potential in this library. I understand the importance of improving disk-to-GPU read throughput for out-of-core training and inference, and I see this as a great opportunity to work in a similar field again.

I’m really interested in this project and would love to contribute. I have some experience with systems programming and working with GPUs, and I’m eager to learn more. Could you please point me to any resources or starting points that would help me get involved?

Sorry, the submission period is over.

Is there a link to the commit? I’m curious how this project was implemented.

It hasn’t started yet; it’s just that the application window is over and we’re now making the selections. It’ll hopefully be done by the fall.

Bad news: there was a clerical error during the project allocation phase that left this one out. There’s no way to reverse it, even though it wasn’t on my end, so this project isn’t happening as a GSoC project.

I was wondering about something. I’d guess that an implementation of pread and pwrite for GPUs using io_uring might be fairly straightforward (I’m speculating, not sure).
The issue I see is the kernel thread in io_uring: as far as I understand, the kernel uses a single CPU thread to service the requests in the queue, and this could become a bottleneck given the parallel nature of the GPU. Have you had any ideas on how to overcome this limitation?

I know there is no GSoC funding, but I find the idea compelling and would like to work on it.

Thanks

The polling mode, which is what I find interesting for the GPU case, only uses a single kernel thread per ring as far as I’m aware. You could create multiple io_uring instances and partition them across the GPU somehow; I don’t know whether the interconnect is even fast enough to saturate a single kernel thread.
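Purely as a speculative sketch, the host side of that multi-ring idea could look something like this (NUM_RINGS, make_polled_ring, and ring_index_for are made-up names; note that SQPOLL required elevated privileges on older kernels):

```cpp
#include <linux/io_uring.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstring>

constexpr unsigned NUM_RINGS = 4;

// Create one SQPOLL ring whose kernel poll thread is pinned to a given CPU.
int make_polled_ring(unsigned cpu) {
  struct io_uring_params p;
  std::memset(&p, 0, sizeof(p));
  p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
  p.sq_thread_cpu = cpu;        // dedicate one CPU to this poll thread
  p.sq_thread_idle = 2000;      // ms before the poll thread goes to sleep
  return (int)syscall(__NR_io_uring_setup, 64, &p);
}

// Hypothetical GPU-side mapping of a workgroup onto one of the rings so the
// submission traffic is spread over the available kernel poll threads.
inline unsigned ring_index_for(unsigned workgroup_id) {
  return workgroup_id % NUM_RINGS;
}
```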

As for working on this: without the GSoC platform I’m unfortunately not planning on taking on a mentoring role. I was going to take some time to work on this myself once I finish my current project. If you still want to investigate it independently, feel free to do so and ask questions as needed.

1 Like