[libc][GSoC 2025] Direct I/O from the GPU with io_uring

Description: Modern GPUs are capable of unified addressing with the host. We currently use this to provide I/O support through the RPC interface. However, that requires a dedicated user thread on the CPU to run the server code. We want to explore an alternative way to provide I/O from the GPU using the Linux io_uring interface.

This interface is a ring buffer designed to accelerate syscalls. Crucially, it also provides a polling mode that allows the kernel to drain the ring buffer without the user initiating a system call. We should be able to register the mmap()'d ring memory with the GPU using AMD and NVIDIA API calls. This should let us implement a rudimentary read / write interface on the GPU that behaves like the corresponding syscalls on the CPU, which can then be used to build a full file interface.
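As a rough illustration of the intended flow, here is a sketch (illustrative only; sq_view and enqueue_pwrite are made-up names, and it assumes the ring was created with IORING_SETUP_SQPOLL and that the SQ ring and SQE array have already been mmap()'d somewhere both the CPU and the GPU can address):

```cpp
#include <linux/io_uring.h>
#include <cstdint>
#include <cstring>

// A minimal view of the submission queue, filled in from the offsets that
// io_uring_setup(2) reports and the mmap()'d ring memory.
struct sq_view {
  unsigned *khead;            // kernel-updated head (sq_off.head)
  unsigned *ktail;            // user-updated tail  (sq_off.tail)
  unsigned *kflags;           // sq_off.flags, reports IORING_SQ_NEED_WAKEUP
  unsigned *array;            // index array into the SQE slab (sq_off.array)
  unsigned ring_mask;         // sq_entries - 1
  struct io_uring_sqe *sqes;  // mmap()'d at IORING_OFF_SQES
};

// Queue a pwrite(2)-like request. With IORING_SETUP_SQPOLL the kernel poll
// thread consumes the entry on its own, so the submitter never issues a
// syscall -- which is what would make this usable from the GPU.
bool enqueue_pwrite(sq_view &sq, int fd, const void *buf, unsigned len,
                    __u64 offset, __u64 user_data) {
  unsigned head = __atomic_load_n(sq.khead, __ATOMIC_ACQUIRE);
  unsigned tail = *sq.ktail;
  if (tail - head > sq.ring_mask)
    return false;                        // ring is full

  unsigned idx = tail & sq.ring_mask;
  struct io_uring_sqe *sqe = &sq.sqes[idx];
  std::memset(sqe, 0, sizeof(*sqe));
  sqe->opcode = IORING_OP_WRITE;         // pwrite-style op, available since 5.6
  sqe->fd = fd;
  sqe->addr = (__u64)(uintptr_t)buf;     // buffer must be host-visible memory
  sqe->len = len;
  sqe->off = offset;
  sqe->user_data = user_data;

  sq.array[idx] = idx;
  __atomic_store_n(sq.ktail, tail + 1, __ATOMIC_RELEASE);
  return true;
}
```

The catch is the IORING_SQ_NEED_WAKEUP flag: once the kernel poll thread goes idle, someone still has to call io_uring_enter(2) to wake it up, which the GPU cannot do on its own.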

Expected Results:

  • An implementation of pwrite and pread that runs on the GPU.
  • Support for printf by forwarding snprintf into pwrite (see the sketch after this list).
  • If time permits, exploring GPU file APIs.
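As a sketch of the printf item above (hypothetical: gpu_pwrite is an assumed interface rather than an existing libc symbol, and offset handling for a stream descriptor is glossed over):

```cpp
#include <cstdarg>
#include <cstdio>
#include <sys/types.h>

// Assumed pwrite-style primitive provided by the GPU io_uring path.
extern "C" ssize_t gpu_pwrite(int fd, const void *buf, size_t len, off_t off);

// printf formatted locally with vsnprintf, then forwarded as raw bytes
// through the pwrite-style interface.
int gpu_printf(const char *fmt, ...) {
  char buf[256];                        // fixed scratch; a real version would grow
  va_list ap;
  va_start(ap, fmt);
  int len = vsnprintf(buf, sizeof(buf), fmt, ap);
  va_end(ap);
  if (len < 0)
    return len;
  // fd 1 as stdout; a real implementation would track the file offset.
  return (int)gpu_pwrite(1, buf, (size_t)len, 0);
}
```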

Project Size: Small

Requirements: Basic C & C++ skills + access to a GPU (AMD highly preferred, most likely to work), Linux kernel knowledge, GPU knowledge

Difficulty: Hard

Confirmed Mentors: Joseph Huber, Tian Shilei

5 Likes

I would like to work on this. The project is interesting, and I/O and file management are related to my PhD research.

1 Like

I would like to participate in this project. I’ll be tidying up some gaps in my knowledge so that I’m ready when the mentee application period begins.

I’m unsure whether the implementation will be done with the raw Linux API or with its C library wrapper, liburing.

1 Like

The intention is to go through the raw Linux API, both to keep dependencies low and because we will likely need low-level access.
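Concretely, the entry points are just thin wrappers over the raw syscalls, roughly like this (a sketch; glibc does not export io_uring wrappers, so everything goes through syscall(2)):

```cpp
#include <linux/io_uring.h>
#include <sys/syscall.h>
#include <unistd.h>

// Create an io_uring instance; the kernel fills in the ring offsets in *p.
static int sys_io_uring_setup(unsigned entries, struct io_uring_params *p) {
  return (int)syscall(__NR_io_uring_setup, entries, p);
}

// Submit and/or wait for completions on an existing ring.
static int sys_io_uring_enter(int ring_fd, unsigned to_submit,
                              unsigned min_complete, unsigned flags) {
  return (int)syscall(__NR_io_uring_enter, ring_fd, to_submit, min_complete,
                      flags, nullptr, 0);
}
```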

Hi everyone!

I’m really interested in this project and would love to contribute. I’m a Master’s student in Computer Engineering at TU Delft, with a background in AI acceleration, GPU programming, and system optimization. My research focuses on optimizing LLM inference for resource-constrained hardware, and I have experience working with CUDA, GPU memory management, and low-level optimizations.

I’d love to know what would be a good starting point for getting up to speed. Are there any specific papers, existing implementations, or related issues that would help me prepare?

Also, I noticed that the description mentions AMD GPUs as highly preferred. Would it be possible to experiment with NVIDIA GPUs as well, or is AMD required due to specific hardware support?

Looking forward to contributing and learning more about this project!

It’s preferred because support for unified memory on AMD GPUs is better defined. The unified addressing in the CUDA driver API doesn’t let you directly map existing mmap() pointers, AFAIK.
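For reference, the HIP side looks roughly like this (the calls are real HIP API; whether they can be applied to io_uring's kernel-allocated ring pages is exactly what the project would need to verify):

```cpp
#include <hip/hip_runtime.h>

// Register an existing host mapping (e.g. returned by mmap) so the GPU can
// access it, and fetch the corresponding device-visible pointer.
void *map_for_gpu(void *host_ptr, size_t bytes) {
  if (hipHostRegister(host_ptr, bytes, hipHostRegisterMapped) != hipSuccess)
    return nullptr;
  void *dev_ptr = nullptr;
  // With unified addressing this is typically the same address as host_ptr.
  if (hipHostGetDevicePointer(&dev_ptr, host_ptr, 0) != hipSuccess)
    return nullptr;
  return dev_ptr;
}
```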

If you want to look into this yourself, you could get started by working with the existing io_uring syscalls on the host CPU. Just keep in mind that there are other people interested in this project, so you might want to wait until the final decisions are made.
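A minimal host-only starting point could look like this (a sketch with error handling omitted: create a ring, map the queues, submit a single no-op, and reap its completion):

```cpp
#include <linux/io_uring.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
  struct io_uring_params p;
  std::memset(&p, 0, sizeof(p));
  int fd = (int)syscall(__NR_io_uring_setup, 8, &p);

  // Map the SQ ring, CQ ring, and SQE array as described in io_uring_setup(2).
  size_t sq_sz = p.sq_off.array + p.sq_entries * sizeof(unsigned);
  size_t cq_sz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);
  char *sq = (char *)mmap(nullptr, sq_sz, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQ_RING);
  char *cq = (char *)mmap(nullptr, cq_sz, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_CQ_RING);
  auto *sqes = (struct io_uring_sqe *)mmap(
      nullptr, p.sq_entries * sizeof(struct io_uring_sqe),
      PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQES);

  // Fill one no-op SQE at slot 0 and publish it (fresh ring, so tail == 0).
  std::memset(&sqes[0], 0, sizeof(sqes[0]));
  sqes[0].opcode = IORING_OP_NOP;
  sqes[0].user_data = 42;
  ((unsigned *)(sq + p.sq_off.array))[0] = 0;
  unsigned *tail = (unsigned *)(sq + p.sq_off.tail);
  __atomic_store_n(tail, *tail + 1, __ATOMIC_RELEASE);

  // Submit it and wait for one completion.
  syscall(__NR_io_uring_enter, fd, 1, 1, IORING_ENTER_GETEVENTS, nullptr, 0);
  auto *cqes = (struct io_uring_cqe *)(cq + p.cq_off.cqes);
  std::printf("user_data=%llu res=%d\n",
              (unsigned long long)cqes[0].user_data, cqes[0].res);
  return 0;
}
```

From there the next experiment would be enabling IORING_SETUP_SQPOLL and checking that submissions are consumed without the io_uring_enter call.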

1 Like

Hi everyone, my name is Yuhang, and I am interested in this work. I am a PhD student, and I have conducted research related to SSDs and io_uring. I worked on accelerating on-disk GNN sampling using io_uring, and I see great potential in this library. I understand the importance of improving disk-to-GPU read throughput for out-of-core training and inference, and I see this as a great opportunity to work in a similar field again.

I’m really interested in this project and would love to contribute. I have some experience with systems programming and working with GPUs, and I’m eager to learn more. Could you please point me to any resources or starting points that would help me get involved?

Sorry, the submission period is over.

Is there a link to the commit? I’m curious how this project was implemented.

It hasn’t started yet; it’s just that the application window is over and we’re now making the selections. It’ll hopefully be done by the fall.

Bad news: there was a clerical error during the project allocation phase that left this one out. There’s no way to reverse it, even though it wasn’t on my end, so this project isn’t happening as a GSoC project.

I was wondering about something. I’d guess that an implementation of pread and pwrite for GPUs using io_uring might be fairly straightforward (I’m speculating, not sure).
The issue I see is the kernel thread in io_uring: as far as I understand, the kernel uses a single CPU thread to service the requests in the queue, and this could become a bottleneck given the parallel nature of the GPU. Have you had any ideas on how to overcome this limitation?

I know there is no GSoC funding, but I find the idea compelling and would like to work on it.

Thanks

The polling mode, which is what I find interesting for the GPU case, only uses a single kernel thread per ring as far as I’m aware. You could create multiple io_uring instances and partition them across the GPU somehow; I don’t know whether the interconnect is even fast enough to saturate a single kernel thread.
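Purely as a speculative sketch, the host side of that multi-ring idea could look something like this (NUM_RINGS, make_polled_ring, and ring_index_for are made-up names; note that SQPOLL required elevated privileges on older kernels):

```cpp
#include <linux/io_uring.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstring>

constexpr unsigned NUM_RINGS = 4;

// Create one SQPOLL ring whose kernel poll thread is pinned to a given CPU.
int make_polled_ring(unsigned cpu) {
  struct io_uring_params p;
  std::memset(&p, 0, sizeof(p));
  p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
  p.sq_thread_cpu = cpu;        // dedicate one CPU to this poll thread
  p.sq_thread_idle = 2000;      // ms before the poll thread goes to sleep
  return (int)syscall(__NR_io_uring_setup, 64, &p);
}

// Hypothetical GPU-side mapping of a workgroup onto one of the rings so the
// submission traffic is spread over the available kernel poll threads.
inline unsigned ring_index_for(unsigned workgroup_id) {
  return workgroup_id % NUM_RINGS;
}
```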

As for working on this: without the GSoC platform I’m unfortunately not planning on taking on a mentoring role. I was going to take some time to work on this myself once I finish my current project. If you still want to investigate it independently, feel free to do so and ask questions as needed.

1 Like