Hello everyone,
after a recent (necessary) kernel upgrade on one of our servers, we are experiencing problems with NVIDIA GPUDirect Storage.
The server is running Ubuntu 22.04 with a 6.6.5 kernel.
The installation instructions (CUDA Installation Guide for Linux) note that there are special package version restrictions for servers not running the NVIDIA open kernel driver. As I understand it, the GPUs on this server (V100s) are not supported by the NVIDIA open kernel driver, so we installed nvidia-gds-12-1. The NVIDIA driver version is 535 (packages nvidia-dkms-535, nvidia-driver-535).
Since the nvidia-fs version pulled by apt is too high (2.18.3), we manually installed the DKMS module nvidia-fs 2.17.4 from https://github.com/NVIDIA/gds-nvidia-fs/archive/refs/tags/v2.17.4.zip. We confirmed that the active kernel module is version 2.17.4.
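For reference, here is a minimal sketch of one way to check which version of a module is currently loaded. It assumes the module exports a version under /sys/module/<name>/version (only the case if the module declares one); the helper name loaded_module_version is made up for illustration.

```shell
# Print the version of a loaded kernel module as exposed in sysfs.
# Assumption: the module declares a version, so the sysfs file exists.
loaded_module_version() {
    mod="$1"
    if [ -r "/sys/module/$mod/version" ]; then
        cat "/sys/module/$mod/version"
    else
        echo "not loaded or no version exported"
    fi
}

# The loaded module shows up as nvidia_fs (underscore) in sysfs.
loaded_module_version nvidia_fs
```

Running modinfo -F version nvidia_fs as well shows the version of the module file on disk, which may differ from the loaded one if an older build is still resident.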
In this setup, we can load the nvidia-fs module …
# in dmesg:
nvidia_fs: Initializing nvfs driver module
nvidia_fs: registered correctly with major number 509
… but we cannot run, for example, the gdscheck tool:
$ /usr/local/cuda-12.1/gds/tools/gdscheck -p
Platform verification error :
nvidia-fs driver is not loaded
# in dmesg:
failing symbol_get of non-GPLONLY symbol nvidia_p2p_dma_unmap_pages.
nvidia-fs:Unable to find symbol: nvidia_p2p_dma_unmap_pages
nvidia-fs:Could not load nvidia_p2p* symbols
Here are the symbols included in the running nvidia.ko, which seem to include the reported missing symbol:
$ nm -a nvidia.ko | grep nvidia_p2p_dma
0000000000000134 r __crc_nvidia_p2p_dma_map_pages
0000000000000138 r __crc_nvidia_p2p_dma_unmap_pages
0000000000000070 r __export_symbol_nvidia_p2p_dma_map_pages
0000000000000080 r __export_symbol_nvidia_p2p_dma_unmap_pages
00000000000000d8 r __kstrtabns_nvidia_p2p_dma_map_pages
00000000000000f4 r __kstrtabns_nvidia_p2p_dma_unmap_pages
00000000000000bf r __kstrtab_nvidia_p2p_dma_map_pages
00000000000000d9 r __kstrtab_nvidia_p2p_dma_unmap_pages
000000000000039c r __ksymtab_nvidia_p2p_dma_map_pages
00000000000003a8 r __ksymtab_nvidia_p2p_dma_unmap_pages
000000000000eb10 T nvidia_p2p_dma_map_pages
000000000000d9c0 T nvidia_p2p_dma_unmap_pages
000000000000eb00 T __pfx_nvidia_p2p_dma_map_pages
000000000000d9b0 T __pfx_nvidia_p2p_dma_unmap_pages
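To cut the listing above down to the function definitions themselves, the nm output can be filtered for global text ("T") symbols. This is only a sketch: the helper name defined_p2p_dma_symbols is made up, and the sample input below is copied from the listing above rather than read from the real nvidia.ko.

```shell
# Keep only global text ("T") symbols whose name starts with nvidia_p2p_dma.
# Reads nm-style lines (address, type, name) on stdin.
defined_p2p_dma_symbols() {
    awk '$2 == "T" && $3 ~ /^nvidia_p2p_dma/ { print $3 }'
}

# On the real module this would be: nm nvidia.ko | defined_p2p_dma_symbols
# Here a few lines from the listing above serve as sample input.
printf '%s\n' \
    '0000000000000138 r __crc_nvidia_p2p_dma_unmap_pages' \
    '000000000000eb10 T nvidia_p2p_dma_map_pages' \
    '000000000000d9c0 T nvidia_p2p_dma_unmap_pages' \
    '000000000000eb00 T __pfx_nvidia_p2p_dma_map_pages' \
    | defined_p2p_dma_symbols
```

Both nvidia_p2p_dma_map_pages and nvidia_p2p_dma_unmap_pages pass the filter, so the functions are defined in the module. This filter does not show whether they are exported as GPL-only, which is what the dmesg message refers to.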
Any idea why nvidia_fs is not working as expected?
Let me know if I can provide any additional information, and thanks in advance!
Possibly related post: