I have a set of V100 GPUs in a private data center. Previously, this system ran Ubuntu 18.04 and the nvidia drivers. After updating to Ubuntu 20.04 I am unable to update the nvidia drivers.
$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Attempts to install different versions of the nvidia-driver package all fail the same way: the install stops with the error "Bad return status for module build on kernel", indicating a failed invocation of make(1). I've tried to apt remove the nvidia packages to get the system back to something like a clean state, and I've tried various builds of the driver, including nvidia-driver-550 as well as the version shown here (nvidia-driver-535). I always get the same failure.
nvidia-bug-report.log.gz (233.8 KB)
Reading package lists... Done
Building dependency tree
Reading state information... Done
nvidia-driver-535 is already the newest version (535.161.08-0ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 52 not upgraded.
2 not fully installed or removed.
After this operation, 0 B of additional disk space will be used.
Do you want to continue? [Y/n] y
Setting up nvidia-dkms-535 (535.161.08-0ubuntu1) ...
update-initramfs: deferring update (trigger activated)
A modprobe blacklist file has been created at /etc/modprobe.d to prevent Nouveau
from loading. This can be reverted by deleting the following file:
/etc/modprobe.d/nvidia-graphics-drivers.conf
A new initrd image has also been created. To revert, please regenerate your
initrd by running the following command after deleting the modprobe.d file:
`/usr/sbin/initramfs -u`
*****************************************************************************
*** Reboot your computer and verify that the NVIDIA graphics driver can ***
*** be loaded. ***
*****************************************************************************
INFO:Enable nvidia
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/dell_latitude
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/lenovo_thinkpad
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/put_your_quirks_here
Removing old nvidia-535.161.08 DKMS files...
------------------------------
Deleting module version: 535.161.08
completely from the DKMS tree.
------------------------------
Done.
Loading new nvidia-535.161.08 DKMS files...
Building for 5.15.0-101-generic
Building for architecture x86_64
Building initial module for 5.15.0-101-generic
Error! Bad return status for module build on kernel: 5.15.0-101-generic (x86_64)
Consult /var/lib/dkms/nvidia/535.161.08/build/make.log for more information.
dpkg: error processing package nvidia-dkms-535 (--configure):
installed nvidia-dkms-535 package post-installation script subprocess returned error exit status 10
dpkg: dependency problems prevent configuration of nvidia-driver-535:
nvidia-driver-535 depends on nvidia-dkms-535 (= 535.161.08-0ubuntu1); however:
Package nvidia-dkms-535 is not configured yet.
dpkg: error processing package nvidia-driver-535 (--configure):
dependency problems - leaving unconfigured
No apport report written because the error message indicates its a followup error from a previous failure.
Processing triggers for initramfs-tools (0.136ubuntu6.7) ...
update-initramfs: Generating /boot/initrd.img-5.15.0-101-generic
Errors were encountered while processing:
nvidia-dkms-535
nvidia-driver-535
E: Sub-process /usr/bin/dpkg returned an error code (1)
Looking at the make.log file, I always find the same error at the end of the log:
make -f ./scripts/Makefile.modpost
sed 's/\.ko$/\.o/' /var/lib/dkms/nvidia/535.161.08/build/modules.order | scripts/mod/modpost -m -a -o /var/lib/dkms/nvidia/535.161.08/build/Module.symvers -e -i Module.symvers -i /usr/src/ofa_kernel/default/Module.symvers -T -
FATAL: modpost: parse error in symbol dump file
make[2]: *** [scripts/Makefile.modpost:133: /var/lib/dkms/nvidia/535.161.08/build/Module.symvers] Error 1
make[1]: *** [Makefile:1830: modules] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-5.15.0-101-generic'
make: *** [Makefile:82: modules] Error 2
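One thing that stands out in the modpost command above is the extra symbol dump file /usr/src/ofa_kernel/default/Module.symvers passed with -i. Below is a minimal sketch for checking whether such a file is in a shape modpost can parse. It assumes that modpost on a 5.15 kernel expects at least five tab-separated fields per line (CRC, symbol, module, export type, namespace, where the namespace may be empty), and that a line with fewer fields is what triggers the parse error. The function name check_symvers is hypothetical and only for illustration.

```shell
#!/bin/sh
# Report lines in a Module.symvers-style file that have fewer than five
# tab-separated fields (CRC, symbol, module, export type, namespace).
# Assumption: this mirrors what modpost on a 5.15 kernel expects; the
# namespace field may be empty, but the tab before it must be present.
# Prints "file:line: N field(s)" for each short line and returns non-zero
# if any such line was found.
check_symvers() {
    awk -F '\t' '
        NF < 5 { printf "%s:%d: %d field(s)\n", FILENAME, FNR, NF; bad = 1 }
        END    { exit bad + 0 }
    ' "$1"
}
```

It could be run as `check_symvers /usr/src/ofa_kernel/default/Module.symvers`. If it reports lines with only four fields, that would suggest the file was generated for an older kernel (possibly left over from before the upgrade) and is a likely candidate for the parse error, though that is a guess until the file has actually been inspected.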
Here are the GPUs in this system:
$ lspci -nnk | grep -i nvid
1f:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)
	Subsystem: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1249]
	Kernel modules: nvidiafb, nouveau
65:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)
	Subsystem: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1249]
	Kernel modules: nvidiafb, nouveau
b6:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)
	Subsystem: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1249]
	Kernel modules: nvidiafb, nouveau
df:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)
	Subsystem: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1249]
	Kernel modules: nvidiafb, nouveau