Unable to install nvidia-driver on Ubuntu 20.04 with V100 GPUs - "parse error in symbol dump file"

I have a set of V100 GPUs in a private data center. Previously, this system ran Ubuntu 18.04 with the NVIDIA drivers installed. After upgrading to Ubuntu 20.04, I am unable to install the NVIDIA drivers.

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Attempts to install different versions of the nvidia-driver package all fail the same way: the install stops with the error "Bad return status for module build on kernel", which indicates a failed invocation of make(1). I've tried to apt remove the nvidia packages to get the system back to something like a clean state, and I've tried various builds of the driver, including nvidia-driver-550 as well as the version shown here (nvidia-driver-535). I always get the same failure.

nvidia-bug-report.log.gz (233.8 KB)

Reading package lists... Done
Building dependency tree
Reading state information... Done
nvidia-driver-535 is already the newest version (535.161.08-0ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 52 not upgraded.
2 not fully installed or removed.
After this operation, 0 B of additional disk space will be used.
Do you want to continue? [Y/n] y
Setting up nvidia-dkms-535 (535.161.08-0ubuntu1) ...
update-initramfs: deferring update (trigger activated)

A modprobe blacklist file has been created at /etc/modprobe.d to prevent Nouveau
from loading. This can be reverted by deleting the following file:
/etc/modprobe.d/nvidia-graphics-drivers.conf

A new initrd image has also been created. To revert, please regenerate your
initrd by running the following command after deleting the modprobe.d file:
`/usr/sbin/initramfs -u`

*****************************************************************************
*** Reboot your computer and verify that the NVIDIA graphics driver can   ***
*** be loaded.                                                            ***
*****************************************************************************

INFO:Enable nvidia
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/dell_latitude
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/lenovo_thinkpad
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/put_your_quirks_here
Removing old nvidia-535.161.08 DKMS files...

------------------------------
Deleting module version: 535.161.08
completely from the DKMS tree.
------------------------------
Done.
Loading new nvidia-535.161.08 DKMS files...
Building for 5.15.0-101-generic
Building for architecture x86_64
Building initial module for 5.15.0-101-generic
Error! Bad return status for module build on kernel: 5.15.0-101-generic (x86_64)
Consult /var/lib/dkms/nvidia/535.161.08/build/make.log for more information.
dpkg: error processing package nvidia-dkms-535 (--configure):
 installed nvidia-dkms-535 package post-installation script subprocess returned error exit status 10
dpkg: dependency problems prevent configuration of nvidia-driver-535:
 nvidia-driver-535 depends on nvidia-dkms-535 (= 535.161.08-0ubuntu1); however:
  Package nvidia-dkms-535 is not configured yet.

dpkg: error processing package nvidia-driver-535 (--configure):
 dependency problems - leaving unconfigured
No apport report written because the error message indicates its a followup error from a previous failure.
Processing triggers for initramfs-tools (0.136ubuntu6.7) ...
update-initramfs: Generating /boot/initrd.img-5.15.0-101-generic
Errors were encountered while processing:
 nvidia-dkms-535
 nvidia-driver-535
E: Sub-process /usr/bin/dpkg returned an error code (1)

Looking into the make.log file, I always find the error at the end of the log:

make -f ./scripts/Makefile.modpost
  sed 's/\.ko$/\.o/' /var/lib/dkms/nvidia/535.161.08/build/modules.order | scripts/mod/modpost -m -a  -o /var/lib/dkms/nvidia/535.161.08/build/Module.symvers -e -i Module.symvers -i /usr/src/ofa_kernel/default/Module.symvers   -T -
FATAL: modpost: parse error in symbol dump file
make[2]: *** [scripts/Makefile.modpost:133: /var/lib/dkms/nvidia/535.161.08/build/Module.symvers] Error 1
make[1]: *** [Makefile:1830: modules] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-5.15.0-101-generic'
make: *** [Makefile:82: modules] Error 2
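One way to narrow this down, as a rough sketch rather than a confirmed fix: modpost reads every file passed with -i, so the parse error could come from either of those symbol dump files. The check below assumes the 5.15 modpost wants five tab-separated fields per line (CRC, symbol, module, export type, namespace, where the namespace may be empty) and that the relative "-i Module.symvers" resolves inside the kernel headers directory that make reports leaving. The helper name count_short_lines is made up for illustration. A non-zero count for /usr/src/ofa_kernel/default/Module.symvers would suggest that file was generated for an older kernel (that path usually belongs to a Mellanox OFED install, which is also an assumption here).

```shell
#!/bin/sh
# Count lines in a symbol dump file that have fewer than 5 tab-separated
# fields. Assumption: the 5.15 modpost expects CRC, symbol, module,
# export type, and namespace (possibly empty) on every line.
count_short_lines() {
    awk -F'\t' 'NF < 5 { n++ } END { print n + 0 }' "$1"
}

# The two files passed to modpost with -i in the make.log excerpt.
# The first path assumes the relative name is resolved in the headers dir.
for f in \
    /usr/src/linux-headers-5.15.0-101-generic/Module.symvers \
    /usr/src/ofa_kernel/default/Module.symvers
do
    if [ -r "$f" ]; then
        echo "$f: $(count_short_lines "$f") short line(s)"
    else
        echo "$f: not readable, skipped"
    fi
done
```

If one file reports short lines and the other reports none, that file is the likelier source of the "parse error in symbol dump file" message.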

Here are the GPUs in this system:

$ lspci -nnk | grep -i nvid
1f:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)
        Subsystem: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1249]
        Kernel modules: nvidiafb, nouveau
65:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)
        Subsystem: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1249]
        Kernel modules: nvidiafb, nouveau
b6:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)
        Subsystem: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1249]
        Kernel modules: nvidiafb, nouveau
df:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)
        Subsystem: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1249]
        Kernel modules: nvidiafb, nouveau

Did you manage to fix it? I encountered the same problem on Ubuntu 20.04.

I haven't found a fix for this yet.