Nvidia driver mismatch

Hi Everyone,

I have one server with the following configuration:

  • OS: VMware ESXi 6.5
  • VM: Ubuntu 20.04.3 LTS
    CUDA version: 11.4
    Driver version: 470.223.02
  • GPU card: Tesla T4, configured for passthrough

For about the last 4 months, the operating system has occasionally reported that the GPU card cannot be detected, which causes an application error. Restarting the VM restores normal operation, but the failure recurs at random every few weeks.

I checked and saw these warnings:

  • On the VMs that use the GPU card:
    Warning: Failed to initialize NVML: Driver/library version mismatch
  • On the server hardware management application:
    Accelerator in Slot 1 has OS driver missing or not in persistent mode so power sensor is unknown
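
For anyone hitting the same NVML warning: it generally means the NVIDIA kernel module that is currently loaded and the user-space NVML library on disk report different versions. Below is a minimal Python sketch for checking that on the VM while the error is present. It rests on assumptions not stated in this thread: that the loaded module version can be read from /proc/driver/nvidia/version in the usual "Kernel Module  <version>" form, and that the NVML library is installed under /usr/lib/x86_64-linux-gnu with the version in its file name. The helper names are made up for illustration.

```python
import glob
import re

# Assumed locations; adjust them if your distribution installs elsewhere.
PROC_VERSION_FILE = "/proc/driver/nvidia/version"
NVML_GLOB = "/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.*"


def kernel_module_version(proc_text):
    """Extract the driver version from the text of /proc/driver/nvidia/version.

    Assumes the proprietary module's line format, where the version number
    directly follows the words "Kernel Module".
    """
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", proc_text)
    return match.group(1) if match else None


def library_versions(paths):
    """Collect versions from file names such as libnvidia-ml.so.470.223.02.

    Names with a single numeric suffix (for example libnvidia-ml.so.1,
    normally a symlink) are skipped because they carry no driver version.
    """
    versions = set()
    for path in paths:
        match = re.search(r"libnvidia-ml\.so\.(\d+(?:\.\d+)+)$", path)
        if match:
            versions.add(match.group(1))
    return versions


def main():
    try:
        with open(PROC_VERSION_FILE) as fh:
            loaded = kernel_module_version(fh.read())
    except OSError:
        loaded = None  # module not loaded, or the file is not readable

    installed = library_versions(glob.glob(NVML_GLOB))
    print("loaded kernel module :", loaded)
    print("installed NVML libs  :", sorted(installed))
    if loaded and installed and loaded not in installed:
        print("Versions differ: the driver packages were probably updated "
              "while the old kernel module stayed loaded.")


if __name__ == "__main__":
    main()
```

If the two versions differ, it is worth checking whether automatic package updates touched the driver packages shortly before the failure. That would also fit the observation that a VM restart, which loads the new module, makes the problem go away for a while.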

Could you please check what error the server is experiencing with the GPU card and how to resolve it?
Many thanks

Please run nvidia-bug-report.sh as root after the issue has appeared and attach the resulting nvidia-bug-report.log.gz file to your post.

nvidia-bug-report.log (1).gz (487.0 KB)

Please take a look and help me.
Many thanks

Not much to be seen in the log. According to nvidia-smi, the T4 is working properly, and there are no library mismatches, which usually happen on driver updates. The last boot was on December 21st. Unfortunately, log persistence seems to be turned off, so there’s no data from previous boots.
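
To have data from earlier boots next time, the journal needs to be persistent. Here is a rough Python sketch for checking that on the VM, assuming the logs are handled by systemd-journald with the main config at /etc/systemd/journald.conf. It follows the documented meaning of the Storage= option: "persistent" always writes to disk, "auto" (the default) writes to disk only if /var/log/journal exists, and "volatile" or "none" never do. Drop-in files under journald.conf.d are not read here, and the function names are only illustrative.

```python
import os
import re

# Assumed default paths for systemd-journald.
JOURNALD_CONF = "/etc/systemd/journald.conf"
JOURNAL_DIR = "/var/log/journal"


def journald_storage_setting(conf_text):
    """Return the last uncommented Storage= value, or "auto" if none is set."""
    value = "auto"  # journald's documented default
    for line in conf_text.splitlines():
        # Lines starting with "#" do not match, so commented defaults are skipped.
        match = re.match(r"\s*Storage\s*=\s*(\S+)", line)
        if match:
            value = match.group(1).lower()
    return value


def journal_is_persistent(storage, journal_dir_exists):
    """Decide whether logs survive a reboot for the given Storage= value."""
    if storage == "persistent":
        return True
    if storage == "auto":
        return journal_dir_exists
    return False  # "volatile" and "none" keep nothing on disk


def main():
    try:
        with open(JOURNALD_CONF) as fh:
            storage = journald_storage_setting(fh.read())
    except OSError:
        storage = "auto"  # no config file means defaults apply
    persistent = journal_is_persistent(storage, os.path.isdir(JOURNAL_DIR))
    print("Storage setting :", storage)
    print("Persistent logs :", persistent)


if __name__ == "__main__":
    main()
```

If this reports that logs are not persistent, setting Storage=persistent in journald.conf (or creating /var/log/journal) and restarting systemd-journald should keep the previous boots, so the next bug report taken after a failure would include them.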