NVIDIA_Drivers/help/common_issues.md
Will Russell 89cb78d1bd Update common_issues.md
appended instruction on what to do if you can't unload the module set for nvidia, causing a failed NVME mismatch continuously.
2022-10-02 11:52:29 -04:00

driver_mismatch.sh exists to try to automate some of this troubleshooting, but keeping a manual doc page around doesn't hurt.

ISSUE:

Failed to initialize NVML: Driver/library version mismatch

The error message

NVML: Driver/library version mismatch

tells us that the NVIDIA kernel module (kmod) currently loaded is the wrong version, so we need to unload it and then load the correct version. How do we do that?
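Before unloading anything, it can help to confirm the mismatch. The sketch below compares the version of the loaded module (read from `/proc/driver/nvidia/version`) with the version of the module installed on disk (from `modinfo`). It assumes the on-disk module matches the updated user-space libraries, which is typical right after a package upgrade; the helper names `loaded_kmod_version` and `compare_versions` are hypothetical and are not part of driver_mismatch.sh.

```shell
# Extract the version from the "NVRM version" line of
# /proc/driver/nvidia/version, supplied on stdin.
loaded_kmod_version() {
    grep 'NVRM version' | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Report whether the loaded module and the installed module agree.
compare_versions() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "mismatch: loaded=$1 installed=$2"
    fi
}

# On an affected machine (commented out so the sketch is safe to paste):
# loaded=$(loaded_kmod_version < /proc/driver/nvidia/version)
# installed=$(modinfo -F version nvidia)
# compare_versions "$loaded" "$installed"
```

If the two versions differ, the running kernel is still holding the old module and the steps below apply.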

RESOLVING:

First, we should know which drivers are loaded.

    lsmod | grep ^nvidia

You should get output similar to the following. The module name is on the left; the modules that use it (and therefore depend on it) are on the right.

    nvidia_uvm            634880  8
    nvidia_drm             53248  0
    nvidia_modeset        790528  1 nvidia_drm
    nvidia              12312576  86 nvidia_modeset,nvidia_uvm

Our final goal is to unload the nvidia module, so we must first unload the modules that depend on it:

    sudo rmmod nvidia_uvm
    sudo rmmod nvidia_drm
    sudo rmmod nvidia_modeset

Then unload nvidia:

    sudo rmmod nvidia
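The unload order above can be sketched as a small helper that prints only the NVIDIA modules that are actually loaded, dependents first. This is an illustration, not what driver_mismatch.sh does; `nvidia_unload_order` is a hypothetical name, and the module list is assumed to be the four modules shown in the sample output.

```shell
# Read lsmod-style text on stdin and print the loaded NVIDIA modules
# in a safe unload order: dependents first, nvidia itself last.
nvidia_unload_order() {
    loaded=$(awk '{ print $1 }')
    for mod in nvidia_uvm nvidia_drm nvidia_modeset nvidia; do
        # -x matches whole lines only, so "nvidia" does not match "nvidia_uvm".
        if printf '%s\n' "$loaded" | grep -qx "$mod"; then
            echo "$mod"
        fi
    done
}

# Usage (commented out because it really unloads the modules):
# for mod in $(lsmod | nvidia_unload_order); do
#     sudo rmmod "$mod" || break
# done
```

Stopping at the first failed `rmmod` keeps the "module is in use" error visible instead of burying it under follow-on failures.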

Troubleshooting

If you get an error like `rmmod: ERROR: Module nvidia is in use`, the kernel module is still held by a process and cannot be unloaded. Find the processes that are using the kmod:

`sudo lsof /dev/nvidia*`
or
`sudo lsof | grep nvidia`
or
`sudo ps -ef | grep nvidia`

Kill those processes, then continue unloading the kmods.
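The PIDs holding the device files can be pulled out of the `lsof` output before killing anything. A minimal sketch, assuming the standard `lsof` column layout (header on the first line, PID in the second column); the `pids_from_lsof` helper name is hypothetical.

```shell
# Read lsof output on stdin and print each PID once, sorted numerically.
pids_from_lsof() {
    awk 'NR > 1 { print $2 }' | sort -un
}

# Usage (commented out; review the PIDs before killing anything):
# sudo lsof /dev/nvidia* | pids_from_lsof
# sudo kill <pid>
```

Reviewing the list first matters because the display server is often among these processes, and killing it ends the graphical session.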

Confirm that the kmods were unloaded with another check:

`lsmod | grep ^nvidia`

You should get no output. Then confirm that the correct driver loads by running the command-line tool, which should load the module again:

`nvidia-smi`

If you are still unable to reload the module, restart the node, or revalidate and re-install your driver set to ensure that no older modules are being called by newer drivers. This is occasionally caused by an install or update in which the old versions were not fully purged during cleanup.

It's not working, I can't unload the modules

Try booting to the multi-user target instead of the graphical target. This releases the graphical session's hold on your graphics card and should allow you to remove the modules:

    $ sudo systemctl set-default multi-user.target

Once the default is set, reboot the machine and log in at the terminal (you will have no graphical interface; this is expected). Then find and re-run the scripted fix, driver_mismatch.sh, or follow the steps above again.

To reset your graphical target after you have validated with nvidia-smi that things are working properly, run:

    $ sudo systemctl set-default graphical.target

If the above still does not work, I highly advise removing your NVIDIA installation; an older build is probably still being referenced.

If you've installed via runfile at any point:

    $ sudo /usr/bin/nvidia-uninstall

Otherwise, purge the installation:

    sudo apt remove 'nvidia-*' -y    # quoted so the shell does not expand the glob
    sudo apt autoremove
    dpkg -l | grep nvidia
    sudo apt remove <remaining-package-names-if-applicable>
    lsmod | grep ^nvidia             # there should be none loaded
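The "remaining packages" step can be sketched as a filter over the `dpkg -l` listing. This assumes the usual `dpkg -l` layout (status in the first column, package name in the second) and only reports packages that are still installed (`ii`); the `remaining_nvidia_packages` helper name is hypothetical.

```shell
# Read `dpkg -l` output on stdin and print the names of still-installed
# packages whose name contains "nvidia".
remaining_nvidia_packages() {
    awk '$1 == "ii" && $2 ~ /nvidia/ { print $2 }'
}

# Usage (commented out; check the list before removing anything):
# dpkg -l | remaining_nvidia_packages
# sudo apt remove <names printed above>
```

Packages in the `rc` state (removed, configuration files left behind) are deliberately skipped, since `apt remove` has nothing left to do for them.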

Then re-install using the NVIDIA_drivers.sh script.