When dealing with issues related to AMD GPU drivers not loading after rebooting a VM configured with GPU passthrough on Virt-Manager, several factors could be at play. Here’s a step-by-step guide to diagnose and potentially resolve this issue:
Step-by-Step Troubleshooting Guide
1. Verify Host Configuration for IOMMU and VFIO
- Enable IOMMU in BIOS:
  - For Intel systems, look for VT-d in BIOS settings.
  - For AMD systems, look for AMD-Vi (often listed simply as IOMMU), and make sure SVM, the CPU virtualization setting, is enabled as well.
- Enable IOMMU in the Kernel:
Edit /etc/default/grub and add the following to GRUB_CMDLINE_LINUX_DEFAULT:
intel_iommu=on iommu=pt    # For Intel
amd_iommu=on iommu=pt      # For AMD
Update GRUB and reboot:
sudo update-grub
sudo reboot
- Bind GPU to VFIO-PCI:
Identify the IDs of your GPU and its audio device with:
lspci -nn | grep -i vga
lspci -nn | grep -i audio
Create or edit /etc/modprobe.d/vfio.conf:
options vfio-pci ids=xxxx:yyyy,aaaa:bbbb
Replace xxxx:yyyy and aaaa:bbbb with the respective vendor:device IDs.
- Blacklist Host GPU Drivers:
Prevent the host from loading its own drivers for the GPU. Skip this if the host itself displays through another AMD GPU, since blacklisting amdgpu disables that card too. Add the following to /etc/modprobe.d/blacklist.conf:
blacklist amdgpu
blacklist radeon
Rebuild the initramfs and reboot:
sudo update-initramfs -u
sudo reboot
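To confirm that the kernel parameters took effect after the reboot, and to avoid copying device IDs by hand, the host configuration steps above can be sketched as two small shell helpers. This is a minimal sketch rather than part of the original procedure: the function names `has_kernel_param` and `vfio_ids_line` are hypothetical, and the second one assumes the usual `lspci -nn` layout in which the `[vendor:device]` pair is the last bracketed ID on each line.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not from any package.

# has_kernel_param PARAM CMDLINE
# Succeeds if PARAM appears as a whole word in the kernel command line.
has_kernel_param() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# vfio_ids_line
# Reads `lspci -nn` lines on stdin and prints the matching
# "options vfio-pci ids=..." line for /etc/modprobe.d/vfio.conf.
vfio_ids_line() {
    # Greedy ".*" keeps the last [xxxx:yyyy] pair on each line.
    ids=$(sed -nE 's/.*\[([0-9a-f]{4}:[0-9a-f]{4})\].*/\1/p' | paste -sd, -)
    [ -n "$ids" ] || return 1
    printf 'options vfio-pci ids=%s\n' "$ids"
}
```

For example, `has_kernel_param iommu=pt "$(cat /proc/cmdline)"` checks the running kernel, and `lspci -nn -s 03:00 | vfio_ids_line` builds the options line for every function of the device at bus address 03:00 (an example address; substitute your own). Review the generated line before writing it to the configuration file.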
2. Verify the GPU Device Isolation
- Ensure the GPU is correctly isolated and not bound by the host’s drivers. Check the output of lspci -k and confirm that the “Kernel driver in use” line for the GPU shows vfio-pci.
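The isolation check above can be scripted so that it gives a yes/no answer instead of requiring a visual scan. The following is a minimal sketch; `driver_in_use` and `bound_to_vfio` are hypothetical helper names, and the parsing assumes the "Kernel driver in use:" line that `lspci -k` prints for a bound device.

```shell
#!/bin/sh
# driver_in_use (hypothetical helper name)
# Reads `lspci -k` output for one device on stdin and prints the bound driver.
# Prints nothing if no driver is bound.
driver_in_use() {
    sed -n 's/^[[:space:]]*Kernel driver in use: //p'
}

# bound_to_vfio
# Succeeds only if the driver reported on stdin is vfio-pci.
bound_to_vfio() {
    [ "$(driver_in_use)" = "vfio-pci" ]
}
```

A possible invocation is `lspci -k -s 03:00.0 | bound_to_vfio && echo "GPU is isolated"`, where 03:00.0 is an example address to be replaced with your GPU’s PCI address. Run it once per function (video and audio), since the helper expects the output for a single device.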
3. Check VM Configuration in Virt-Manager
- PCI Host Device:
Verify that the GPU is added as a “PCI Host Device” in the VM configuration under “Add Hardware.”
- Firmware:
Ensure the VM is using the Q35 chipset and OVMF (UEFI) firmware, which are often required for modern GPU passthrough.
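The same settings can be checked from the command line by inspecting the domain XML that `virsh dumpxml` prints. The sketch below is a rough heuristic, not an authoritative validator: `check_domain_xml` is a hypothetical helper name, and the text patterns assume the single-quoted attribute style libvirt normally emits and a firmware path or attribute that mentions OVMF or `firmware='efi'`.

```shell
#!/bin/sh
# check_domain_xml (hypothetical helper)
# Reads libvirt domain XML on stdin and reports which of the settings
# described above appear to be missing. Returns non-zero if any are.
check_domain_xml() {
    xml=$(cat)
    status=0
    printf '%s\n' "$xml" | grep -Eq "machine='(pc-)?q35" ||
        { echo "chipset does not look like Q35"; status=1; }
    printf '%s\n' "$xml" | grep -Eiq "firmware='efi'|ovmf" ||
        { echo "no UEFI/OVMF firmware found"; status=1; }
    printf '%s\n' "$xml" | grep -Eq "<hostdev[^>]*type='pci'" ||
        { echo "no PCI host device attached"; status=1; }
    return "$status"
}
```

It could be used as `sudo virsh dumpxml your-vm-name | check_domain_xml`. A clean result only means the three settings are present; it does not confirm that the attached host device is the GPU.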
4. Monitor and Analyze VM Logs
- Check VM logs for errors related to GPU passthrough:
cat /var/log/libvirt/qemu/<vm-name>.log
- Review host system logs using:
sudo dmesg | grep -i iommu
sudo journalctl -xe | grep -i vfio
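Because the relevant messages are spread across the VM log, the kernel ring buffer, and the journal, it can help to run all three through one filter. This is a minimal sketch; `filter_passthrough_lines` is a hypothetical name, and the keyword list (vfio, iommu, amdgpu) is an assumption that may need extending for your setup.

```shell
#!/bin/sh
# filter_passthrough_lines (hypothetical helper)
# Reads log text on stdin and keeps only lines that mention VFIO, the IOMMU,
# or the amdgpu driver. Matching is case-insensitive; an empty result is not
# treated as an error.
filter_passthrough_lines() {
    grep -iE 'vfio|iommu|amdgpu' || true
}
```

For example: `sudo dmesg | filter_passthrough_lines`, or `sudo cat /var/log/libvirt/qemu/your-vm-name.log | filter_passthrough_lines` with the VM name substituted.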
5. Ensure Proper GPU Reset Handling
- Reset Scripts:
Some GPUs require a reset to be properly reinitialized after VM shutdown. A reset script, run as root, might look like:
echo 1 > /sys/bus/pci/devices/0000:xx:00.0/remove
echo 1 > /sys/bus/pci/rescan
Replace xx:00.0 with your GPU’s PCI address.
- Libvirt Hooks:
Automate the GPU reset by placing the script in /etc/libvirt/hooks/qemu, for example:
#!/bin/bash
if [ "$1" = "your-vm-name" ] && [ "$2" = "stopped" ]; then
    echo 1 > /sys/bus/pci/devices/0000:xx:00.0/remove
    echo 1 > /sys/bus/pci/rescan
fi
Make sure to give it execute permissions:
sudo chmod +x /etc/libvirt/hooks/qemu
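A slightly more defensive variant of the hook checks that the device exists before removing it, so a wrong address produces an error message instead of a silent failure. This is a sketch under stated assumptions: `reset_gpu`, `VM_NAME`, `GPU_ADDR`, and the `SYSFS_PCI` override are hypothetical names introduced here, and the VM name and PCI address are the same placeholders used above.

```shell
#!/bin/bash
# Placeholders: replace with your VM name and your GPU's PCI address.
VM_NAME="your-vm-name"
GPU_ADDR="0000:xx:00.0"
# Hypothetical override so the function can be exercised against a scratch
# directory; on a real host this stays at /sys/bus/pci.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"

# reset_gpu ADDR
# Removes the device at ADDR from the PCI bus, then rescans the bus.
reset_gpu() {
    local dev="$SYSFS_PCI/devices/$1"
    if [ ! -e "$dev/remove" ]; then
        echo "device $1 not found under $SYSFS_PCI" >&2
        return 1
    fi
    echo 1 > "$dev/remove"
    echo 1 > "$SYSFS_PCI/rescan"
}

# libvirt passes the guest name as $1 and the operation as $2.
if [ "${1:-}" = "$VM_NAME" ] && [ "${2:-}" = "stopped" ]; then
    reset_gpu "$GPU_ADDR"
fi
```

As with the original hook, the file must be executable, and libvirt runs it with root privileges. If the GPU’s audio function also needs a reset, `reset_gpu` can be called a second time with that function’s address.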
6. Update AMD Drivers in the VM
- Guest OS:
Inside the VM, update to the latest AMD drivers available for your OS. This can usually be done through the package manager or by downloading them from AMD’s website.
- Ensure Compatibility:
Make sure the guest OS and its drivers are compatible with your GPU and the virtualization setup.
7. Test with a Different Kernel or GPU
- Kernel:
Some kernel versions may have better support for GPU passthrough. Test with a newer or different kernel version.
- GPU:
If possible, try using a different GPU to determine if the issue is specific to your current GPU model.
Commonly Used Commands and Logs
- Checking IOMMU Groups:
find /sys/kernel/iommu_groups/ -type l
- Listing PCI Devices and Their Drivers:
lspci -k
- System Logs:
sudo dmesg | grep -i vfio
sudo journalctl -xe | grep -i iommu
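The raw `find` output for IOMMU groups is hard to read, so a small wrapper that prints one line per device, sorted by group number, can make it easier to see which devices share a group with the GPU. This is a minimal sketch; `list_iommu_groups` is a hypothetical name, and its optional argument exists only so the function can be pointed at a test directory.

```shell
#!/bin/sh
# list_iommu_groups [ROOT]
# Prints one "group <n>: <pci-address>" line per device, sorted by group
# number. ROOT defaults to the real sysfs location.
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue          # no groups: the glob did not expand
        group="${dev%/devices/*}"          # strip "/devices/<address>"
        printf 'group %s: %s\n' "${group##*/}" "${dev##*/}"
    done | sort -k2,2n
}
```

Running `list_iommu_groups` with no arguments reads the live system. Ideally the GPU and its audio function appear in a group that contains no unrelated devices, because devices in the same group generally have to be passed through together.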
Conclusion
AMD GPU drivers that fail to load after a VM reboot with GPU passthrough can be challenging to fix, because the cause often lies in driver or hardware reset issues. Verifying the IOMMU and VFIO configuration, checking the VM configuration, handling the GPU reset, and analyzing the VM and host logs, as described above, should resolve the problem in most cases.