When dealing with issues related to AMD GPU drivers not loading after rebooting a VM configured with GPU passthrough on Virt-Manager, several factors could be at play. Here’s a step-by-step guide to diagnose and potentially resolve this issue:
Step-by-Step Troubleshooting Guide
1. Verify Host Configuration for IOMMU and VFIO
- Enable IOMMU in BIOS:
  - For Intel systems, look for VT-d in BIOS settings.
  - For AMD systems, look for AMD-Vi or IOMMU. SVM is the separate CPU virtualization setting and should be enabled as well.
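- Enable IOMMU in the Kernel:
  Edit /etc/default/grub and add the flags for your CPU vendor to GRUB_CMDLINE_LINUX_DEFAULT:
      For Intel: intel_iommu=on iommu=pt
      For AMD:   amd_iommu=on iommu=pt
  On AMD hosts the kernel normally enables the IOMMU on its own once it is switched on in firmware, so amd_iommu=on is generally optional.
  Update GRUB and reboot (update-grub is specific to Debian and Ubuntu derivatives; other distributions regenerate the config with grub-mkconfig or grub2-mkconfig):
      sudo update-grub
      sudo reboot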
- Bind GPU to VFIO-PCI:
  Identify the vendor:device IDs of your GPU and its audio function with:
      lspci -nn | grep -i vga
      lspci -nn | grep -i audio
  Create or edit /etc/modprobe.d/vfio.conf:
      options vfio-pci ids=xxxx:yyyy,aaaa:bbbb
  Replace xxxx:yyyy and aaaa:bbbb with the respective IDs.
- Blacklist Host GPU Drivers:
  Prevent the host from loading its own drivers for the GPU. Add the following to /etc/modprobe.d/blacklist.conf:
      blacklist amdgpu
      blacklist radeon
  Only do this if the host does not need amdgpu or radeon for its own display, because blacklisting disables the driver for every AMD GPU in the host.
  Rebuild the initramfs and reboot:
      sudo update-initramfs -u
      sudo reboot
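After the reboot, the kernel flags and the vfio.conf line from step 1 can be checked or generated with a couple of small helpers. This is a minimal sketch, not a standard tool: the function names check_iommu_cmdline and vfio_options_for_slot are made up for illustration, and the slot 0b:00 in the usage note is a placeholder for your own GPU's bus address.

```bash
#!/usr/bin/env bash
# Helper names below are illustrative assumptions, not part of any package.

# Report which IOMMU flags appear in a kernel command line string.
check_iommu_cmdline() {
    local cmdline="$1" vendor_flag="missing" pt="no" word
    local -a words
    read -r -a words <<< "$cmdline"
    for word in "${words[@]}"; do
        case "$word" in
            intel_iommu=on|amd_iommu=on) vendor_flag="$word" ;;
            iommu=pt) pt="yes" ;;
        esac
    done
    echo "vendor_flag=${vendor_flag} passthrough_mode=${pt}"
}

# Read `lspci -nn` output on stdin and print a vfio.conf line covering
# every function of the given slot (for example 0b:00 matches 0b:00.0 and 0b:00.1).
vfio_options_for_slot() {
    local slot="$1" ids
    ids="$(awk -v slot="$slot" 'index($1, slot ".") == 1' \
        | sed -nE 's/.*\[([0-9a-f]{4}:[0-9a-f]{4})\].*/\1/p' \
        | paste -sd, -)"
    # Fail rather than emit an empty ids= line when nothing matched.
    [ -n "$ids" ] || return 1
    echo "options vfio-pci ids=${ids}"
}
```

On the host, something like check_iommu_cmdline "$(cat /proc/cmdline)" shows whether the flags took effect, and lspci -nn | vfio_options_for_slot 0b:00 prints the line to place in /etc/modprobe.d/vfio.conf. Review the printed IDs before writing the file, since every function in that slot is included.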
2. Verify the GPU Device Isolation
- Ensure the GPU is correctly isolated and not bound by the host’s drivers. Check the output of lspci -k and confirm that the GPU’s “Kernel driver in use” line reads vfio-pci.
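The same check can be scripted by reading the driver symlink that the kernel exposes in sysfs for each PCI device. The sketch below assumes a hypothetical helper name, pci_driver_in_use, and takes an optional sysfs root purely so it can be tried against a scratch directory.

```bash
#!/usr/bin/env bash
# Print the kernel driver bound to a PCI device, or "none" if no driver is bound.
# addr is the full PCI address including the domain, e.g. 0000:0b:00.0 (placeholder).
# The second argument defaults to /sys and exists only for testing.
pci_driver_in_use() {
    local addr="$1" sysfs="${2:-/sys}" link
    link="${sysfs}/bus/pci/devices/${addr}/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name, e.g. .../drivers/vfio-pci
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}
```

For a correctly isolated card, pci_driver_in_use with your GPU's address should print vfio-pci for both the video and the audio function.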
3. Check VM Configuration in Virt-Manager
- PCI Host Device:
  Verify that the GPU is added as a “PCI Host Device” in the VM configuration under “Add Hardware.”
- Firmware:
  Ensure the VM is using the Q35 chipset and OVMF (UEFI) firmware, which are often required for modern GPU passthrough.
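These settings can also be confirmed from the command line by inspecting the domain XML that Virt-Manager writes. The following is a rough sketch with hypothetical helper names (xml_has, check_passthrough_xml); it pattern-matches the XML text rather than parsing it, so treat a failure as a prompt to look at the XML by hand.

```bash
#!/usr/bin/env bash
# Return success if the XML text in $1 matches the extended regex in $2.
xml_has() { grep -qE -- "$2" <<< "$1"; }

# Read libvirt domain XML on stdin and report the three passthrough prerequisites.
# Returns non-zero if any of them appears to be missing.
check_passthrough_xml() {
    local xml status=0
    xml="$(cat)"
    if xml_has "$xml" "machine=['\"][^'\"]*q35"; then
        echo "chipset: ok"
    else
        echo "chipset: not Q35"; status=1
    fi
    # UEFI shows up either as firmware='efi' or as a pflash loader element.
    if xml_has "$xml" "firmware=['\"]efi['\"]|<loader [^>]*type=['\"]pflash['\"]"; then
        echo "firmware: ok"
    else
        echo "firmware: no UEFI loader found"; status=1
    fi
    if xml_has "$xml" "<hostdev [^>]*type=['\"]pci['\"]"; then
        echo "hostdev: ok"
    else
        echo "hostdev: no PCI host device found"; status=1
    fi
    return "$status"
}
```

A typical invocation would be virsh dumpxml <vm-name> | check_passthrough_xml.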
4. Monitor and Analyze VM Logs
- Check VM logs for errors related to GPU passthrough:
      cat /var/log/libvirt/qemu/<vm-name>.log
- Review host system logs using:
      sudo dmesg | grep -i iommu
      sudo journalctl -xe | grep -i vfio
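Because the relevant messages are spread over several logs, it can help to run one filter over each of them. The helper below is only a convenience sketch; its name, passthrough_log_lines, and the keyword list are assumptions that you may want to extend.

```bash
#!/usr/bin/env bash
# Print lines mentioning vfio, iommu, or amdgpu (case-insensitive).
# Reads the file given as $1, or stdin when no argument is passed.
passthrough_log_lines() {
    if [ "$#" -gt 0 ]; then
        grep -iE 'vfio|iommu|amdgpu' -- "$1"
    else
        grep -iE 'vfio|iommu|amdgpu'
    fi
}
```

It can be used as sudo dmesg | passthrough_log_lines, or pointed at the libvirt log file for the VM.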
5. Ensure Proper GPU Reset Handling
- Reset Scripts:
  Some GPUs require a reset to be properly reinitialized after VM shutdown. A reset script, run as root, might look like:
      echo 1 > /sys/bus/pci/devices/0000:xx:00.0/remove
      echo 1 > /sys/bus/pci/rescan
  Replace xx:00.0 with your GPU’s PCI address.
- Libvirt Hooks:
  Automate GPU reset by placing the script in /etc/libvirt/hooks/qemu, for example:
      #!/bin/bash
      # libvirt passes the guest name as $1 and the operation as $2
      if [ "$1" = "your-vm-name" ] && [ "$2" = "stopped" ]; then
          echo 1 > /sys/bus/pci/devices/0000:xx:00.0/remove
          echo 1 > /sys/bus/pci/rescan
      fi
  Make sure to give it execute permissions:
      sudo chmod +x /etc/libvirt/hooks/qemu
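A slightly more defensive variant of the remove-and-rescan sequence checks that the device exists before touching it, which avoids silent failures when the PCI address is mistyped. This is a sketch under assumptions: reset_pci_device is a hypothetical function name, and the optional sysfs root argument is there only so the logic can be exercised outside a real host.

```bash
#!/usr/bin/env bash
# Remove a PCI device from the bus and trigger a rescan so the kernel
# rediscovers it. Must run as root on a real system.
# addr: full PCI address such as 0000:0b:00.0 (placeholder).
# sysfs: defaults to /sys; overridable for testing.
reset_pci_device() {
    local addr="$1" sysfs="${2:-/sys}"
    local dev="${sysfs}/bus/pci/devices/${addr}"
    if [ ! -e "${dev}/remove" ]; then
        echo "reset_pci_device: no such PCI device: ${addr}" >&2
        return 1
    fi
    echo 1 > "${dev}/remove"
    echo 1 > "${sysfs}/bus/pci/rescan"
}

# Inside /etc/libvirt/hooks/qemu the call could replace the two echo lines:
#   if [ "$1" = "your-vm-name" ] && [ "$2" = "stopped" ]; then
#       reset_pci_device 0000:xx:00.0
#   fi
```

Whether remove and rescan is enough depends on the card; some AMD GPUs do not reinitialize this way, so check the host logs after the hook runs.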
6. Update AMD Drivers in the VM
- Guest OS:
  Inside the VM, update to the latest AMD drivers available for your OS. This can usually be done through package managers or by downloading from AMD’s website.
- Ensure Compatibility:
  Make sure the guest OS and its drivers are compatible with your GPU and the virtualization setup.
7. Test with a Different Kernel or GPU
- Kernel:
  Some kernel versions may have better support for GPU passthrough. Test with a newer or different kernel version.
- GPU:
  If possible, try using a different GPU to determine if the issue is specific to your current GPU model.
Commonly Used Commands and Logs
- Checking IOMMU Groups:
find /sys/kernel/iommu_groups/ -type l
- Listing PCI Devices and Their Drivers:
lspci -k
- System Logs:
sudo dmesg | grep -i vfio
sudo journalctl -xe | grep -i iommu
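The raw find output for IOMMU groups is hard to read, so a small loop that prints each device next to its group number is often more useful. The sketch below uses a hypothetical function name, list_iommu_groups, and accepts an alternative root directory only for testing.

```bash
#!/usr/bin/env bash
# Print one "group N: <pci address>" line per device under the IOMMU groups tree.
# Groups are listed in glob order, which is lexicographic (10 sorts before 2).
list_iommu_groups() {
    local root="${1:-/sys/kernel/iommu_groups}" link group
    for link in "$root"/*/devices/*; do
        # Skip the unexpanded pattern when no groups exist (IOMMU disabled).
        [ -e "$link" ] || continue
        group="${link#"$root"/}"
        group="${group%%/*}"
        echo "group ${group}: ${link##*/}"
    done
}
```

If the GPU shares a group with unrelated devices, those devices generally have to be passed through together with it or the group has to be split by other means. No output at all usually means the IOMMU is not active on the host.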
Summary
GPU passthrough with AMD cards in a virtualized environment can be challenging due to driver and hardware reset issues. By systematically verifying IOMMU and VFIO settings, ensuring correct VM configuration, handling GPU resets, and keeping drivers up to date, you can resolve or mitigate the problem of the AMD driver not loading after a VM reboot.
If you continue to experience issues, providing detailed logs and specific error messages can help in diagnosing further.