Ubuntu 22.04 Kernel Update Impact on Azure CVMs

Update: A new linux-image package was respun containing the workaround. Users should just need to run apt update; apt upgrade linux-image-6.8.0-1014-azure-fde to receive the changes if they are still running 6.8 on CVMs. Unattended upgrades will pick it up automatically where enabled.

Original Post: On September 12th, we promoted the azure-tuned kernel from 6.5 to 6.8 for Ubuntu 22.04. Subsequently, we discovered that Confidential VM (CVM) instances began experiencing kernel panics post-reboot.

This problem affects Jammy CVM instances, both with and without Full Disk Encryption (FDE). We believe other Ubuntu releases and non-CVM instances are not impacted.

We have removed the affected packages from the repositories while we investigate and develop a fix. Our team is actively working on resolving this issue, exploring potential issues with the EFI stub produced at build-time and related dependencies.

The affected 6.8 kernel was available in the Jammy repositories from September 12th to 16th. If you have a CVM instance that installed these packages during this period, either through automation or manually, it may fail on reboot unless addressed.

If you’re running a CVM instance with Ubuntu 22.04, we recommend the following:

  1. To check if you have the affected kernel version installed, run:

    if apt list --installed 2>/dev/null | grep -q "linux-image-6.8.0-1014-azure-fde"; then
      echo "Problematic update present"
    else
      echo "Not affected"
    fi

  2. If the package is installed, do not reboot your instance.

  3. To revert to the earlier version of the kernel, run:

    apt purge -y linux-image-6.8.0-1014-azure-fde
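
The check above matches only the -fde package name, while replies further down this thread also refer to linux-image-6.8.0-1014-azure. As a rough sketch (not an official Canonical script), the same check can be written against dpkg-query, which avoids apt's warning about its unstable CLI. The function name list_affected_images is hypothetical, and the sketch assumes input lines in the `${db:Status-Abbrev} ${Package}` format, so that only fully installed packages (status `ii`) are reported.

```shell
#!/bin/sh
# Hypothetical helper: reads "<status> <package>" lines on stdin and prints
# the 6.8.0-1014 Azure kernel image packages that are fully installed ("ii").
list_affected_images() {
    awk '$1 == "ii" && $2 ~ /^linux-image-6\.8\.0-1014-azure(-fde)?$/ { print $2 }'
}

# Usage on a live system (assumes a Debian/Ubuntu host with dpkg-query):
#   dpkg-query -W -f='${db:Status-Abbrev} ${Package}\n' | list_affected_images
# Any output means a 6.8.0-1014 Azure kernel image is installed.
```

If the function prints nothing, neither package name is installed; whether a non-CVM instance should act on a match is discussed in the replies below.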
    
    

Please note that, if you have already rebooted and encountered issues, recovery for CVM instances is complex because of their use of nullboot instead of grub. If you have a support relationship with either Canonical or Microsoft, please contact your support representative for assistance.

Once a corrected version of the kernel package is available, we will make it accessible through the normal update process and provide an update on this thread.

We appreciate your patience and understanding as we work to resolve this matter. If you have any questions or concerns, please don’t hesitate to reach out.

:wave: For what it’s worth, I believe the revert impacts the non-fde Azure cloud kernel in general. While the 6.8 kernel appears to function without issue on these nodes, the ‘hv-kvp-daemon.service’ fails with ‘WARNING: hv_kvp_daemon not found for kernel 6.8.0-1014’.

I’m seeing nodes that had previously picked up the ‘linux-image-6.8.0-1014-azure’ update now having kernel 6.5 and 6.8 installed while remaining booted into 6.8. It would have been nice if the downgrade back to kernel 6.5 also prompted grub to change the default kernel to 6.5 as well. I confirmed purging the ‘linux-image-6.8.0-1014-azure’ package and rebooting gets the node back to kernel 6.5 and subsequently, hv-kvp-daemon.service works as expected.

You are correct that this impacts non-fde/non-cvm instances as well. linux-azure-fde was originally a separate kernel to accommodate the changes required for CVMs to function as intended, but they were merged after 5.15. The kernel metas are shared as a result, so rolling back changed linux-azure to 6.5 again as well.

This puts things in a bit of an awkward place, as the current 22.04 serials are still live with 6.8, and 6.5 instances will still bring in 6.8 if the nvidia packages are installed. FWIW, the actual packages themselves were not removed from the archive; the metas were just repointed towards the previous dependencies, so all 6.8 packages should still be installable directly (or by using linux-azure-edge for now).

I would only recommend purging 6.8 on CVM images specifically. If you have grub at all (CVMs use nullboot instead of grub), then staying on 6.8 would be the best current approach. The hv-kvp-daemon failure is likely the result of a non-matching linux-cloud-tools-azure version.

Thanks, yeah I’ve been investigating further and everything you’ve said appears correct. I believe my issue stems from ‘apt-get dist-upgrade’ still “finding” the 6.8 packages to install, i.e. linux-image-6.8.0-1014-azure, linux-modules-6.8.0-1014-azure, linux-modules-nvidia-550-server-6.8.0-1014-azure, linux-objects-nvidia-550-server-6.8.0-1014-azure and linux-signatures-nvidia-6.8.0-1014-azure, but not including linux-cloud-tools-6.8.0-1014-azure. Instead of purging linux-image-6.8.0-1014-azure, I can get hv-kvp-daemon.service happy again by installing linux-cloud-tools-6.8.0-1014-azure explicitly.
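
For anyone wanting to script that check, here is a minimal sketch. It assumes the cloud-tools package name is always `linux-cloud-tools-` followed by the running kernel release, which matches the names quoted above (6.8.0-1014-azure) but is otherwise an assumption. The function name cloud_tools_pkg_for is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: derive the cloud-tools package name that matches a
# given kernel release string (the kind of value `uname -r` prints).
cloud_tools_pkg_for() {
    printf 'linux-cloud-tools-%s\n' "$1"
}

# Usage on a live node (assumes a Debian/Ubuntu host):
#   pkg=$(cloud_tools_pkg_for "$(uname -r)")
#   dpkg -s "$pkg" >/dev/null 2>&1 || echo "missing: $pkg"
# A "missing" line would point to the mismatch described above; installing
# the named package is what restored hv-kvp-daemon.service in my case.
```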

Thanks for all of the advice!

It appears to corrupt every backup along the way as well. I experienced this issue on 9/13 at a reboot (but didn’t understand the cause at the time) and I found that none of the backups taken after 9/12 worked. I restored from the 9/12 backup and kept going. What I did not realize was that the cause was this problematic package, so every backup of the new VM is tainted and about 10 days of work appear to be gone.

This is an Azure VM

How I Recovered My Azure VM Without Using a Backup

I was able to get my Azure VM working without resorting to a backup. Here’s a simplified guide based on this bug report explaining the issue and how to fix it:

Summary: CVM instances use nullboot, not grub, so recovering from this issue involves mounting the OS disk on another VM, removing the problematic Jammy EFI stub, and triggering the fallback to the existing 6.5 kernel. After that, you can purge the 6.8 kernel.

Steps to Fix the Issue:

  1. Use a Working Backup or Accessible Linux VM:
  • Swap out the OS disk with a known working backup (from before 9/12) or attach the broken OS disk to any accessible Linux VM as an extra drive.
  • Identify the attached drive using the command:
    lsblk
  • In my case, sda was the temporary working drive and sdc was the original (non-working) drive.
  2. Create a Mount Point:
    sudo mkdir /mnt/os_efi
  3. Mount the Non-Working Drive’s Root Directory:
    sudo mount /dev/sdc1 /mnt/os_efi
  4. Navigate to the EFI Folder:
    cd /mnt/os_efi/EFI/ubuntu
  5. List the Contents:
  • Run ls to see the contents. You should find the problematic kernel file:
    kernel.efi-6.8.0-1014-azure
  6. Remove the Problematic Kernel File:
    sudo rm -f kernel.efi-6.8.0-1014-azure
  7. Unmount the Drive:
  • Leave the mounted directory first (for example with cd /), otherwise umount reports that the target is busy.
    sudo umount /mnt/os_efi
  8. Swap the OS Disk Back to the Original VM:
  • Shut down the VM, reattach the original OS disk, and boot it. Your VM should now boot into the system using the fallback 6.5 kernel.
  9. Purge Problematic Kernel Packages:
  • The patterns are quoted so the shell passes them to apt unexpanded.
    sudo apt purge 'linux-image-6.8*' 'linux-headers-6.8*'
  10. Reboot the VM:
  • Confirm that everything is working by rebooting the system.
  11. Back Up Your VM:
  • After confirming the fix, immediately take a backup of your VM and restore it to ensure your backup system is functioning properly.
  12. Consider Caution with Future Updates:
  • Be skeptical of Canonical packages moving forward and work through the trauma of this situation with your healthcare and/or spiritual advisors.
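
Before removing the kernel file, it may be worth confirming that another kernel.efi-* file is present for the fallback to use. The sketch below is my own addition, not part of the bug report. The function name has_fallback_stub is hypothetical, and it assumes every bootable kernel in that folder follows the kernel.efi-<version> naming seen above.

```shell
#!/bin/sh
# Hypothetical helper: checks that removing one kernel.efi-* file would still
# leave at least one other kernel.efi-* file in the same directory.
#   $1: directory holding the kernel.efi-* files (e.g. /mnt/os_efi/EFI/ubuntu)
#   $2: file name that is about to be removed
# Returns 0 if a fallback exists, 1 if the target is the only stub,
# and 2 if the target file is not there at all.
has_fallback_stub() {
    dir=$1
    target=$2
    [ -e "$dir/$target" ] || return 2
    for f in "$dir"/kernel.efi-*; do
        [ -e "$f" ] || continue                      # glob matched nothing
        [ "$(basename "$f")" = "$target" ] || return 0
    done
    return 1
}

# Usage with the paths from the steps above:
#   has_fallback_stub /mnt/os_efi/EFI/ubuntu kernel.efi-6.8.0-1014-azure \
#       && sudo rm -f /mnt/os_efi/EFI/ubuntu/kernel.efi-6.8.0-1014-azure
```

If the function returns 1, the 6.8 file is the only one in the folder and deleting it would probably leave nothing to fall back to, so stop and contact support instead.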