LXD VM agent fails to come online with LXD 5.15 and large VM resource requirements

I wonder if it's an issue with the CPU scheduling change in LXD 5.15:

https://discuss.linuxcontainers.org/t/lxd-5-15-has-been-released/17493#container-pinning-based-on-numa-nodes-5

Thanks for the feedback.

I start the VM from Multipass. I understand --console would be applied to lxc, so I would need to reproduce the command line that Multipass provides to lxc (and in turn be able to reproduce the issue with lxc directly).

Based on my previous exploration, it appears the first revision where the issue starts showing up is 24571 (which belongs to LXD 5.11). The corresponding version of LXD (as reported by snap) is “git-36b345f”.

I tried finding the “36b345f” commit on this web page.

https://github.com/canonical/lxd/compare/lxd-5.10...lxd-5.12


This led me to the commit on the following web page (which is part of merge pull request #11456).

https://github.com/canonical/lxd/commit/36b345fbd5789bdcde18251ab2f4898b4467b4b7?diff=split

I would need to understand how to relate these details to the CPU scheduling change in LXD 5.15 (and how this may in turn relate to revision 24571 which first exhibits the issue).

LXD revision 24571 reports that it belongs to LXD 5.11 (as I recall this came from the version reported by the LXD VM Agent).

The following release notes for LXD 5.12 mention pull request #11456.

https://github.com/canonical/lxd/releases/tag/lxd-5.12

This correlates well with the high-level LXD version where the issue is first observed (i.e. LXD >= 5.12).

If it started happening in LXD 5.11, then it won't be related to the changes in LXD 5.15.

Were you running the latest/edge channel, so that you were getting the git revisions of the snap (version strings starting with git-) with the hashes?

Normally they start with the version, e.g. 5.15-....

That particular commit is very unlikely to be the cause, but it could be a commit between the last revision you find to work and that one.

Agreed, the changes relate to LXD 5.11 + a commit destined for LXD 5.12.

I manually installed various snap revisions of LXD (e.g. using “snap refresh lxd --revision 24571”) to find the first revision which exhibited the issue. I trialled various revision numbers up and down the revision range (both at random and methodically, to identify valid revision numbers for the x86 architecture).

Once a snap revision was installed, it was possible to establish the corresponding git hash by running “snap list lxd”.
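For reference, a minimal sketch of this loop for a single revision (the revision number shown is just the one from this thread; other x86 revisions were substituted in the same way):

sudo snap refresh lxd --revision 24571   # install a specific snap revision
snap list lxd                            # reports the version string (and git hash) plus the revision in use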

Agreed, that particular commit looks to be at the end of a series of commits associated with revision 24571. The commit range can be viewed as follows.

https://github.com/canonical/lxd/compare/c5795a891064e6ece19aa28f84167d60f57f3399...36b345fbd5789bdcde18251ab2f4898b4467b4b7?diff=split

I have now cloned the LXD git repository and methodically worked through building a series of these commits (by using “git checkout -b [shortened_hash] [hash]” along with “make deps && make”).

After adjusting a couple of paths the new LXD executable was available for use by Multipass.

export PATH="${GO_BASE}/bin:${PATH}"
export LD_LIBRARY_PATH="${GO_BASE}/deps/dqlite/.libs/:${GO_BASE}/deps/raft/.libs/:${LD_LIBRARY_PATH}"
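For reference, the per-commit build loop looked roughly like this (a sketch; the hash shown is the end of the range above, and other commits were checked out in the same way):

git clone https://github.com/canonical/lxd.git && cd lxd
git checkout -b 36b345f 36b345fbd5789bdcde18251ab2f4898b4467b4b7   # or any other commit under test
make deps && make                                                  # builds the dqlite/raft dependencies, then lxd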

After methodically testing various commits, the issue was further narrowed down to the following commit range.

https://github.com/canonical/lxd/compare/cb76be6ea598b3ff480772e126635d22025c5c81...acc89e6202a049f7dc9c41f7ac3c4d090c22ae5b?diff=split

Further testing indicates that the following commit is responsible for the issue.

https://github.com/canonical/lxd/commit/338beef6a75f4472e4c0235ac71fa3143aac4308?diff=split
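As an aside, git bisect could be used to automate this manual narrowing between the two endpoints above; a rough sketch (the build/test step is whatever reproduces the VM agent failure):

git bisect start
git bisect bad  acc89e6202a049f7dc9c41f7ac3c4d090c22ae5b   # end of the range (exhibits the issue)
git bisect good cb76be6ea598b3ff480772e126635d22025c5c81   # start of the range (does not)
# at each step: make deps && make, retest the VM launch, then mark the commit
git bisect good     # or: git bisect bad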

There are a couple of issues with the testing approach I have outlined. I'll try to make adjustments to conform to the advice in the following article, which I am thinking should help with robustly testing various commits from the LXD repo.

I have now confirmed that the revisions and commits detailed earlier in this ticket are accurate (within their individual contexts).

The overall issue is first observed with LXD snap revision 24571 (i.e. the previous revision, 24561, does not exhibit the issue).

Revision 24571 consists of many components, including lxd, lxc, lxd-agent, qemu, etc. The initial focus was on changes to the lxd codebase; the search has now been widened to include all components built via the following high-level repo.

https://github.com/canonical/lxd-pkg-snap

This has narrowed the search down to the following single commit (this also ties in precisely with the build timing of revision 24571).

https://github.com/canonical/lxd-pkg-snap/commit/dadf771712cd09d0548bd1e4dfd0080626b9c4c5

Specifically, the introduction of the following line to the edk2 build looks to be responsible.

-DSMM_REQUIRE=TRUE
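For context, SMM_REQUIRE is a standard OVMF (edk2) build define; it is normally passed to edk2's build tool in an invocation of roughly this shape (illustrative only, not the exact line from the lxd-pkg-snap recipe):

# firmware built with SMM enforced (the line in question)
build -a X64 -t GCC5 -p OvmfPkg/OvmfPkgX64.dsc -DSECURE_BOOT_ENABLE=TRUE -DSMM_REQUIRE=TRUE
# workaround build, dropping the SMM requirement
build -a X64 -t GCC5 -p OvmfPkg/OvmfPkgX64.dsc -DSECURE_BOOT_ENABLE=TRUE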

Building the latest full LXD snap without this line has shown that VMs (with high resource requirements) can launch successfully.

This appears to indicate that we now have a workaround of sorts (i.e. manually building the snap at the latest version, without this line).

The next step appears to be understanding the significance of this line (and how it relates to our environment), with the aim of allowing the latest existing snap versions to successfully launch VMs with high resource requirements.

Are you able to provide feedback on these observations?


Thanks for tracking that issue down!

Unfortunately the git commit message doesn’t explain why that option was added.

@amikhalitsyn I know you're working on the edk2 firmware build at the moment as part of https://github.com/canonical/lxd-pkg-snap/pull/139

I wonder if you could also look at this option and see if it can be removed/changed to accommodate large VMs.

@mp-des did you ever get any console output from the problem VM to show where it was crashing?

You can use lxc start <instance> --console to get the early boot output.

Thanks, this is the extent of the early boot output from the console of a problem VM.

BdsDxe: loading Boot000C "ubuntu" from HD(15,GPT,90D535EA-615B-47A0-9F86-3ED6DC824E84,0x2800,0x35000)/\EFI\ubuntu\shimx64.efi
BdsDxe: starting Boot000C "ubuntu" from HD(15,GPT,90D535EA-615B-47A0-9F86-3ED6DC824E84,0x2800,0x35000)/\EFI\ubuntu\shimx64.efi

At this point the overall qemu-system-x86_64 process continues to sit at around 104.6% CPU usage.

This consists of a single running thread which consumes 100% CPU. This thread is using kvm-vcpu:0 and is performing a KVM_RUN.

A further 19 sleeping threads make up the remaining 4.6% CPU usage.

The following files are open at this point.

OVMF_CODE.4MB.fd
OVMF_VARS.4MB.fd
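For reference, these observations were gathered with standard tooling along these lines (a sketch; not necessarily the exact commands):

pid=$(pgrep -f qemu-system-x86_64)
top -H -p "$pid"                                 # per-thread CPU usage; shows the busy kvm-vcpu:0 thread
sudo ls -l /proc/"$pid"/fd                       # open files, including the OVMF firmware images
sudo strace -p <busy_thread_id> -e trace=ioctl   # shows the vCPU thread looping on KVM_RUN ioctls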

@mp-des Dear Matthew,
first of all, thanks for your investigation! This is invaluable help to us.

I want to clarify a few things:

  • do I understand correctly that you are using nested virtualization? (i.e. you have an L1 VM from Multipass and then run an LXD VM inside it)
  • which processor do you have on the machine? Is it AMD or Intel? (I'm asking because the KVM implementation differs between AMD (SVM) and Intel (VMX).) If possible, please give the full model name.
  • is it possible for you to run an experiment without nested virtualization on the same machine where you see the issue? (I mean running LXD on the host instead of inside a guest OS.)

SMM support in QEMU/KVM is a relatively new feature that allows SMM (System Management Mode) to be virtualized inside the KVM virtual machine.

I believe that Stéphane enabled it because SMM is used to prevent the Secure Boot mechanisms from being bypassed. (SMM is still not strictly required to support Secure Boot, though.)
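For background, firmware built with SMM_REQUIRE expects the VMM to emulate SMM; with QEMU that is normally driven by options along these lines (illustrative only, not LXD's exact command line):

qemu-system-x86_64 -machine q35,smm=on \
  -global driver=cfi.pflash01,property=secure,value=on \
  -drive if=pflash,format=raw,unit=0,readonly=on,file=OVMF_CODE.4MB.fd \
  -drive if=pflash,format=raw,unit=1,file=OVMF_VARS.4MB.fd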

I’ve briefly checked kernel 5.15 and it lacks some fixes for KVM:

d953540430c5af57f5de97ea9e36253908204027 KVM: nVMX: Unconditionally purge queued/injected events on nested "exit"
764643a6be07445308e492a528197044c801b3ba KVM: nVMX: Snapshot pre-VM-Enter DEBUGCTL for !nested_run_pending case

And I don't think that this is a full list.
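One way to check whether a given fix is present in a kernel is to ask git which upstream tags contain it (a sketch, assuming a local clone of the kernel tree; stable backports will have different hashes):

git tag --contains d953540430c5af57f5de97ea9e36253908204027 | head -n 1
git tag --contains 764643a6be07445308e492a528197044c801b3ba | head -n 1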

@mp-des My suggestion is to install the 5.19-hwe kernel and try to reproduce the issue with the same setup as before.

The issue occurs while Multipass is attempting to instantiate a VM by communicating with lxd/lxc/qemu (all components run directly on the guest OS, i.e. no nested virtualization).

We are testing on four independent servers (two have AMD processors, while the other two have Intel).

The following are the CPU model names across these four servers:

  • AMD EPYC 9274F 24-Core Processor
  • Intel(R) Xeon(R) Gold 5215 CPU @ 2.50GHz

We'll try updating our Ubuntu 20.04 installations to use the HWE kernel with the following command, followed by a reboot.

sudo DEBIAN_FRONTEND=noninteractive apt-get install -y linux-image-generic-hwe-20.04/focal-updates

Ah, if you are using Ubuntu 20.04 (focal), then the HWE kernel is 5.15. That's too old.
OK, I can suggest trying a mainline kernel build (https://wiki.ubuntu.com/Kernel/MainlineBuilds):
https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.1.50/

You just need to:

wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.1.50/amd64/linux-headers-6.1.50-060150-generic_6.1.50-060150.202308301548_amd64.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.1.50/amd64/linux-headers-6.1.50-060150_6.1.50-060150.202308301548_all.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.1.50/amd64/linux-image-unsigned-6.1.50-060150-generic_6.1.50-060150.202308301548_amd64.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.1.50/amd64/linux-modules-6.1.50-060150-generic_6.1.50-060150.202308301548_amd64.deb
dpkg -i linux-headers-6.1.50* linux-image-unsigned-6.1.50-060150-generic_6.1.50-060150.202308301548_amd64.deb linux-modules-6.1.50*

and then test. You will be able to boot back into the old kernel (5.15) without any problems later on.

The issue occurs while Multipass is attempting to instantiate a VM by communicating with lxd/lxc/qemu (all components run directly on the guest OS, i.e. no nested virtualization).

Ah, sorry. I thought that you had a Multipass instance, ran LXD inside it, and then ran another VM inside that. Sorry.

Thanks, we installed the 6.1.50 mainline kernel build and have observed the same behaviour as with the 5.15 kernel.

OK, thanks for checking! Just to clarify, you installed 6.1.50 on the host (a physical machine, not a VM), then rebooted into the new kernel and tried to create an L1 (not nested) VM using MAAS, right?

So then most likely we have no choice and SMM has to be disabled in further builds.
The thing I can't understand is why this behaviour is not reproducible in my (and many other people's) environments. Maybe MAAS uses some weird instance configuration? Could you show the output of lxc config show <your_instance_name> -e?

Attempting with a smaller CPU count of 12 works with LXD 5.15 when requesting specific combinations of lower resource counts (i.e. 12 CPUs, 40GB of memory, 30GB disk).

Probably that's why we can't reproduce it.
Unfortunately I don't have a server with 1 TiB of RAM at home (-:

Attempting with a smaller CPU count of 12 works with LXD 5.15 when requesting specific combinations of lower resource counts (i.e. 12 CPUs, 40GB of memory, 30GB disk).

Attempting the same combination with 13 CPUs fails with the same LXD VM agent issue.

This possibly means a minimum of 13 CPUs and 40GB of memory is needed to reproduce the issue.

We also observe that, for a fixed combination of resources known to fail, the VM does sometimes boot fully. We use a consistent approach to reproducing the issue in both cases.
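For anyone trying to reproduce this with lxc directly (outside Multipass), a sketch of the kind of instance involved (the image alias and instance name are placeholders):

lxc init ubuntu:22.04 bigvm --vm -c limits.cpu=13 -c limits.memory=40GiB
lxc config device override bigvm root size=30GiB
lxc start bigvm --console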


We may be able to use the partner cloud to test this. Please can you chat with @sdeziel1 about this?