LXD VM agent fails to come online with LXD 5.15 and large VM resource requirements

I am able to reproduce an issue I am experiencing with spinning up the daily Ubuntu 22.04 image in a VM (using multipass 1.12.2 and LXD 5.15 installed via snap revision 25086 on Ubuntu 20.04.6 with a 5.15.78 kernel).

I receive the following error message when attempting to shell into the VM.

Error: LXD VM agent isn’t currently running

Running top shows qemu-system-x86 constantly running at 106.2% (and this never resolves).

The VM instance is created via Multipass (connected to LXD) and includes a bridge network along with larger settings for CPU, memory and disk (i.e. 16 to 66 CPUs, 66GB to 736GB of memory, and a 1300GB disk).

Running with LXD 5.0.2 and LXD 5.11 works fine.
Running with LXD 5.15 (i.e. LXD >= 5.12) fails to bring the LXD VM agent online.

lxc info reports -1 processes (consistent with being unable to connect to the LXD VM agent).

Attempting a smaller CPU count of 12 works with LXD 5.15 when requesting specific combinations of lower resource counts (i.e. 12 CPUs, 40GB of memory, a 30GB disk).

Attempting the same combination with 13 CPUs fails with the same LXD VM agent issue.

I have included some of the standard logging outputs below.

  • lxc info --project <instance_project> --show-log

Name: gaeam01-w4
Status: RUNNING
Type: virtual-machine
Architecture: x86_64
PID: 22664
Created: 2023/08/21 18:59 AEST
Last Used: 2023/08/21 18:59 AEST

Resources:
  Processes: -1
  Network usage:
    eth0:
      Type: broadcast
      State: UP
      Host interface: tap84d93e33
      MAC address: 52:54:00:61:0d:5e
      MTU: 1500
      Bytes received: 1.11kB
      Bytes sent: 164B
      Packets received: 9
      Packets sent: 2
      IP addresses:
        inet6: fd42:76bf:550c:7fee:5054:ff:fe61:d5e/64 (global)
    eth1:
      Type: broadcast
      State: UP
      Host interface: tapb3238b31
      MAC address: 52:54:00:54:db:7a
      MTU: 1500
      Bytes received: 8.68kB
      Bytes sent: 164B
      Packets received: 138
      Packets sent: 2
      IP addresses:

Log:

  • sudo lxd.buginfo

[ 8945.831422] perf: interrupt took too long (2503 > 2500), lowering kernel.perf_event_max_sample_rate to 79750

Are you able to advise the best next steps for debugging and resolving this issue?

Do you see the same issue using the latest/candidate channel (LXD 5.16)?

Yes, I see the same issue using the latest/candidate channel (LXD 5.16 - snap revision 25353).

I’ve performed a few more specific tests and established the following.

LXD snap revision 24561 works as expected.
LXD snap revision 24571 is the first version which exhibits the issue.

Can you start the VM with --console and see if it outputs anything useful (a kernel crash perhaps)?

I wonder if it's an issue with the CPU scheduling change in LXD 5.15:

https://discuss.linuxcontainers.org/t/lxd-5-15-has-been-released/17493#container-pinning-based-on-numa-nodes-5

Thanks for the feedback.

I start the VM from Multipass. I understand --console would be applied via lxc. I would need to reproduce the command line that Multipass provides to lxc (and in turn be able to reproduce the issue directly with lxc).
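
For reference, a minimal sketch of how a similarly sized VM could be created directly with lxc so the console can be captured (the instance name and resource values below are placeholders and won't exactly match the configuration Multipass generates):

# create a VM with comparable resource limits directly via lxc
lxc init ubuntu-daily:22.04 repro-vm --vm -c limits.cpu=16 -c limits.memory=66GiB
# grow the root disk to match the failing configuration
lxc config device override repro-vm root size=1300GiB
# start the VM with the console attached to capture early boot output
lxc start repro-vm --console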

Based on my previous exploration, it appears the first revision where the issue starts showing up is 24571 (which belongs to LXD 5.11). The corresponding version of LXD (as reported by snap) is “git-36b345f”.

I tried finding the “36b345f” commit on this web page.

https://github.com/canonical/lxd/compare/lxd-5.10...lxd-5.12


This led me to the commit on the following web page (which is part of merge pull request #11456).

https://github.com/canonical/lxd/commit/36b345fbd5789bdcde18251ab2f4898b4467b4b7?diff=split

I would need to understand how to relate these details to the CPU scheduling change in LXD 5.15 (and how this may in turn relate to revision 24571 which first exhibits the issue).

LXD revision 24571 reports that it belongs to LXD 5.11 (as I recall this came from the version reported by the LXD VM Agent).

The following release notes for LXD 5.12 include a mention of pull request #11456.

https://github.com/canonical/lxd/releases/tag/lxd-5.12

This correlates well with the LXD release in which the issue is first observed (i.e. LXD >= 5.12).

If it started happening in LXD 5.11 then it won’t be to do with the changes in LXD 5.15.

Were you running the latest/edge channel, so that you were getting git revisions of the snap (versions starting with git- followed by a hash)?

Normally they start with the version, e.g. 5.15-....

That particular commit is very unlikely to be the cause, but it could be a commit between the last revision you find to work and that one.

Agreed, the changes relate to LXD 5.11 + a commit destined for LXD 5.12.

I manually installed various snap revisions of LXD (e.g. using "snap refresh lxd --revision 24571") to find the first revision which exhibited the issue. I trialled various revision numbers up and down the revision range (both randomly and methodically, to identify valid revision numbers for the x86 architecture).

Once a snap revision was installed it was possible to establish the corresponding git hash by running “snap list lxd”.
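
For reference, each iteration of that manual bisection looked roughly like the following (the revision number is just one example from this thread, and the instance/project names are placeholders):

# install the snap revision under test
sudo snap refresh lxd --revision 24571
# record the upstream version / git hash that revision was built from
snap list lxd
# relaunch the large VM via Multipass, then check whether the agent came up
lxc info <instance> --project <instance_project> | grep Processes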

Agreed, that particular commit looks to be at the end of a series of commits associated with revision 24571. The commit range can be viewed as follows.

https://github.com/canonical/lxd/compare/c5795a891064e6ece19aa28f84167d60f57f3399...36b345fbd5789bdcde18251ab2f4898b4467b4b7?diff=split

I have now cloned the LXD git repository and methodically worked through building a series of these commits (by using “git checkout -b [shortened_hash] [hash]” along with “make deps && make”).

After adjusting a couple of paths, the new LXD executable was available for use by Multipass.

export PATH="${GO_BASE}/bin:${PATH}"
export LD_LIBRARY_PATH="${GO_BASE}/deps/dqlite/.libs/:${GO_BASE}/deps/raft/.libs/:${LD_LIBRARY_PATH}"
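
As an aside, git bisect could drive this same search more systematically; a sketch using the boundary commits from the compare link above (each step still requires rebuilding and re-attempting the VM launch via Multipass):

git bisect start
git bisect bad 36b345fbd5789bdcde18251ab2f4898b4467b4b7
git bisect good c5795a891064e6ece19aa28f84167d60f57f3399
# at each commit git suggests: rebuild, re-test, then record the result
make deps && make
git bisect good   # or: git bisect bad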

After methodically testing various commits, the issue was further narrowed down to the following commit range.

https://github.com/canonical/lxd/compare/cb76be6ea598b3ff480772e126635d22025c5c81...acc89e6202a049f7dc9c41f7ac3c4d090c22ae5b?diff=split

Further testing indicates that the following commit is responsible for the issue.

https://github.com/canonical/lxd/commit/338beef6a75f4472e4c0235ac71fa3143aac4308?diff=split

There are a couple of issues with the testing approach I have outlined. I'll try to make adjustments to conform with the advice in the following article, which I expect should help with robustly testing various commits from the LXD repo.

I have now confirmed that the revisions and commits detailed earlier in this ticket are accurate (within their individual contexts).

The overall issue is first observed with LXD snap revision 24571 (i.e. the previous revision, 24561, does not exhibit this issue).

Revision 24571 consists of many components, including lxd, lxc, lxd-agent, qemu, etc. The initial focus was on changes to the lxd codebase. The search has therefore now been widened to include all components built via the following high-level repo.

https://github.com/canonical/lxd-pkg-snap
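
One way to see which bundled components actually changed between the last working and first broken snap revisions is to download and unpack both and diff their contents (a sketch; the basenames and directory names are arbitrary):

# fetch both snap revisions side by side
snap download lxd --revision=24561 --basename=lxd-24561
snap download lxd --revision=24571 --basename=lxd-24571
# unpack and compare
unsquashfs -d lxd-24561-fs lxd-24561.snap
unsquashfs -d lxd-24571-fs lxd-24571.snap
diff -rq lxd-24561-fs lxd-24571-fs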

Examining the lxd-pkg-snap history has narrowed the search down to the following single commit (which also ties in precisely with the build timing of revision 24571).

https://github.com/canonical/lxd-pkg-snap/commit/dadf771712cd09d0548bd1e4dfd0080626b9c4c5

Specifically, the introduction of the following line to the building of edk2 looks to be responsible.

-DSMM_REQUIRE=TRUE

Building the latest full LXD snap without this line has shown that VMs (with high resource requirements) can launch successfully.

This appears to give us a workaround of sorts (i.e. manually building the snap at the latest version, without this line).
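
A rough sketch of that workaround, assuming the flag appears literally in the repo's build files and a local snapcraft build environment is available:

# clone the snap packaging repo and drop the SMM requirement from the edk2 build
git clone https://github.com/canonical/lxd-pkg-snap
cd lxd-pkg-snap
grep -rl 'SMM_REQUIRE=TRUE' . | xargs sed -i 's/ -DSMM_REQUIRE=TRUE//'
# rebuild the snap and install it locally
snapcraft
sudo snap install ./lxd_*.snap --dangerous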

The next step appears to be understanding the significance of this line (and how it relates to our environment), with the aim of allowing the latest existing snap versions to successfully launch VMs with high resource requirements.

Are you able to provide feedback on these observations?


Thanks for tracking that issue down!

Unfortunately the git commit message doesn’t explain why that option was added.

@amikhalitsyn I know you're working on the edk2 firmware build at the moment as part of https://github.com/canonical/lxd-pkg-snap/pull/139

I wonder if you could also look at this option and see if it can be removed/changed to accommodate large VMs.

@mp-des did you ever get any console output from the problem VM to show where it was crashing?

You can use lxc start <instance> --console to get the early boot output.

Thanks, this is the extent of the early boot output from the console of a problem VM.

BdsDxe: loading Boot000C "ubuntu" from HD(15,GPT,90D535EA-615B-47A0-9F86-3ED6DC824E84,0x2800,0x35000)/\EFI\ubuntu\shimx64.efi
BdsDxe: starting Boot000C "ubuntu" from HD(15,GPT,90D535EA-615B-47A0-9F86-3ED6DC824E84,0x2800,0x35000)/\EFI\ubuntu\shimx64.efi

At this point the overall qemu-system-x86_64 process continues to sit at around 104.6% CPU usage.

This consists of a single running thread which consumes 100% CPU. This thread is named kvm-vcpu:0 and is performing a KVM_RUN.

A further 19 sleeping threads make up the remaining 4.6% CPU usage.

The following files are open at this point.

OVMF_CODE.4MB.fd
OVMF_VARS.4MB.fd
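
For reference, this kind of per-thread and open-file inspection can be reproduced roughly as follows (a sketch; it assumes a single qemu-system-x86_64 process on the host):

# per-thread CPU view of the qemu process (shows the busy kvm-vcpu:0 thread)
top -H -p $(pgrep -f qemu-system-x86_64)
# thread names straight from /proc
grep . /proc/$(pgrep -f qemu-system-x86_64)/task/*/comm
# open files, including the OVMF firmware images
sudo ls -l /proc/$(pgrep -f qemu-system-x86_64)/fd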

@mp-des Dear Matthew,
first of all thanks for your investigation! This is invaluable help for us.

I want to clarify a few things:

  • do I understand correctly that you are using nested virtualization? (i.e. you have an L1 VM from Multipass and then you run the LXD VM inside it)
  • which processor do you have on the machine? Is it AMD or Intel? (I'm asking because the KVM implementation is different between AMD (SVM) and Intel (VMX).) If possible, please give the full model name.
  • is it possible for you to run an experiment without nested virtualization on the same machine where you see the issue? (I mean running LXD on the host with your guest OS.)

SMM support in QEMU/KVM is a relatively new feature that allows SMM (System Management Mode) to be virtualized inside a KVM virtual machine.

I believe that Stéphane enabled it because SMM is used to prevent Secure Boot mechanisms from being bypassed. (SMM is still not required to support Secure Boot, though.)

I’ve briefly checked kernel 5.15 and it lacks some fixes for KVM:

d953540430c5af57f5de97ea9e36253908204027 KVM: nVMX: Unconditionally purge queued/injected events on nested "exit"
764643a6be07445308e492a528197044c801b3ba KVM: nVMX: Snapshot pre-VM-Enter DEBUGCTL for !nested_run_pending case

And I don't think that this is a full list.

@mp-des My suggestion is to install the 5.19 HWE kernel and try to reproduce the issue with it, using the same setup as before.

The issue occurs while Multipass is attempting to instantiate a VM by communicating with lxd/lxc/qemu (all components run directly on the guest OS, i.e. no nested virtualization).

We are testing on four independent servers (two servers have AMD processors, while the other two have Intel).

The following are the model names of the CPUs across these four servers:

  • AMD EPYC 9274F 24-Core Processor
  • Intel(R) Xeon(R) Gold 5215 CPU @ 2.50GHz

We'll try updating our Ubuntu 20.04 installations to use the HWE kernel with the following command, followed by a reboot.

sudo DEBIAN_FRONTEND=noninteractive apt-get install -y linux-image-generic-hwe-20.04/focal-updates

Ah, if you are using Ubuntu 20.04 (focal), then the HWE kernel is 5.15. That's too old.
OK, I can suggest trying a mainline kernel build (https://wiki.ubuntu.com/Kernel/MainlineBuilds):
https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.1.50/

You just need to:

wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.1.50/amd64/linux-headers-6.1.50-060150-generic_6.1.50-060150.202308301548_amd64.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.1.50/amd64/linux-headers-6.1.50-060150_6.1.50-060150.202308301548_all.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.1.50/amd64/linux-image-unsigned-6.1.50-060150-generic_6.1.50-060150.202308301548_amd64.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.1.50/amd64/linux-modules-6.1.50-060150-generic_6.1.50-060150.202308301548_amd64.deb
dpkg -i linux-headers-6.1.50* linux-image-unsigned-6.1.50-060150-generic_6.1.50-060150.202308301548_amd64.deb linux-modules-6.1.50*

and then test. You will be able to boot back into the old kernel (5.15) without any problems later on.

The issue occurs while Multipass is attempting to instantiate a VM by communicating with lxd/lxc/qemu (all components run directly on the guest OS, i.e. no nested virtualization).

Ah, sorry. I thought that you had a Multipass instance, then ran LXD inside it, and then ran another VM inside that one. Sorry.