Ubuntu 22.04 and 20.04 containers stuck in "System is booting" up state

perlun · October 16, 2023, 8:21am

Hi,

We have spotted a weird error with LXD and the ubuntu:20.04 and ubuntu:22.04 containers recently. The problem is that the /var/run/nologin file is “eternally present” after bootup of the containers, making it impossible to use ssh or lxc exec to access the content in the container.

When using SSH, we get the famous “System is botting up” error, because the presense of the nologin file:

$ ssh 10.34.32.179
"System is booting up. Unprivileged users are not permitted to log in yet. Please come back later. For technical details, see pam_nologin(8)."
Connection closed by 10.34.32.179 port 22

We have seen this on various versions of LXD (5.18 on Ubuntu and 5.0.2 on Debian Bookworm, my own machine). Here are the versions of the images where we’ve seen this:

Image versions used

slovdahl@desk:~$ lxc image info ubuntu:20.04
Fingerprint: 1ff1055f3820e9f25c3c49ad3a9dba2e6c585cb5de3440dfe83150ddfc4a643d
Size: 426.11MiB
Architecture: x86_64
Type: container
Public: yes
Timestamps:
    Created: 2023/10/11 00:00 UTC
    Uploaded: 2023/10/11 00:00 UTC
    Expires: 2025/05/29 00:00 UTC
    Last used: never

plundberg@delorean:~$ lxc image info ubuntu:22.04
Fingerprint: b948dd91cd5a8da89f6dcd4949d7189f064cf6d4dc5bd70b7f9b7aff1883babf
Size: 435.08MB
Architecture: x86_64
Type: container
Public: yes
Timestamps:
    Created: 2023/10/10 00:00 UTC
    Uploaded: 2023/10/10 00:00 UTC
    Expires: 2027/06/01 00:00 UTC
    Last used: never
Properties:
    os: ubuntu
    release: jammy
    version: 22.04
    architecture: amd64
    label: release
    serial: 20231010
    description: ubuntu 22.04 LTS amd64 (release) (20231010)
    type: squashfs
Aliases:
    - 22.04
    - 22.04/amd64
    - j
    - j/amd64
    - jammy
    - jammy/amd64
    - lts
    - lts/amd64
    - default
    - default/amd64
Cached: no
Auto update: disabled
Profiles: []

One of my colleagues thought that perhaps ~ubuntu-core-dev/ubuntu/+source/systemd - [no description] could be affecting this, making the startup take longer? However, I saw the /var/run/nologin being present six minutes after bootup, so if there was a timeout I would have expected it to expire by then.

`systemctl status` output

$ lxc exec sure-macaw -- systemctl status
● sure-macaw
    State: starting
     Jobs: 21 queued
   Failed: 3 units
    Since: Mon 2023-10-16 06:08:57 UTC; 4min 24s ago
   CGroup: /
           ├─.lxc 
           │ ├─904 systemctl status
           │ └─905 less
           ├─init.scope 
           │ └─1 /sbin/init
           └─system.slice 
             ├─systemd-networkd.service 
             │ └─295 /lib/systemd/systemd-networkd
             ├─systemd-udevd.service 
             │ └─94 /lib/systemd/systemd-udevd
             ├─cron.service 
             │ └─329 /usr/sbin/cron -f -P
             ├─networkd-dispatcher.service 
             │ └─335 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
             ├─snap-snapd-20092.mount 
             │ └─104 snapfuse /var/lib/snapd/snaps/snapd_20092.snap /snap/snapd/20092 -o ro,nodev,allow_other,suid
             ├─systemd-journald.service 
             │ └─54 /lib/systemd/systemd-journald
             ├─unattended-upgrades.service 
             │ └─362 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
             ├─ssh.service 
             │ └─358 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
             ├─snapd.service 
             │ └─796 /usr/lib/snapd/snapd
             ├─snapd.seeded.service 
             │ └─525 /usr/bin/snap wait system seed.loaded
             ├─rsyslog.service 
             │ └─336 /usr/sbin/rsyslogd -n -iNONE
             ├─systemd-resolved.service 
             │ └─297 /lib/systemd/systemd-resolved
             ├─dbus.service 
             │ └─331 @dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
             └─systemd-logind.service 
               └─340 /lib/systemd/systemd-logind

Any other logs/diagnostics we could provide? I’ll happily help debug this, but I think we’d need a pointer in what direction to look.

perlun · October 17, 2023, 8:31am

Hmm, are we the only ones running into this? It’s an incredibly annoying problem since it basically makes LXD more or less “useless” with these images.

I tried with the ubuntu-daily:22.04 image instead, which seems to have the same problem. Image details below.

ubuntu:18.04 works flawlessly, so this is almost certainly something image-related AFAICT.

$ lxc image info ubuntu-daily:22.04
Fingerprint: 70f2487b0dd8c9c56edb686b37f5c3317270a9d7a106164f1a4f2ada67597f94
Size: 435.11MB
Architecture: x86_64
Type: container
Public: yes
Timestamps:
    Created: 2023/10/12 00:00 UTC
    Uploaded: 2023/10/12 00:00 UTC
    Expires: 2027/06/01 00:00 UTC
    Last used: never
Properties:
    serial: 20231012
    description: ubuntu 22.04 LTS amd64 (daily) (20231012)
    type: squashfs
    os: ubuntu
    release: jammy
    version: 22.04
    architecture: amd64
    label: daily
Aliases:
    - 22.04
    - 22.04/amd64
    - j
    - j/amd64
    - jammy
    - jammy/amd64
    - lts
    - lts/amd64
    - default
    - default/amd64
Cached: no
Auto update: disabled
Profiles: []

perlun · October 17, 2023, 9:48am

Hmm, interesting… We concluded that this works on some of our machines, and fails on others. I tried the brutal approach of wiping my LXD installation completely and recreating it, but still the same problem.

Here’s some data from the syslog inside the container; I managed to get into it with lxc shell after waiting for perhaps ~2 minutes.

root@guided-mudfish:~# tail /var/log/syslog 
Oct 17 09:42:00 guided-mudfish snapd[808]: daemon.go:340: adjusting startup timeout by 30s (pessimistic estimate of 30s plus 5s per snap)
Oct 17 09:42:00 guided-mudfish snapd[808]: backends.go:58: AppArmor status: apparmor is enabled but some kernel features are missing: dbus, network
Oct 17 09:42:00 guided-mudfish systemd[1]: Started Snap Daemon.
Oct 17 09:42:00 guided-mudfish systemd[1]: snap-snapd-20092.mount: Deactivated successfully.
Oct 17 09:42:00 guided-mudfish systemd[1]: snap-snapd-20092.mount: Unit process 103 (snapfuse) remains running after unit stopped.
Oct 17 09:42:00 guided-mudfish systemd[1]: snap-snapd-20092.mount: Consumed 1.724s CPU time.
Oct 17 09:42:00 guided-mudfish systemd[1]: Reloading.
Oct 17 09:42:00 guided-mudfish systemd[1]: Cannot find unit for notify message of PID 903, ignoring.
Oct 17 09:42:00 guided-mudfish snapd[808]: handlers.go:662: Reported install problem for "snapd" as Crash report successfully submitted.
Oct 17 09:42:19 guided-mudfish systemd-networkd[296]: Could not set hostname: Method call timed out

perlun · October 18, 2023, 11:29am

Here’s an AskUbuntu thread where the poster seems to have run into similar problems as we do: https://askubuntu.com/questions/1484049/lxd-ubuntu-image-startup-issue-snap-lxd-activate-service-not-found/1489496#1489496. It seems to be related to “pre-seeded snaps” for snapd in this container. More details there in that thread, but feel free to continue here if you so prefer.

slovdahl · October 18, 2023, 12:00pm

https://bugs.launchpad.net/ubuntu/+source/snapd/+bug/1878225 and https://bugs.launchpad.net/ubuntu/+source/snapd/+bug/1806070 describe similar problems too, but unclear if the root cause is the same.

perlun · October 19, 2023, 5:52am

Cross-posting our current workaround from the linked askubuntu thread, in case it helps others. This should not be marked as the “Solution” though, since this is much more of a workaround than anything else.

Workaround

When you do apt-get purge snapd and apt-get install snapd, these snaps are not pre-seeded any more. So that’s why it “works”. But it would sure be interesting to understand more about why this happens and why only on some of our machines…

Some more details from our debugging effort

It seems clear that this is the snapd.seeded.service which is blocking the startup:

root@emerging-caiman:~# journalctl -u snapd.seeded.service
Oct 18 11:02:19 emerging-caiman systemd[1]: Starting Wait until snapd is fully seeded...

snap debug seeding gives us this:

seeded:  false
seed-error: |
  cannot perform the following tasks:
  - Start snap "lxd" (24322) services (systemctl command [start snap.lxd.activate.service] failed
  with exit status 1: Job for snap.lxd.activate.service failed because the control process exited
  with error code.
  See "systemctl status snap.lxd.activate.service" and "journalctl -xeu snap.lxd.activate.service"
  for details.
  )
preseeded:         true
image-preseeding:  4.005s
seed-completion:   –
preseed-system-key: {
  "apparmor-features": [
    "caps",
    "dbus",
    "domain",
    "file",
    "mount",
    "namespaces",
    "network",
    "network_v8",
    "policy",
    "ptrace",
    "query",
    "rlimit",
    "signal"
  ],
  "apparmor-parser-features": [
    "cap-audit-read",
    "cap-bpf",
    "include-if-exists",
    "mqueue",
    "qipcrtr-socket",
    "snapd-internal",
    "unsafe",
    "userns",
    "xdp"
  ],
  "apparmor-parser-mtime": 1692983915,
  "build-id": "55447a37514c4a317439786251326b5f762d31392f6f6b7835704d635279724779346e4d6e78414a6a2f6d647a5247354a536e6e38616e6c31636c5954612f38496435624e72466c744770475332794967704c",
  "cgroup-version": "2",
  "nfs-home": false,
  "overlay-root": "",
  "seccomp-compiler-version": "0a51bc642597bb018aeaaeea931b5cf033bb47d9 2.5.4 c3c9b282ef3c8dfcc3124b2aeaef62f56b813bfd21f8806b30a6c9dbc2e6e58d bpf-actlog",
  "seccomp-features": [
    "allow",
    "errno",
    "kill_process",
    "kill_thread",
    "log",
    "trace",
    "trap",
    "user_notif"
  ],
  "version": 10
}
seed-restart-system-key: {
  "apparmor-features": [
    "caps",
    "domain",
    "file",
    "mount",
    "namespaces",
    "network_v8",
    "policy",
    "ptrace",
    "query",
    "rlimit",
    "signal"
  ],
  "apparmor-parser-features": [
    "cap-audit-read",
    "cap-bpf",
    "include-if-exists",
    "mqueue",
    "qipcrtr-socket",
    "snapd-internal",
    "unsafe",
    "userns",
    "xdp"
  ],
  "apparmor-parser-mtime": 1692983915,
  "build-id": "55447a37514c4a317439786251326b5f762d31392f6f6b7835704d635279724779346e4d6e78414a6a2f6d647a5247354a536e6e38616e6c31636c5954612f38496435624e72466c744770475332794967704c",
  "cgroup-version": "2",
  "nfs-home": false,
  "overlay-root": "",
  "seccomp-compiler-version": "0a51bc642597bb018aeaaeea931b5cf033bb47d9 2.5.4 c3c9b282ef3c8dfcc3124b2aeaef62f56b813bfd21f8806b30a6c9dbc2e6e58d bpf-actlog",
  "seccomp-features": [
    "allow",
    "errno",
    "kill_process",
    "kill_thread",
    "log",
    "trace",
    "trap",
    "user_notif"
  ],
  "version": 10
}

`systemctl status snap.lxd.activate.service`

root@settling-monster:~# systemctl status snap.lxd.activate.service
× snap.lxd.activate.service
     Loaded: not-found (Reason: Unit snap.lxd.activate.service not found.)
     Active: failed (Result: exit-code) since Wed 2023-10-18 11:10:52 UTC; 1min 2s ago
   Main PID: 608 (code=exited, status=1/FAILURE)
        CPU: 36ms

Oct 18 11:10:52 settling-monster systemd[1]: Starting Service for snap application lxd.activate...
Oct 18 11:10:52 settling-monster systemd[1]: snap.lxd.activate.service: Main process exited, code=exited, status=1/FAILURE
Oct 18 11:10:52 settling-monster systemd[1]: snap.lxd.activate.service: Failed with result 'exit-code'.
Oct 18 11:10:52 settling-monster systemd[1]: Failed to start Service for snap application lxd.activate.

`journalctl -xeu snap.lxd.activate.service`

-- Boot 068d18bbbcca4cd5b6c10fafcaadde33 --
Oct 19 05:43:35 adequate-foal systemd[1]: Starting Service for snap application lxd.activate...
░░ Subject: A start job for unit snap.lxd.activate.service has begun execution
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ A start job for unit snap.lxd.activate.service has begun execution.
░░ 
░░ The job identifier is 534.
Oct 19 05:43:36 adequate-foal systemd[1]: snap.lxd.activate.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ An ExecStart= process belonging to unit snap.lxd.activate.service has exited.
░░ 
░░ The process' exit code is 'exited' and its exit status is 1.
Oct 19 05:43:36 adequate-foal systemd[1]: snap.lxd.activate.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ The unit snap.lxd.activate.service has entered the 'failed' state with result 'exit-code'.
Oct 19 05:43:36 adequate-foal systemd[1]: Failed to start Service for snap application lxd.activate.
░░ Subject: A start job for unit snap.lxd.activate.service has failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ A start job for unit snap.lxd.activate.service has finished with a failure.
░░ 
░░ The job identifier is 534 and the job result is failed.

Anything else we can run to get more info on this? It’s incredibly frustrating to not be able to use the ubuntu:22.04 upstream image. There must be something on these particular desktop workstations (one Ubuntu 22.04, one Debian 12) which triggers this…

I even tried reinstalling and reconfiguring lxd from scratch (on that Debian 12 machine), using the default settings, to no avail.

perlun · October 23, 2023, 10:15am

We still haven’t found the exact cause of this, but: it only happens on machines running non-Ubuntu kernels (xanmod, Debian stock kernel). So it seems like it’s somehow some of the Canonical/Ubuntu-only patches that makes the difference in this case…

Ubuntu 22.04 and 20.04 containers stuck in "System is booting" up state

Image versions used

systemctl status output

Workaround

Some more details from our debugging effort

systemctl status snap.lxd.activate.service

journalctl -xeu snap.lxd.activate.service

`systemctl status` output

`systemctl status snap.lxd.activate.service`

`journalctl -xeu snap.lxd.activate.service`