Ceph OSD Fails to Start in LXD Container After Upgrading Host from Ubuntu 22.04.3 to 22.04.4 (kernel from 6.2 to 6.5)

Hello,

I have an issue with my Ceph cluster, which runs inside LXD containers (Compute and OSDs) on Ubuntu 20.04 plus OpenStack Victoria from the Ubuntu Cloud Archive (UCA).

So, to be clear: my Ceph OSDs are hosted within LXD containers, and everything functions correctly when the host is Ubuntu 22.04.3 with Linux 6.2.

I currently use LXD 5.20, installed from the snap, which works fine on Ubuntu 22.04.3 with Linux 6.2.

However, after upgrading the host from Ubuntu 22.04.3 to 22.04.4 (which pulls in the linux-generic-hwe-22.04 package and, with it, Linux 6.5), the Ceph OSD daemon (itself unchanged: Ubuntu 20.04 + UCA Victoria) inside the LXD container fails to start.

Here is a portion of the LXD profile (which works on Ubuntu 22.04.3 w/ Linux 6.2):

config:
  raw.lxc: |-
    lxc.apparmor.profile = unconfined
    lxc.cgroup2.devices.allow = b 253:* rwm
    lxc.mount.entry = /proc/sys/vm proc/sys/vm proc bind,rw 0 0
    lxc.mount.entry = /proc/sys/fs proc/sys/fs proc bind,rw 0 0
  security.privileged: "true"
description: osds
devices:
...

Here is a portion of the LXD container config for the Ceph OSD (which works on Ubuntu 22.04.3 w/ Linux 6.2):

...
devices:
  mapper-control:
    path: /dev/mapper/control
    type: unix-char
  sda:
    path: /dev/sda
    source: /dev/disk/by-id/ata-Kingston_SSD_XYZ
    type: unix-block
  sdc:
    path: /dev/sdc
    source: /dev/disk/by-id/ata-Seagate_HDD_XYSA
    type: unix-block
  sdd:
    path: /dev/sdd
    source: /dev/disk/by-id/ata-Seagate_HDD_XYCZ
    type: unix-block
  sys-fs:
    path: /proc/sys/fs
    source: /proc/sys/fs
    type: disk
  sys-vm:
    path: /proc/sys/vm
    source: /proc/sys/vm
    type: disk
...
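
For reference, device entries like these can also be managed with LXD's CLI instead of editing the YAML directly; a sketch using the names from my example above (the container name is illustrative):

lxc config device add osd-1 sdc unix-block \
  source=/dev/disk/by-id/ata-Seagate_HDD_XYSA \
  path=/dev/sdc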

Since the host upgrade, the Ceph OSD inside the container (Ubuntu 20.04 + UCA Victoria) no longer starts, with the following errors:

[ceph_volume.process][INFO  ] Running command: /usr/sbin/ceph-volume lvm trigger 1-<REMOVED>
[ceph_volume.process][INFO  ] Running command: /usr/sbin/ceph-volume lvm trigger 4-<REMOVED>
[ceph_volume.process][INFO  ] stderr Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-999
/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-block-<REMOVED>/osd-block-<REMOVED> --path /var/lib/ceph/osd/ceph-999 --no-mon-config
abel for /dev/ceph-block-<REMOVED>/osd-block-<REMOVED>: (1) Operation not permitted
400 <STRING> -1 bluestore(/dev/ceph-block-<REMOVED>/osd-block-<REMOVED>) _read_bdev_label failed to open /dev/ceph-block-<REMOVED>/osd-block-<REMOVED>: (1) Operation not permitted
d returned non-zero exit status: 1
[ceph_volume.process][INFO  ] stderr Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-9999
/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-block-<REMOVED>/osd-block-<REMOVED> --path /var/lib/ceph/osd/ceph-9999 --no-mon-config
abel for /dev/ceph-block-<REMOVED>/osd-block-<REMOVED>: (1) Operation not permitted
400 <STRING> -1 bluestore(/dev/ceph-block-<REMOVED>/osd-block-<REMOVED>) _read_bdev_label failed to open /dev/ceph-block-<REMOVED>/osd-block-<REMOVED>: (1) Operation not permitted
d returned non-zero exit status: 1
[systemd][WARNING] command returned non-zero exit status: 1
[systemd][WARNING] failed activating OSD, retries left: 1
[systemd][WARNING] command returned non-zero exit status: 1
[systemd][WARNING] failed activating OSD, retries left: 1

As a result, /var/lib/ceph/osd/ceph-XYZ is no longer mounted inside the LXD container (it was before the host's upgrade to Ubuntu 22.04.4), and the Ceph OSDs never come online in the Ceph MONs.

To debug, I ran:

root@osd-1:~# dd if=/dev/ceph-block-<REMOVED>/osd-block-<REMOVED> of=/tmpdata bs=1024 count=1000
dd: failed to open '/dev/ceph-block-<REMOVED>/osd-block-<REMOVED>': Operation not permitted

It fails with the same “Operation not permitted” error as in the ceph-volume logs above!
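
In case it helps, here is roughly how I compare what the OSD is trying to open against the actual device nodes (paths follow the <REMOVED> placeholders above):

# the LV path the OSD opens is a symlink into /dev/mapper
ls -l /dev/ceph-block-<REMOVED>/
# print major:minor (in hex) of the dm nodes it resolves to
stat -c '%t:%T %n' /dev/mapper/ceph--block--*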

NOTE: I’m running /sbin/lvm vgmknodes --refresh as a systemd service inside the Ceph OSD container; otherwise the LVM device nodes are missing, the LVM utilities won’t work, and Ceph Ansible doesn’t deploy anything in the first place. I also tuned a few options in /etc/lvm/lvm.conf so that LVM2 (vgcreate, lvdisplay, etc.) works inside LXD containers.
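
For reference, that service is essentially a one-shot unit along these lines (the unit name and ordering are my own choices; adjust to taste):

[Unit]
Description=Recreate LVM device nodes for Ceph OSDs
Before=ceph-osd.target

[Service]
Type=oneshot
ExecStart=/sbin/lvm vgmknodes --refresh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Enabled once with systemctl enable --now, it recreates the /dev/mapper and /dev/<vg>/<lv> nodes on every boot of the container.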

I intend to keep running my Ceph OSDs as LXD containers on Ubuntu 22.04.4 w/ Linux 6.5 (and later on Ubuntu 24.04) while ensuring they function correctly. For now, the other nodes in the cluster are working as expected (Ceph OSDs inside LXD containers on hosts with Ubuntu 22.04.3 w/ Linux 6.2), since I’m holding back the upgrade to Ubuntu 22.04.4 w/ Linux 6.5.

Please note that I run Ceph OSDs inside LXD for my OpenStack cloud; I am not using Ceph or LVM as LXD’s backend storage! I use LXD containers like physical/virtual machines that host something, such as a Ceph OSD in one container and OpenStack Nova Compute/Network in another.

So, given that LXD is the same snap package on Ubuntu 22.04.3 and 22.04.4, I expected no issues: neither the snap nor the containers were modified; only the host kernel is new.

If I remove the linux-generic-hwe-22.04 package and all the Linux 6.5 packages and reboot back into Linux 6.2, the Ceph OSD inside LXD works again! So it’s clear that something changed in Linux 6.5 that breaks advanced LXD profiles with low-level device access.
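
For the record, the revert is roughly the following (exact package names taken from my installed 6.5 kernel, see the uname output further down; adjust to your set):

sudo apt remove linux-generic-hwe-22.04 \
  linux-image-6.5.0-25-generic \
  linux-headers-6.5.0-25-generic \
  linux-modules-6.5.0-25-generic
sudo apt autoremove --purge
sudo reboot   # GRUB boots the remaining 6.2 kernel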

A final note: if possible, I would like to run my Ceph OSDs as unprivileged containers! Even better, without any raw.lxc at all, to rely only on LXD’s own infrastructure.

I kindly request your advice, as this prevents me from upgrading my entire infrastructure to Ubuntu 22.04.4 or newer.

NOTE: I also tried the ideas from this post: https://chris-sanders.github.io/2018-05-11-block-device-in-containers/ - they didn’t work. Also, I’ve never touched udev in my environment.

Reference: https://discuss.linuxcontainers.org/t/ceph-osd-fails-to-start-in-lxd-container-after-upgrading-host-from-ubuntu-20-04-to-22-04/17290 - The same issue happened when I upgraded the LXD host from Ubuntu 20.04 to Ubuntu 22.04. Back then, people helped and suggested replacing cgroup with cgroup2 in the profile, which worked! But now the issue is back. Is it cgroup3 now? :sweat_smile: I also asked for help again on that thread, but nobody answered; I’m guessing that’s because LXD support has moved here. So, help! lol

Thank you for any assistance you can provide.

Cheers!


Hi @amikhalitsyn, as discussed, please can you take a look at this issue and see if we can help?

Thanks

@eCoder would you be able to provide the output of sudo dmesg when you encounter the problem please?

Hi @eCoder !

First of all, thanks a lot for such a great and detailed report about the problem!

That’s clearly a regression on the kernel side… which is bad, and we definitely have to do something about it.

As Tom suggested, it would be great to see the last records from sudo dmesg right after you reproduce these errors inside the container. (We expect to see some AppArmor logs.)
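
Something like this should surface the relevant records (plain util-linux dmesg flags):

sudo dmesg --ctime | grep -iE 'apparmor|denied' | tail -n 50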


Hi @amikhalitsyn!

No problem! I’m happy to report issues so we can improve Ubuntu and LXD.

I stopped the LXD container so it wouldn’t come back automatically on the next boot.

Then I installed linux-generic-hwe-22.04 again and rebooted into Linux 6.5.

I started tail -F /var/log/kernel.log, then ran lxc start osd-1 to start the container.

While the container runs, the Ceph OSD daemon keeps trying to start (the errors I posted initially loop in the logs for a while). Nothing relevant regarding AppArmor shows up in kernel.log or dmesg. Look:

[   23.157148] bond0: port 14(vnet12) entered blocking state
[   23.157151] bond0: port 14(vnet12) entered forwarding state
[  161.959658] Bluetooth: RFCOMM TTY layer initialized
[  161.959664] Bluetooth: RFCOMM socket layer initialized
[  161.959667] Bluetooth: RFCOMM ver 1.11
[  295.653362] bond0: port 15(vethXYZ) entered blocking state
[  295.653367] bond0: port 15(vethXYZ) entered disabled state
[  295.653387] vethXYZ: entered allmulticast mode
[  295.653616] vethXYZ: entered promiscuous mode
[  295.681273] bond1: port 3(vethXYZ) entered blocking state
[  295.681278] bond1: port 3(vethXYZ) entered disabled state
[  295.681293] vethXYZ: entered allmulticast mode
[  295.681401] vethXYZ: entered promiscuous mode
[  295.681430] bond1: port 3(vethXYZ) entered blocking state
[  295.681431] bond1: port 3(vethXYZ) entered forwarding state
[  295.715227] kauditd_printk_skb: 53 callbacks suppressed
[  295.715229] audit: type=1400 audit(1709049765.804:341): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd-cosstor-1_</var/snap/lxd/common/lxd>" pid=5073 comm="apparmor_parser"
[  295.812951] physXXX: renamed from vethXYZ
[  295.888969] physXXX: renamed from vethXYZ
[  295.917236] eth0: renamed from physXXX
[  295.937031] eth1: renamed from physXXX
[  295.961120] bond0: port 15(vethXYZ) entered blocking state
[  295.961124] bond0: port 15(vethXYZ) entered forwarding state
[  296.674162] br1: port 1(eth1.999) entered blocking state
[  296.674169] br1: port 1(eth1.999) entered disabled state
[  296.674185] eth1.999: entered allmulticast mode
[  296.674233] eth1.999: entered promiscuous mode
[  296.674553] br-storage: port 1(eth1.998) entered blocking state
[  296.674557] br-storage: port 1(eth1.998) entered disabled state
[  296.674575] eth1.998: entered allmulticast mode
[  296.674618] eth1.998: entered promiscuous mode
[  296.674864] br0: port 1(eth0.996) entered blocking state
[  296.674868] br0: port 1(eth0.996) entered disabled state
[  296.674879] eth0.996: entered allmulticast mode
[  296.674929] eth0.996: entered promiscuous mode
[  296.675166] br-mgmt: port 1(eth0.997) entered blocking state
[  296.675169] br-mgmt: port 1(eth0.997) entered disabled state
[  296.675180] eth0.997: entered allmulticast mode
[  296.675217] eth0.997: entered promiscuous mode
[  296.675603] eth1: entered allmulticast mode
[  296.675607] eth1: entered promiscuous mode
[  296.675643] br1: port 1(eth1.999) entered blocking state
[  296.675645] br1: port 1(eth1.999) entered forwarding state
[  296.675987] br-storage: port 1(eth1.998) entered blocking state
[  296.675990] br-storage: port 1(eth1.998) entered forwarding state
[  296.676269] eth0: entered allmulticast mode
[  296.676273] eth0: entered promiscuous mode
[  296.676306] br0: port 1(eth0.996) entered blocking state
[  296.676308] br0: port 1(eth0.996) entered forwarding state
[  296.676668] br-mgmt: port 1(eth0.997) entered blocking state
[  296.676671] br-mgmt: port 1(eth0.997) entered forwarding state
[  298.886102] ata1.00: Enabling discard_zeroes_data
[  309.553576] NOHZ tick-stop error: local softirq work is pending, handler #200!!!
[  317.397932] NOHZ tick-stop error: local softirq work is pending, handler #200!!!
[  330.506376] NOHZ tick-stop error: local softirq work is pending, handler #200!!!
[  345.722791] NOHZ tick-stop error: local softirq work is pending, handler #200!!!
[  361.926948] NOHZ tick-stop error: local softirq work is pending, handler #200!!!
[  367.098976] NOHZ tick-stop error: local softirq work is pending, handler #200!!!
[  377.631074] NOHZ tick-stop error: local softirq work is pending, handler #200!!!
[  377.691080] NOHZ tick-stop error: local softirq work is pending, handler #200!!!
[  382.843057] NOHZ tick-stop error: local softirq work is pending, handler #200!!!

Then I reverted the changes, rebooted back into Linux 6.2, and the Ceph OSD came back online:

[   22.814765] bond0: port 14(vnet12) entered blocking state
[   22.814767] bond0: port 14(vnet12) entered forwarding state
[ 1356.611319] Bluetooth: RFCOMM TTY layer initialized
[ 1356.611324] Bluetooth: RFCOMM socket layer initialized
[ 1356.611328] Bluetooth: RFCOMM ver 1.11
[ 1401.632411] bond0: port 15(vethXYZ) entered blocking state
[ 1401.632416] bond0: port 15(vethXYZ) entered disabled state
[ 1401.632637] device vethXYZ entered promiscuous mode
[ 1401.653111] bond1: port 3(vethXYZ) entered blocking state
[ 1401.653115] bond1: port 3(vethXYZ) entered disabled state
[ 1401.653393] device vethXYZ entered promiscuous mode
[ 1401.653432] bond1: port 3(vethXYZ) entered blocking state
[ 1401.653434] bond1: port 3(vethXYZ) entered forwarding state
[ 1401.688786] kauditd_printk_skb: 51 callbacks suppressed
[ 1401.688787] audit: type=1400 audit(1709051807.560:285): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd-cosstor-1_</var/snap/lxd/common/lxd>" pid=6251 comm="apparmor_parser"
[ 1401.824800] physXXX: renamed from vethXYZ
[ 1401.880709] physXXX: renamed from vethXYZ
[ 1401.934088] eth0: renamed from physXXX
[ 1401.969118] eth1: renamed from physXXX
[ 1401.985069] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 1401.985156] bond0: port 15(vethXYZ) entered blocking state
[ 1401.985164] bond0: port 15(vethXYZ) entered forwarding state
[ 1402.733495] br-storage: port 1(eth1.998) entered blocking state
[ 1402.733503] br-storage: port 1(eth1.998) entered disabled state
[ 1402.733597] device eth1.183 entered promiscuous mode
[ 1402.734129] br1: port 1(eth1.999) entered blocking state
[ 1402.734135] br1: port 1(eth1.999) entered disabled state
[ 1402.734218] device eth1.184 entered promiscuous mode
[ 1402.734610] br0: port 1(eth0.996) entered blocking state
[ 1402.734615] br0: port 1(eth0.996) entered disabled state
[ 1402.734679] device eth0.996 entered promiscuous mode
[ 1402.735052] br-mgmt: port 1(eth0.997) entered blocking state
[ 1402.735057] br-mgmt: port 1(eth0.997) entered disabled state
[ 1402.735119] device eth0.997 entered promiscuous mode
[ 1402.735829] device eth1 entered promiscuous mode
[ 1402.735892] br-storage: port 1(eth1.998) entered blocking state
[ 1402.735896] br-storage: port 1(eth1.998) entered forwarding state
[ 1402.736318] br1: port 1(eth1.999) entered blocking state
[ 1402.736324] br1: port 1(eth1.999) entered forwarding state
[ 1402.736758] device eth0 entered promiscuous mode
[ 1402.736817] br0: port 1(eth0.996) entered blocking state
[ 1402.736820] br0: port 1(eth0.996) entered forwarding state
[ 1402.737211] br-mgmt: port 1(eth0.997) entered blocking state
[ 1402.737216] br-mgmt: port 1(eth0.997) entered forwarding state
[ 1404.975710] ata1.00: Enabling discard_zeroes_data

This time, I’m not seeing the NOHZ tick-stop error: local softirq work is pending, handler #200!!! message anymore, even after 30 minutes of uptime.

And the Ceph OSD is back online inside the LXD container (Linux 6.2).

Thanks again!

Hi!

Could you show the output of the following?

ls -lan /dev/ceph-block-<REMOVED>/

What I want to verify is that the major/minor IDs have not changed and that you still have a corresponding device cgroup rule allowing access to the devices.
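
And, to cross-check on the host side, something like this (the profile name is just an example):

# which major the device-mapper driver registered on this kernel
grep device-mapper /proc/devices
# the allow rule currently in the profile
lxc profile show osds | grep devices.allow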

  • Host Ubuntu 22.04 with Linux 6.2:
manager@lxd-host-1:~$ lsb_release -ra
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy
manager@lxd-host-1:~$ uname -a
Linux lxd-host-1 6.2.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Oct  6 10:23:26 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

LXD Container with Ubuntu 20.04 and Ceph OSD 15.2.17-0ubuntu0.20.04.6:

manager@lxd-host-1:~$ lxc exec lxd-osd-1 -- bash -i
root@lxd-osd-1:~# lsb_release -ra
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.6 LTS
Release:        20.04
Codename:       focal
root@lxd-osd-1:~# ll /dev/mapper/
total 0
drwxr-xr-x  2 root root      140 Mar 11 12:22 ./
drwxr-xr-x 13 root root      680 Mar 11 12:22 ../
brw-rw----  1 ceph ceph 253,   3 Mar 16 09:12 ceph--block--<REMOVED>
brw-rw----  1 ceph ceph 253,   2 Mar 16 09:12 ceph--block--<REMOVED>
brw-rw----  1 ceph ceph 253,   1 Mar 16 09:12 ceph--block--dbs--<REMOVED>
brw-rw----  1 ceph ceph 253,   0 Mar 16 09:12 ceph--block--dbs--<REMOVED>
crw-rw----  1 root root  10, 236 Mar 11 12:22 control
root@lxd-osd-1:~# service ceph-osd@1 status
● ceph-osd@1.service - Ceph object storage daemon osd.1
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2024-03-11 13:12:52 EDT; 4 days ago
    Process: 1519 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 1 (code=exited, status=0/SUCCESS)
   Main PID: 1527 (ceph-osd)
      Tasks: 60
     Memory: 14.9G
        CPU: 1h 7min 54.264s
     CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@1.service
             └─1527 /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph
...

All good!

Now, apt install linux-generic-hwe-22.04 on the Ubuntu 22.04 host and reboot it:

  • Host Ubuntu 22.04 with Linux 6.5:
manager@lxd-host-1:~$ uname -a
Linux lxd-host-1 6.5.0-25-generic #25~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Feb 20 16:09:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
manager@lxd-host-1:~$ lxc exec lxd-osd-1 -- bash -i
root@lxd-osd-1:~# ll /dev/mapper/
total 0
drwxr-xr-x  2 root root      140 Mar 16 09:37 ./
drwxr-xr-x 13 root root      680 Mar 16 09:37 ../
brw-rw----  1 root disk 252,   2 Mar 16 09:37 ceph--block--<REMOVED>
brw-rw----  1 root disk 252,   3 Mar 16 09:37 ceph--block--<REMOVED>
brw-rw----  1 root disk 252,   1 Mar 16 09:37 ceph--block--dbs--<REMOVED>
brw-rw----  1 root disk 252,   0 Mar 16 09:37 ceph--block--dbs--<REMOVED>
crw-rw----  1 root root  10, 236 Mar 16 09:37 control
root@lxd-osd-1:~# service ceph-osd@1 status
● ceph-osd@1.service - Ceph object storage daemon osd.1
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Sat 2024-03-16 09:38:21 EDT; 34s ago
    Process: 517 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 1 (code=exited, status=0/SUCCESS)
    Process: 522 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id 1 --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
   Main PID: 522 (code=exited, status=1/FAILURE)
        CPU: 29ms

Mar 16 09:38:11 lxd-osd-1 systemd[1]: ceph-osd@1.service: Main process exited, code=exited, status=1/FAILURE
Mar 16 09:38:11 lxd-osd-1 systemd[1]: ceph-osd@1.service: Failed with result 'exit-code'.
Mar 16 09:38:21 lxd-osd-1 systemd[1]: ceph-osd@1.service: Scheduled restart job, restart counter is at 3.
Mar 16 09:38:21 lxd-osd-1 systemd[1]: Stopped Ceph object storage daemon osd.1.
Mar 16 09:38:21 lxd-osd-1 systemd[1]: ceph-osd@1.service: Start request repeated too quickly.
Mar 16 09:38:21 lxd-osd-1 systemd[1]: ceph-osd@1.service: Failed with result 'exit-code'.
Mar 16 09:38:21 lxd-osd-1 systemd[1]: Failed to start Ceph object storage daemon osd.1.

Failed.

It’s interesting to note that the major number changed from 253 to 252! Good catch!

Why did it change from 253 (Linux 6.2) to 252 (Linux 6.5)?

Then I updated the LXD profile to:

lxc.cgroup2.devices.allow = b 252:* rwm

And now it’s working again!!!
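
If I understand it correctly, the device-mapper major is allocated dynamically by the kernel, so it can shift between kernel versions depending on what registers first. So instead of hard-coding 252 (or keeping allow rules for both 252 and 253), I could derive it at boot with something like:

# look up the major assigned to device-mapper on the running kernel
awk '$2 == "device-mapper" {print $1}' /proc/devices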

Thank you for pointing me in the right direction, you rock, @amikhalitsyn!!!


NOTE: The following message appears only on Linux 6.5:

NOHZ tick-stop error: local softirq work is pending, handler #200!!!

But it seems unrelated to my issue; no idea what it’s about.


Cheers!
