Issues with MicroCloud on Raspberry Pi 4

Hi, I have been playing with MicroCloud over the weekend as an experiment with three Raspberry Pi 4 boards. Unfortunately, it does not seem to work properly with my setup, and I am having difficulty debugging things further. As I do not feel I have enough information to open an issue yet, I am starting this conversation to describe my setup and issues, and hopefully get some pointers from the community on how to debug things.

First of all, the hardware. I am running three Raspberry Pi 4 Model B boards (8 GB of RAM each) on Ubuntu 22.04. For storage, they all boot from a 64 GB microSD card for the OS and have an NVMe SSD attached over USB 3.0. They are connected over gigabit Ethernet to my network, with three different VLANs. The interfaces are as follows:

eth0:    10.10.0.x/24  # used for management
homelab: 10.10.10.x/24 # used in this case for mDNS
ovn:     unconfigured  # attached to subnet 10.10.11.0/24, with router listening on 10.10.11.1

I initially tried to initialize MicroCloud but failed because of missing kernel modules. Installing the kernel from -proposed solved that issue, and I was then able to initialize properly. I did not configure local storage, and configured the external NVMe SSDs for Ceph. I used the homelab interface for MicroCloud mDNS and configured OVN with the ovn interface, with 10.10.11.1/24 as the gateway and 10.10.11.51-10.10.11.254 as the available range.
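For anyone hitting the same module problem, the fix looked roughly like this (a sketch only: rbd and openvswitch are my guesses for the missing modules, and linux-raspi is the Raspberry Pi kernel package on Ubuntu):

# Check whether the modules MicroCloud needs can be loaded
# (rbd for Ceph, openvswitch for OVN):
sudo modprobe rbd
sudo modprobe openvswitch

# Enable the -proposed pocket on the arm64 ports archive and
# install the updated kernel, then reboot:
echo "deb http://ports.ubuntu.com/ubuntu-ports jammy-proposed main" | sudo tee /etc/apt/sources.list.d/proposed.list
sudo apt update
sudo apt install -t jammy-proposed linux-raspi
sudo reboot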

I was able to start a container instance of 22.04, but for some reason it seemed to freeze, and I was not able to access the console logs, open a terminal, or exec anything into it. Trying to stop the container also failed silently, even when using --force. I was able to get rid of it by rebooting all the nodes and then deleting the now-stopped container.
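For reference, these are the kinds of commands that hung or failed silently (u1 is the instance name, as in the logs below):

lxc console u1 --show-log   # no console log output
lxc exec u1 -- bash         # hangs
lxc stop u1 --force         # fails silently, container stays stuck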

I was then able to start a new container and use the terminal for a bit. I tried using rockcraft in destructive mode in that container, and it seemed to be working. I was expecting that to take a long time, so I let it continue overnight. The following day it was still running, but the container was in a weird state. I am now not able to stop it or execute anything in it again. I am also not able to start more containers or VMs.

I do not have many logs to go on, but sudo snap logs lxd outputs this:

2023-11-27T13:53:35-05:00 lxd.daemon[2211]: time="2023-11-27T13:53:35-05:00" level=warning msg="Failed to retrieve network information via netlink" instance=u1 instanceType=container pid=3714 project=default
2023-11-27T13:53:35-05:00 lxd.daemon[2211]: time="2023-11-27T13:53:35-05:00" level=error msg="Error calling 'lxd forknet" err="Failed to run: /snap/lxd/current/bin/lxd forknet info -- 3714 3: exit status 1 (Failed setns to container network namespace: No such file or directory)" instance=u1 instanceType=container pid=3714 project=default
2023-11-27T13:53:40-05:00 lxd.daemon[2211]: time="2023-11-27T13:53:40-05:00" level=error msg="Failed to retrieve PID of executing child process" instance=u1 instanceType=container project=default
2023-11-27T13:53:42-05:00 lxd.daemon[2211]: time="2023-11-27T13:53:42-05:00" level=error msg="Failed to retrieve PID of executing child process" instance=u1 instanceType=container project=default
2023-11-27T13:56:53-05:00 lxd.daemon[2211]: time="2023-11-27T13:56:53-05:00" level=warning msg="Failed to retrieve network information via netlink" instance=u1 instanceType=container pid=3714 project=default
2023-11-27T13:56:53-05:00 lxd.daemon[2211]: time="2023-11-27T13:56:53-05:00" level=error msg="Error calling 'lxd forknet" err="Failed to run: /snap/lxd/current/bin/lxd forknet info -- 3714 3: exit status 1 (Failed setns to container network namespace: No such file or directory)" instance=u1 instanceType=container pid=3714 project=default
2023-11-27T13:56:55-05:00 lxd.daemon[2211]: time="2023-11-27T13:56:55-05:00" level=error msg="Failed to retrieve PID of executing child process" instance=u1 instanceType=container project=default
2023-11-27T13:57:31-05:00 lxd.daemon[2211]: time="2023-11-27T13:57:31-05:00" level=error msg="Failed to retrieve PID of executing child process" instance=u1 instanceType=container project=default
2023-11-27T13:57:40-05:00 lxd.daemon[2211]: time="2023-11-27T13:57:40-05:00" level=warning msg="Failed to retrieve network information via netlink" instance=u1 instanceType=container pid=3714 project=default
2023-11-27T13:57:40-05:00 lxd.daemon[2211]: time="2023-11-27T13:57:40-05:00" level=error msg="Error calling 'lxd forknet" err="Failed to run: /snap/lxd/current/bin/lxd forknet info -- 3714 3: exit status 1 (Failed setns to container network namespace: No such file or directory)" instance=u1 instanceType=container pid=3714 project=default

So this is where I am right now: MicroCloud reports everything as operational, but it is completely unusable. Any pointers on what I am doing wrong, or things I could check to make this work, would be appreciated.


Please can you attach the contents of /var/snap/lxd/common/lxd/logs/lxd.log?

Can you also share the netplan configuration you have on the hosts? Thanks.

I have investigated a bit this weekend, and I think I have tracked it down to a hardware issue where the NVMe SSDs are not getting enough power from the Raspberry Pi. I tried adding usb-storage.quirks to rule out a UAS driver issue instead, and still had problems. I do not yet have the necessary powered hubs to test the power theory, but will update once I have them.
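In case it helps others, the quirk is applied roughly like this (the vendor:product ID below is only an example; substitute the one lsusb reports for your enclosure):

# Find the USB-to-NVMe bridge's vendor:product ID:
lsusb
# e.g. "Bus 002 Device 002: ID 174c:55aa ASMedia Technology Inc. ..."

# Append this to the kernel command line (/boot/firmware/cmdline.txt on
# Ubuntu for Raspberry Pi) and reboot. The trailing "u" makes the kernel
# skip UAS for that device and fall back to the plain usb-storage driver:
usb-storage.quirks=174c:55aa:u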

Here are some logs to help future users looking for this issue:

dmesg

Dec 04 14:24:56 pi2 kernel: sd 0:0:0:0: [sda] tag#6 uas_eh_abort_handler 0 uas-tag 12 inflight: CMD OUT
Dec 04 14:24:56 pi2 kernel: sd 0:0:0:0: [sda] tag#6 CDB: Write(10) 2a 00 00 0c ec 60 00 00 08 00
Dec 04 14:24:56 pi2 kernel: sd 0:0:0:0: [sda] tag#5 uas_eh_abort_handler 0 uas-tag 11 inflight: CMD OUT
Dec 04 14:24:56 pi2 kernel: sd 0:0:0:0: [sda] tag#5 CDB: Write(10) 2a 00 00 0c ec 58 00 00 08 00
Dec 04 14:24:56 pi2 kernel: sd 0:0:0:0: [sda] tag#4 uas_eh_abort_handler 0 uas-tag 10 inflight: CMD OUT
Dec 04 14:24:56 pi2 kernel: sd 0:0:0:0: [sda] tag#4 CDB: Write(10) 2a 00 00 0c ec 50 00 00 08 00
Dec 04 14:24:56 pi2 kernel: sd 0:0:0:0: [sda] tag#0 uas_eh_abort_handler 0 uas-tag 9 inflight: CMD OUT
Dec 04 14:24:56 pi2 kernel: sd 0:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 00 0c ec 48 00 00 08 00
Dec 04 14:24:56 pi2 kernel: scsi host0: uas_eh_device_reset_handler start
Dec 04 14:24:57 pi2 kernel: usb 2-1: reset SuperSpeed USB device number 2 using xhci_hcd
Dec 04 14:25:00 pi2 kernel: usb 2-1: Enable of device-initiated U1 failed.
Dec 04 14:25:00 pi2 kernel: usb 2-1: Enable of device-initiated U2 failed.
Dec 04 14:25:00 pi2 kernel: scsi host0: uas_eh_device_reset_handler success

When trying to launch an instance, I do not always get interesting logs, but I got this during my latest try:

pi@pi3:~$ sudo cat /var/snap/lxd/common/lxd/logs/lxd.log
time="2023-12-04T14:23:13-05:00" level=warning msg=" - Couldn't find the CGroup network priority controller, per-instance network priority will be ignored. Please use per-device limits.priority instead"
time="2023-12-04T14:28:44-05:00" level=error msg="Error getting disk usage" err="Cannot get disk usage of unmounted volume when ceph.rbd.du is false" instance=u1 instanceType=container project=default

This one shows an error with the underlying storage, but I did not always get such logs on previous tries. For this latest try, the instance I am launching is now stuck at Retrieving image: Unpack: 100% (397.30MB/s).
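To correlate these errors with the underlying storage, the Ceph side can also be checked through MicroCeph (generic health checks, not specific to this failure):

sudo microceph.ceph status       # overall cluster health
sudo microceph.ceph osd status   # state of the NVMe-backed OSDs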

Here is what the netplan configuration looks like; I do not think there are any issues there:

network:
  version: 2
  ethernets:
    eth0:
      addresses: [ "10.10.0.13/24" ]
      nameservers:
        addresses: [ "10.10.0.2" ]
        search: [ home.arpa ]
      routes:
        - to: default
          via: 10.10.0.1
          metric: 100

  vlans:
    homelab:
      id: 10
      link: eth0
      addresses: [ "10.10.10.13/24" ]
      nameservers:
        addresses: [ "10.10.0.2" ]
      routes:
        - to: default
          via: 10.10.10.1
          metric: 200
    ovn:
      id: 11
      link: eth0
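
After a netplan apply, the VLAN interfaces can be verified with iproute2 (ovn is expected to stay unconfigured, since OVN uses it as a raw uplink):

sudo netplan apply
ip -br addr show       # homelab should carry 10.10.10.13/24, ovn no address
ip -d link show ovn    # confirms a VLAN with id 11 on top of eth0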

As I said, I will procure some powered hubs and retry with them to confirm the current theory.


Regarding netplan, I just wanted to double-check / test using a VLAN as the OVN uplink. It works, thanks.

I got the USB powered hubs today and was able to redeploy MicroCloud from scratch. I have now been able to run three containers, each running rockcraft pack --destructive-mode at the same time for different projects. I was previously not coming anywhere close to that, so I think this confirms the issue was power related. Hopefully this thread can help other people who find themselves in the same situation.
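For anyone wanting to reproduce a similar load test, it was roughly this per container (instance names and project paths are placeholders):

lxc launch ubuntu:22.04 build1
lxc exec build1 -- snap install rockcraft --classic
lxc exec build1 -- sh -c 'cd /root/project1 && rockcraft pack --destructive-mode'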
