OVS interfaces are not properly deleted when the server reboots

I have ended up with a bunch (hundreds!) of these on my Open vSwitch bridge:

Port veth14db1d02
    tag: 150
    Interface veth14db1d02
        error: "could not open network device veth14db1d02 (No such device)"
Port veth6ec5ee74
    tag: 150
    Interface veth6ec5ee74
        error: "could not open network device veth6ec5ee74 (No such device)"
Port vethe87aabaa
    tag: 130
    Interface vethe87aabaa
        error: "could not open network device vethe87aabaa (No such device)"
Port vethad50a974
    tag: 140
    Interface vethad50a974
        error: "could not open network device vethad50a974 (No such device)"
Port vethf93ee6b9
    tag: 140
    Interface vethf93ee6b9
        error: "could not open network device vethf93ee6b9 (No such device)"
Port vethef5f431e
    tag: 130
    Interface vethef5f431e
        error: "could not open network device vethef5f431e (No such device)"
Port veth55e1d15d
    tag: 110
    Interface veth55e1d15d
        error: "could not open network device veth55e1d15d (No such device)"
Port veth8123f7d7
    tag: 150
    Interface veth8123f7d7
        error: "could not open network device veth8123f7d7 (No such device)"
Port veth772ae868
    tag: 130
    Interface veth772ae868
        error: "could not open network device veth772ae868 (No such device)"

When starting, stopping or restarting a container, everything works fine: the interface pops up on start and is decommissioned when I stop the container.

But when I restart my server, the containers are not powered off properly, or OVS is not refreshed, and the interfaces are still in the OVS database.

With 30-50 containers, a couple of reboots is enough to create a huge number of stale interfaces.
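
As a stopgap, here is a minimal cleanup sketch; br0 is a placeholder for the actual bridge name, and it only deletes ports whose host-side device no longer exists:

    # Remove OVS ports whose backing network device is gone.
    # "br0" is a placeholder bridge name; adjust to your setup.
    for p in $(ovs-vsctl list-ports br0); do
        if ! ip link show "$p" >/dev/null 2>&1; then
            ovs-vsctl --if-exists del-port br0 "$p"
        fi
    done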

Hi,

You haven’t mentioned which version of LXD you are using.

Sorry… LXD 5.0.2 from Alpine 3.18.

I believe there is an lxd-feature package in Alpine with the current release; do you experience the same issue on that?

Also, if you run lxd shutdown, do you see the newly created ports cleaned up? That would confirm they are removed when LXD shuts down cleanly.
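
A minimal check sequence, assuming the port listing above came from ovs-vsctl show:

    lxd shutdown     # asks the daemon to stop all instances and exit cleanly
    ovs-vsctl show   # the "No such device" ports should no longer be listed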

Thanks @tomp. I will try lxd shutdown first, and lxd-feature in a second step. I just need to plan for it as I have some containers in production; hopefully in ~2 weeks’ time.

I’m having the same (or nearly the same) issue on LXD 5.15 installed from snap on Ubuntu 22.04, so it doesn’t depend on the distribution. Neither openvswitch.builtin nor openvswitch.external is defined via the snap config. The “nearly” part is that I only see this issue on power loss, not on planned server reboots.

The ovs0 bridge is not “managed” in my case.
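
For reference, a way those snap options can be checked, assuming the snap is installed under the name lxd:

    snap get lxd openvswitch.builtin     # errors if the option is unset
    snap get lxd openvswitch.external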

Please can you open an issue with your setup and reproducer steps? Thanks.

So, I tried lxd shutdown as you suggested; it did nothing, even after waiting for more than 5 minutes.
I opened another SSH session to my host and ran lxc list: all my containers were still running.
Cancelling the command and trying again gives: Error: Shutdown already in progress.

After about 10 minutes, service lxd status shows the service as crashed.

Is there anywhere to look to see what is hanging and preventing LXD from shutting down properly?

The only error from lxc monitor --type=logging --pretty is:
ERROR [2023-08-18T10:11:10+02:00] Failed to stop device device=wlan0 err="Failed to detach interface: \"wlan0\" to \"wlan0\": Failed to run: /usr/sbin/lxd forknet detach -- /proc/27143/fd/4 21198 wlan0 wlan0: exit status 1 (Error: Failed to run: ip link set dev wlan0 netns 21198: exit status 2 (RTNETLINK answers: Invalid argument))" instance=wireless instanceType=container project=default

So, my best guess now is that during a system shutdown, LXD tries to shut down, but after one minute the host forces the shutdown, so the OVS interfaces are not properly removed. That is not the case after the ~10 minute wait of a manual lxd shutdown, where all the interfaces were cleaned up properly.

Can you try setting this option on the instance to a low duration:

https://documentation.ubuntu.com/lxd/en/latest/reference/instance_options/#instance-boot:boot.host_shutdown_timeout

Then see if LXD will shut down cleanly.

This setting will forcefully stop an instance if it takes longer than the specified time to cleanly stop, which may be what is holding up the LXD shutdown.
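
As a sketch, with my-container and default as placeholder instance and profile names:

    lxc config set my-container boot.host_shutdown_timeout 30   # per instance, in seconds
    lxc profile set default boot.host_shutdown_timeout 30       # for every instance using the profile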

Thanks! Is there a simple way to see which instance is causing this? I have a lot of instances; instead of applying this per instance, would it be better per profile or globally? (Although, rather than all that, perhaps finding the specific instance that makes it hang is a better idea.) I’d appreciate your guidance on which logs can help.

You could try setting instances.nic.host_name to mac

https://documentation.ubuntu.com/lxd/en/latest/server/#server-miscellaneous:instances.nic.host_name

That would derive the host-side interface name from the instance NIC’s MAC address.
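
As a sketch (this is a server-level key, so no instance name is given):

    lxc config set instances.nic.host_name mac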

Thanks, but I don’t quite see how this would indicate which instance is causing the issue, since when I have the problem the virtual NICs from all my instances are left behind.

Do you think applying this setting would always use the same MAC/name, so that when the server restarts it would reuse the ones created before?

Just trying to understand. Many thanks!