Lxc shell abnormal closure

Hi, I have recently started to observe a strange issue that did not happen in previous years of using LXD.

I often use lxc shell to connect to containers and virtual machines. Recently it has started to hang when the server machine becomes more loaded, and after a short while it disconnects from the container with Error: websocket: close 1006 (abnormal closure): unexpected EOF. If I try to connect again, it just hangs and never connects. The host machine is indeed loaded, but it is not overloaded; when I use ssh instead of lxc shell, everything works smoothly. Once the load on the machine drops again, lxc shell starts working again too.

Do you observe the same on your machines? What could be the reason, and how can I diagnose it?
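
Not from the thread itself, but one way to gather data the next time the disconnect happens could be the following (a sketch; it assumes the snap-packaged LXD, whose daemon unit is snap.lxd.daemon):

sudo journalctl -u snap.lxd.daemon -n 200   # recent LXD daemon log on the host
lxc monitor --pretty --type=logging         # stream LXD log events while reproducing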

What LXD version are you running?

$ lxc --version
5.15

Please can you try latest/candidate?
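
For reference, switching a snap-installed LXD to that channel would look roughly like this (a sketch, not a command quoted from the thread):

sudo snap refresh lxd --channel=latest/candidate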

I can’t, it’s a production machine.

It currently represents LXD 5.16. There were some fixes related to exec websockets that we want to rule out before debugging further.

OK, I will report back in a few weeks when the machines get restarted with the new LXD version. Thanks :slight_smile:

I (still) have exactly the same issue as the OP. Using Ubuntu 20.04 and LXD 5.17 (latest/stable).

Please could you post the reproducer steps you use to trigger this?

I have a VM (VMware) running Ubuntu 20.04 with several containers (no VMs).

I just run lxc exec <container> bash

When the load on the container is a bit high (the container that starts a docker compose is the one that suffers the most), I get thrown out of the container with the following error:
Error: websocket: close 1006 (abnormal closure): unexpected EOF

I can sometimes re-enter the container immediately, but sometimes I get the same error again.
Most of the time I can get back into the container on the second attempt. The higher the load (it’s not a ridiculously high load, i.e. something between 3 and 5 with 4 cores), the more often the issue occurs.

There is no issue at all if I enter the container with SSH.

This particular VM was originally set up with Ubuntu 18.04 (and LXD from apt). Then I did a do-release-upgrade (some 3 weeks ago) to upgrade to 20.04. During that process LXD was migrated to the snap on the 4.0/stable channel. Then I changed to the latest/stable channel.

Yes, the issue we have is that we aren’t able to reproduce the problem (we never have been able to, sadly) and thus we are struggling to resolve it.

Do you have specific steps for how you are producing load that is a “bit high”?

I have the feeling it has more to do with I/O than with CPU, but I don’t know. I’ll try to set up an environment that is easy to mimic and see if I can recreate the issue. In the container that I have the most problems with, I am trying to set up an Edlib development environment (a consolidated command sketch follows the list below):

  1. So, I have a new container (lxc launch ubuntu:22.04 -c security.nesting=true edlib)
  2. Go into the container
  3. snap install docker
  4. Follow the steps in https://docs.edlib.com/docs/developers/getting-started/
  • git clone https://github.com/cerpus/Edlib.git edlib
  • cd edlib
  • cp localSetup/.env.example localSetup/.env
  • Then there is a bug, so you need to edit sourcecode/apis/contentauthor/Dockerfile and add a slash to the . at the end of line 80 so that the line reads:
    COPY --from=composer_deps /app/composer.json /app/composer.lock ./
  • Then do a docker compose up -d
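
A consolidated sketch of the steps above, runnable from the host (the sed edit mirrors the manual fix to line 80 of the Dockerfile and is untested; everything else is taken from the list):

lxc launch ubuntu:22.04 edlib -c security.nesting=true
lxc exec edlib -- snap install docker
lxc exec edlib -- git clone https://github.com/cerpus/Edlib.git edlib
lxc exec edlib -- cp edlib/localSetup/.env.example edlib/localSetup/.env
# Dockerfile bug workaround: turn the trailing "." on line 80 into "./"
lxc exec edlib -- sed -i '80s|\.$|./|' edlib/sourcecode/apis/contentauthor/Dockerfile
lxc exec edlib -- sh -c 'cd edlib && docker compose up -d'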

So far I get thrown out during that process. I hope this helps you a bit.
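
Since the hunch above is that I/O rather than CPU is the trigger, one way to watch I/O pressure on the host while docker compose up runs (a sketch; iostat comes from the sysstat package, and zfspool is the LXD storage pool name used in this setup):

iostat -x 2                 # per-device utilisation and await times
zpool iostat -v zfspool 2   # per-vdev throughput of the LXD storage pool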

Thanks, what is the spec of the host VM?

And which kernel version is it running (uname -a)?

It’s happening all the time for me. Host Ubuntu 20 or 22 does not matter, and the LXC guest does not matter either: Alpine, all varieties of Debian or Ubuntu. AMD processor.

 uname -a
 Linux dl-vm-v05 5.15.0-79-generic #86-Ubuntu SMP Mon Jul 10 16:07:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Another thing I noticed on the VMs that suffer from this:

  1. Command completion takes much longer, i.e. if I type lxc exec followed by the Tab key, it takes much longer for the container names to appear than on other VMs.
  2. Opening a bash shell also takes much longer (a rough way to time this is sketched below).
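
A rough way to put numbers on that slowness (a sketch; the container name edlib is reused from the reproducer above, and these are generic commands rather than ones requested in the thread):

time lxc exec edlib -- true             # latency of setting up an exec session
time lxc exec edlib -- bash -lc true    # the same, but through a login shell
time lxc query /1.0/instances           # API round trip used by tab completion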

Just as extra information: I use a bridge that is not managed by LXD, and the datastore I use is ZFS. This is the default profile:

lxc profile show default
config: {}
description: Default LXD profile
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: br0
    type: nic
  root:
    path: /
    pool: zfspool
    type: disk
name: default
used_by:
- /1.0/instances/tplc01
- /1.0/instances/moodle
- /1.0/instances/mongo
- /1.0/instances/edlib

I’m going to try to recreate this using an LXD VM running the container described above.

How many CPU cores, how much memory and how much disk space does your VM have?

On my servers it affects VMs and containers alike… with no CPU/memory limits.

This particular VM has 4 cores and 16 GB RAM (and 4 GB swap, a silly number, but it’s there for historical reasons).
The root partition is 20 GB, the ZFS pool is 140 GB.
Root has 6 GB of available disk space, the zfspool 40 GB.

Most of my other VMs that sometimes show the same behaviour have 2 cores / 4 GB (but do not run docker inside the LXD containers).
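
Given those specs, the LXD VM mentioned above for recreating this could be launched roughly like this (a sketch; the name repro and the exact flag values are assumptions based on the numbers reported, not commands from the thread):

lxc launch ubuntu:20.04 repro --vm -c limits.cpu=4 -c limits.memory=16GiB
# then install LXD inside the VM and repeat the edlib container steps from earlier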

I’ve sent logs (on the old forum) several times while this condition was happening.

  • ps auxnf
  • ip a
  • ip r
  • sudo ss -ulpn

Is this what you’d like to see?