Can LXD support huge instances?

Hi, I use LXD to build an instance cluster on a bare-metal server (96 cores, 1.5 TB of memory).
The memory limit of each instance is between 256 GB and 512 GB.
Sometimes an instance hangs. I can still get its details with lxc info, but I cannot run lxc exec against it, and I cannot restart or stop it.

Sometimes an LXD command hangs and the log shows "Failed to retrieve network information via netlink". When I run lxc stop -f shpc-564-instance-F1cFWdqG --debug, it hangs, but I can stop other instances on the same LXD node.
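
Since lxc stop -f can block indefinitely on a stuck instance, a script that manages many instances may want to bound each call. The following is a minimal sketch, assuming Python is available on the host; run_with_timeout and stop_instance are hypothetical helper names, and the 60-second default is an arbitrary choice. It only stops the client from waiting and does nothing to fix the hung instance.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return (returncode, stdout, stderr), or None if it times out."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        return None
    return done.returncode, done.stdout, done.stderr


def stop_instance(name, timeout_s=60):
    """Force-stop one instance; True on success, False on failure or timeout."""
    result = run_with_timeout(["lxc", "stop", "-f", name], timeout_s)
    return result is not None and result[0] == 0
```

Calling stop_instance for each instance in turn lets a script record which ones did not stop in time and continue with the rest.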


I found error messages like these in the LXD log:

time="2023-07-04T14:08:46Z" level=warning msg="Failed to retrieve network information via netlink" instance=shpc-564-instance-F1cFWdqG instanceType=container pid=2673237 project=default
time="2023-07-04T14:08:46Z" level=error msg="Error calling 'lxd forknet" err="Failed to run: /snap/lxd/current/bin/lxd forknet info -- 2673237 3: exit status 1 (Failed setns to container network namespace: No such file or directory)" instance=shpc-564-instance-F1cFWdqG instanceType=container pid=2673237 project=default

When I run lxc info shpc-564-instance-F1cFWdqG, I cannot get the network info either.
My LXD version is 5.15 and the kernel is Linux 5.15.0-71-generic x86_64.

It seems that I cannot control an instance when its PID is -1. I cannot stop or restart these instances.

I ran sudo systemctl restart snap.lxd.daemon to restart the LXD daemon, but I still could not stop the instance and got the same error: Failed to run: /snap/lxd/current/bin/lxd forknet. After rebooting the host I can control the instance again, but that is not a good solution because I have many instances on one host, and rebooting the host means stopping all of them.

Any suggestions about this issue? Thank you.

Could you please provide more logs on this issue, and enable debug logging? A running instance shouldn’t have PID -1 so that’s odd.

Hi, thank you for the answer.
I will post more logs if I hit this problem again.

Today I hit a different problem: I cannot control a container whose PID is not -1. I cannot stop or restart it, just as when the PID is -1.

The output of lxc info --show-log xxx is:

lxc shpc-1774-instance-bcODVe3f 20230802154313.582 WARN     conf - ../src/src/lxc/conf.c:lxc_map_ids:3627 - newgidmap binary is missing
lxc shpc-1774-instance-bcODVe3f 20230802154313.583 ERROR    attach - ../src/src/lxc/attach.c:lxc_attach:1611 - Cannot allocate memory - Failed to clone attached process
lxc shpc-1774-instance-bcODVe3f 20230802154413.598 WARN     conf - ../src/src/lxc/conf.c:lxc_map_ids:3621 - newuidmap binary is missing
lxc shpc-1774-instance-bcODVe3f 20230802154413.598 WARN     conf - ../src/src/lxc/conf.c:lxc_map_ids:3627 - newgidmap binary is missing
lxc shpc-1774-instance-bcODVe3f 20230802154413.599 ERROR    attach - ../src/src/lxc/attach.c:lxc_attach:1611 - Cannot allocate memory - Failed to clone attached process
lxc shpc-1774-instance-bcODVe3f 20230802154513.615 WARN     conf - ../src/src/lxc/conf.c:lxc_map_ids:3621 - newuidmap binary is missing
lxc shpc-1774-instance-bcODVe3f 20230802154513.615 WARN     conf - ../src/src/lxc/conf.c:lxc_map_ids:3627 - newgidmap binary is missing
lxc shpc-1774-instance-bcODVe3f 20230802154513.617 ERROR    attach - ../src/src/lxc/attach.c:lxc_attach:1611 - Cannot allocate memory - Failed to clone attached process

The LXD debug log shows:

time="2023-08-02T15:57:06Z" level=warning msg="Failed to retrieve network information via netlink" instance=shpc-1774-instance-bcODVe3f instanceType=container pid=138824 project=default
time="2023-08-02T15:57:06Z" level=error msg="Error calling 'lxd forknet" err="Failed to run: /snap/lxd/current/bin/lxd forknet info -- 138824 3: exit status 1 (Failed setns to container network namespace: No such file or directory)" instance=shpc-1774-instance-bcODVe3f instanceType=container pid=138824 project=default
time="2023-08-02T15:57:14Z" level=warning msg="Rejecting heartbeat request as shutting down"
time="2023-08-02T15:57:21Z" level=warning msg="Failed to retrieve network information via netlink" instance=shpc-1774-instance-bcODVe3f instanceType=container pid=138824 project=default
time="2023-08-02T15:57:21Z" level=error msg="Error calling 'lxd forknet" err="Failed to run: /snap/lxd/current/bin/lxd forknet info -- 138824 3: exit status 1 (Failed setns to container network namespace: No such file or directory)" instance=shpc-1774-instance-bcODVe3f instanceType=container pid=138824 project=default
time="2023-08-02T15:57:26Z" level=warning msg="Rejecting heartbeat request as shutting down"
time="2023-08-02T15:57:36Z" level=warning msg="Failed to retrieve network information via netlink" instance=shpc-1774-instance-bcODVe3f instanceType=container pid=138824 project=default
time="2023-08-02T15:57:36Z" level=error msg="Error calling 'lxd forknet" err="Failed to run: /snap/lxd/current/bin/lxd forknet info -- 138824 3: exit status 1 (Failed setns to container network namespace: No such file or directory)" instance=shpc-1774-instance-bcODVe3f instanceType=container pid=138824 project=default
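
When many instances share a host, it can help to pull the affected instance names and PIDs out of the daemon log instead of reading it by eye. Below is a rough sketch in Python, assuming the log keeps the key=value layout shown above; parse_log_line and failing_instances are hypothetical names, and the regular expression is only meant to cover lines shaped like these samples.

```python
import re

# key=value pairs where the value is a double-quoted string or a bare token.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_log_line(line):
    """Parse one key=value log line into a dict, stripping the outer quotes."""
    fields = {}
    for key, value in _PAIR.findall(line):
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        fields[key] = value
    return fields


def failing_instances(lines):
    """Map instance name -> set of PIDs on lines whose err mentions a failed setns."""
    hits = {}
    for line in lines:
        fields = parse_log_line(line)
        if "Failed setns" in fields.get("err", ""):
            name = fields.get("instance", "?")
            hits.setdefault(name, set()).add(fields.get("pid", "?"))
    return hits
```

Passing the lines of the log file to failing_instances returns one entry per affected instance, which can then be checked one by one.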

I have a hard time reproducing this.

You say you cannot control the instance anymore. Is the instance still running, i.e. is the PID still there? The error "Failed setns to container network namespace: No such file or directory" suggests that the PID is no longer available and that the container is therefore no longer running.
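
One way to answer the question about the PID is to look at /proc on the host. This is a minimal sketch, assuming a Linux host and root privileges; netns_state and its proc_root parameter are hypothetical, and the three return labels are made up for illustration. It only reports whether the process directory and its network namespace entry are reachable, not why they might be missing.

```python
import os


def netns_state(pid, proc_root="/proc"):
    """Classify a PID as 'no-process', 'no-netns' or 'ok'."""
    proc_dir = os.path.join(proc_root, str(pid))
    if not os.path.isdir(proc_dir):
        return "no-process"
    # os.path.exists returns False both when the entry is missing and when
    # it cannot be accessed, so run this as root to avoid false negatives.
    if not os.path.exists(os.path.join(proc_dir, "ns", "net")):
        return "no-netns"
    return "ok"
```

For example, netns_state(138824) with the PID from the log above would show whether that process still exists on the host.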