Hi, I use LXD to build an instance cluster on a bare-metal server (96 cores, 1.5 TB of memory).
The memory limit of each instance is between 256 GB and 512 GB.
Sometimes an instance hangs. I can still get its info with `lxc info`, but I can't run `lxc exec` on it, and I can't restart or stop it.
Sometimes the LXD command itself hangs, with the message "Failed to retrieve network information via netlink". When I run `lxc stop -f shpc-564-instance-F1cFWdqG --debug`, it hangs, but I can stop other instances on the same LXD node.
I found error messages like these in the LXD log:
time="2023-07-04T14:08:46Z" level=warning msg="Failed to retrieve network information via netlink" instance=shpc-564-instance-F1cFWdqG instanceType=container pid=2673237 project=default
time="2023-07-04T14:08:46Z" level=error msg="Error calling 'lxd forknet" err="Failed to run: /snap/lxd/current/bin/lxd forknet info -- 2673237 3: exit status 1 (Failed setns to container network namespace: No such file or directory)" instance=shpc-564-instance-F1cFWdqG instanceType=container pid=2673237 project=default
When I run `lxd info shpc-564-instance-F1cFWdqG`, I cannot get the network info either.
My LXD version is 5.15, and the kernel version is Linux 5.15.0-71-generic x86_64.
It seems that I cannot control an instance when its PID is -1: I cannot stop or restart such instances.
I ran `sudo systemctl restart snap.lxd.daemon` to restart the LXD daemon, but I still could not stop the instance and got the same error: Failed to run: /snap/lxd/current/bin/lxd forknet. After I reboot the host, I can control the instance again, but that's not a good solution because I have many instances on one host, and rebooting the host means stopping all of them.
Any suggestions about this issue? Thank you.
Could you please provide more logs on this issue, and enable debug logging? A running instance shouldn’t have PID -1 so that’s odd.
Hi, thank you for the answer.
I will post more logs if I hit this problem again.
Today I ran into another problem: I cannot control a container whose PID is not -1. I cannot stop or restart it, just as when the PID is -1.
The output of `lxc info --show-log xxx` is:
lxc shpc-1774-instance-bcODVe3f 20230802154313.582 WARN conf - ../src/src/lxc/conf.c:lxc_map_ids:3627 - newgidmap binary is missing
lxc shpc-1774-instance-bcODVe3f 20230802154313.583 ERROR attach - ../src/src/lxc/attach.c:lxc_attach:1611 - Cannot allocate memory - Failed to clone attached process
lxc shpc-1774-instance-bcODVe3f 20230802154413.598 WARN conf - ../src/src/lxc/conf.c:lxc_map_ids:3621 - newuidmap binary is missing
lxc shpc-1774-instance-bcODVe3f 20230802154413.598 WARN conf - ../src/src/lxc/conf.c:lxc_map_ids:3627 - newgidmap binary is missing
lxc shpc-1774-instance-bcODVe3f 20230802154413.599 ERROR attach - ../src/src/lxc/attach.c:lxc_attach:1611 - Cannot allocate memory - Failed to clone attached process
lxc shpc-1774-instance-bcODVe3f 20230802154513.615 WARN conf - ../src/src/lxc/conf.c:lxc_map_ids:3621 - newuidmap binary is missing
lxc shpc-1774-instance-bcODVe3f 20230802154513.615 WARN conf - ../src/src/lxc/conf.c:lxc_map_ids:3627 - newgidmap binary is missing
lxc shpc-1774-instance-bcODVe3f 20230802154513.617 ERROR attach - ../src/src/lxc/attach.c:lxc_attach:1611 - Cannot allocate memory - Failed to clone attached process
The LXD debug log shows:
time="2023-08-02T15:57:06Z" level=warning msg="Failed to retrieve network information via netlink" instance=shpc-1774-instance-bcODVe3f instanceType=container pid=138824 project=default
time="2023-08-02T15:57:06Z" level=error msg="Error calling 'lxd forknet" err="Failed to run: /snap/lxd/current/bin/lxd forknet info -- 138824 3: exit status 1 (Failed setns to container network namespace: No such file or directory)" instance=shpc-1774-instance-bcODVe3f instanceType=container pid=138824 project=default
time="2023-08-02T15:57:14Z" level=warning msg="Rejecting heartbeat request as shutting down"
time="2023-08-02T15:57:21Z" level=warning msg="Failed to retrieve network information via netlink" instance=shpc-1774-instance-bcODVe3f instanceType=container pid=138824 project=default
time="2023-08-02T15:57:21Z" level=error msg="Error calling 'lxd forknet" err="Failed to run: /snap/lxd/current/bin/lxd forknet info -- 138824 3: exit status 1 (Failed setns to container network namespace: No such file or directory)" instance=shpc-1774-instance-bcODVe3f instanceType=container pid=138824 project=default
time="2023-08-02T15:57:26Z" level=warning msg="Rejecting heartbeat request as shutting down"
time="2023-08-02T15:57:36Z" level=warning msg="Failed to retrieve network information via netlink" instance=shpc-1774-instance-bcODVe3f instanceType=container pid=138824 project=default
time="2023-08-02T15:57:36Z" level=error msg="Error calling 'lxd forknet" err="Failed to run: /snap/lxd/current/bin/lxd forknet info -- 138824 3: exit status 1 (Failed setns to container network namespace: No such file or directory)" instance=shpc-1774-instance-bcODVe3f instanceType=container pid=138824 project=default
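To see at a glance which instance and PID the repeated forknet errors refer to, the debug log can be summarised with a small script. This is only a rough sketch: it assumes the log lines are `key=value` pairs with optional double quotes, as in the excerpt above, and the helper names `parse_log_line` and `setns_failures` are made up for illustration. Escaped quotes inside values are left as written.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# key=value pairs where the value is either double-quoted or a bare token.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_log_line(line: str) -> Dict[str, str]:
    """Split one log line into a dict of fields, dropping surrounding quotes."""
    fields: Dict[str, str] = {}
    for key, value in _PAIR.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def setns_failures(lines: Iterable[str]) -> Counter:
    """Count error lines mentioning 'Failed setns', keyed by (instance, pid)."""
    counts: Counter = Counter()
    for line in lines:
        fields = parse_log_line(line)
        if fields.get("level") == "error" and "Failed setns" in fields.get("err", ""):
            key = (fields.get("instance", "?"), fields.get("pid", "?"))
            counts[key] += 1
    return counts
```

Passing the lines of the log file to `setns_failures` gives one counter entry per (instance, PID) pair, which makes it easier to tell whether a single container or several are affected.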
I have a hard time reproducing this.
You say you cannot control the instance anymore. Is the instance still running, i.e. is the PID still there? The error `Failed setns to container network namespace: No such file or directory` suggests that the PID is not available and the container therefore is not running anymore.
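One way to answer that question on the host is to look at `/proc` for the PID that LXD reports. The sketch below is only an illustration: `inspect_pid` and `PidState` are hypothetical names, and it assumes (not confirmed in this thread) that the forknet helper reaches the network namespace through `/proc/<pid>/ns/net`. It is Linux-only and may need root to read another process's namespace link.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class PidState:
    exists: bool           # True if /proc/<pid> is present
    state: Optional[str]   # "State:" value from /proc/<pid>/status, e.g. "Z (zombie)"
    netns: Optional[str]   # target of /proc/<pid>/ns/net, None if unreadable


def inspect_pid(pid: int, proc_root: str = "/proc") -> PidState:
    """Report whether a PID exists, its scheduler state, and its netns link."""
    pid_dir = os.path.join(proc_root, str(pid))
    if not os.path.isdir(pid_dir):
        return PidState(exists=False, state=None, netns=None)

    state = None
    try:
        with open(os.path.join(pid_dir, "status")) as fh:
            for line in fh:
                if line.startswith("State:"):
                    state = line.split(":", 1)[1].strip()
                    break
    except OSError:
        pass  # the process may have exited between the two checks

    try:
        netns = os.readlink(os.path.join(pid_dir, "ns", "net"))
    except OSError:
        netns = None  # link missing or not readable
    return PidState(exists=True, state=state, netns=netns)
```

For the container above, `inspect_pid(138824)` would show whether the process is gone (`exists=False`), or still listed but without a readable network namespace link (`netns=None`), and in what state. As a side note, pid_namespaces(7) documents that creating a process in a PID namespace whose init has terminated fails with ENOMEM, which may be related to the "Cannot allocate memory - Failed to clone attached process" lines, though that is a guess rather than a diagnosis.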