Lxc snapshot operations blocked by hanging forkfile process


Twice this week already, I have seen lxc snapshot and lxc rm container/snapshot hang indefinitely on a single container.

Investigating further, I found a forkfile process hanging off the lxd daemon:

1422549 ?        Sl   1064:02  \_ lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd
 123694 ?        Sl   2361:39      \_ /snap/lxd/current/bin/lxd forkfile -- 3 4 -1 3839408
# ls -al /proc/123694/fd
total 0
dr-x------ 2 root    root     0 Dec  4 14:38 .
dr-xr-xr-x 9 1000000 1000000  0 Dec  4 14:35 ..
lr-x------ 1 root    root    64 Dec  4 14:38 0 -> /dev/null
l-wx------ 1 root    root    64 Dec  4 14:38 1 -> /dev/null
lrwx------ 1 root    root    64 Dec  4 14:38 10 -> 'socket:[4089359702]'
lr-x------ 1 root    root    64 Dec  4 14:38 11 -> /backups/git-1701667504-16.4.1-ee.0_gitlab_backup.tar
lrwx------ 1 root    root    64 Dec  4 14:38 12 -> 'socket:[4089370174]'
lr-x------ 1 root    root    64 Dec  4 14:38 13 -> /backups/git-1701667504-16.4.1-ee.0_gitlab_backup.tar
lrwx------ 1 root    root    64 Dec  4 14:38 16 -> 'socket:[3661184747]'
l-wx------ 1 root    root    64 Dec  4 14:38 2 -> 'pipe:[4089355519]'
lr-x------ 1 root    root    64 Dec  4 14:38 4 -> /var/snap/lxd/common/lxd/storage-pools/ee/containers/git/rootfs
lr-x------ 1 root    root    64 Dec  4 14:38 5 -> /proc/3839408/ns
lrwx------ 1 root    root    64 Dec  4 14:38 7 -> 'anon_inode:[eventpoll]'
lr-x------ 1 root    root    64 Dec  4 14:38 8 -> 'pipe:[4089359701]'
l-wx------ 1 root    root    64 Dec  4 14:38 9 -> 'pipe:[4089359701]'

Once I killed the forkfile process, all the hanging snapshot operations completed (and there were quite a few, since I’m using lxd-snapper).

To me it looks like someone started copying a file from the server, the copy somehow hung indefinitely (the dates in /proc are from 16 hours earlier), and that blocked snapshot creation and deletion.

Am I interpreting this right? Shouldn’t forkfile have some kind of timeout? How do I debug this further? Shall I open an issue?
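The manual check above (find the forkfile child, then look at what its file descriptors point to) can be sketched as a small script. This is only an illustration: the function name `find_forkfile` and the `proc_root` parameter are made up for this example, it assumes a Linux /proc layout, and it needs to run as root to read the descriptors of a root-owned process.

```python
import os


def find_forkfile(proc_root="/proc"):
    """Map pid -> cmdline and path-backed fds for processes run as 'forkfile'.

    Hypothetical helper for illustration; not part of LXD.
    """
    found = {}
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return found  # no /proc here (non-Linux) or not readable
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as "self"
        pid_dir = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(pid_dir, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited in the meantime
        argv = [a.decode(errors="replace") for a in raw.split(b"\0") if a]
        if "forkfile" not in argv:
            continue
        fd_dir = os.path.join(pid_dir, "fd")
        try:
            fds = [f for f in os.listdir(fd_dir) if f.isdigit()]
        except OSError:
            fds = []  # usually a permissions problem
        files = []
        for fd in sorted(fds, key=int):
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            # Keep real paths; drop sockets, pipes, anon inodes and /dev/null.
            if target.startswith("/") and not target.startswith("/dev/"):
                files.append(target)
        found[int(entry)] = {"cmdline": argv, "files": files}
    return found


if __name__ == "__main__":
    for pid, info in find_forkfile().items():
        print(pid, " ".join(info["cmdline"]))
        for path in info["files"]:
            print("    " + path)
```

On the listing above this would print the forkfile PID followed by the backup tarball, the container rootfs, and the namespace directory, which is enough to tell which file transfer is holding the process open.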

Hi @Aleks, there is indeed a timeout for the lxd forkfile process https://github.com/canonical/lxd/blob/main/lxd/main_forkfile.go#L175, but it looks like the timeout applies to inactivity.
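To illustrate why an inactivity timeout may not help against a stalled client, here is a toy model. It assumes "inactivity" means "no open connections for some time", which is one plausible reading and not confirmed against the LXD source; the class `IdleTracker` and its methods are invented for this sketch (LXD itself is written in Go).

```python
class IdleTracker:
    """Toy model: the helper may exit only after `timeout` seconds with no open connections."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout
        self.open_connections = 0
        self.idle_since = now  # None while at least one connection is open

    def connect(self):
        self.open_connections += 1
        self.idle_since = None

    def disconnect(self, now):
        self.open_connections -= 1
        if self.open_connections == 0:
            self.idle_since = now  # the idle clock starts only now

    def should_exit(self, now):
        if self.idle_since is None:
            return False  # a connection is open, even if no bytes are flowing
        return now - self.idle_since >= self.timeout


tracker = IdleTracker(timeout=10)
tracker.connect()                        # client starts a pull, then vanishes
print(tracker.should_exit(16 * 3600))    # False: still "active" 16 hours later
tracker.disconnect(now=16 * 3600)        # the connection is finally torn down
print(tracker.should_exit(16 * 3600 + 10))  # True
```

Under this model a connection that is open but dead never counts as idle, so the timeout cannot fire until something else closes the connection.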

Which storage driver are you using? I have just tried copying a big blob from a container (on ZFS) and the deletion doesn’t block, it interrupts the lxc file pull command. What version of LXD are you using?

I’m using ZFS. The LXD snap is lxd 5.19-31ff7b6 26093 5.19/stable.

I tried the same: I pulled a big file over VPN and, while it was pulling, ran lxc snapshot on the same container. The snapshot was blocked for as long as the pull was running, like this:

~$ time lxc snapshot git

real    0m0.307s
user    0m0.111s
sys     0m0.025s

$ lxc file pull git/backups/git-1701753905-16.4.1-ee.0_gitlab_backup.tar /scratch/
Pulling /scratch/git-1701753905-16.4.1-ee.0_gitlab_backup.tar from backups/git-1701753905-16.4.1-ee.0_gitlab_backup.tar: 2.75GB (109.55MB/s)   

and in another shell

~$ time lxc snapshot git
real    1m14.336s
user    0m0.098s
sys     0m0.064s

Normally a snapshot takes about a second, but here it waited until the pull was over.

Update:

I talked to my team today, and what happened is that an admin started a pull and his laptop ran out of battery in the middle of it. So I guess the idle detection is not working as it should, or maybe I have some specific config that confuses it.

And indeed, I started a pull from my laptop, disconnected my VPN right after, and managed to reproduce the exact situation: the forkfile process hung forever (OK, half an hour :slight_smile:), long after the pull had timed out on the client side.

client:
~ > lxc file pull git/backups/git-1701753905-16.4.1-ee.0_gitlab_backup.tar Downloads/
Error: read tcp 192.168.13.27:59181->192.168.220.254:8443: read: operation timed out

server:

# xargs -0 echo < /proc/457438/cmdline; ls -al /proc/457438/cmdline ; date
/snap/lxd/current/bin/lxd forkfile -- 3 4 -1 3839408
-r--r--r-- 1 root root 0 Dec  5 11:48 /proc/457438/cmdline
Tue Dec  5 12:13:26 CET 2023

Please can you confirm whether a lxc stop --force <instance> or lxc delete --force <instance> works in these scenarios where a stalled remote client is holding forkfile open for extended periods of time?

I see, so a stalled remote client can prevent snapshots from taking place.

We can probably add TCP timeouts to help detect a dead client.
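As a rough illustration of the idea (not the change that went into LXD, and in Python rather than LXD's Go), a sender can put a deadline on each write so that a peer which has stopped reading is detected instead of blocking forever. The helper name `send_with_deadline` is hypothetical.

```python
import socket


def send_with_deadline(sock, data, timeout):
    """Send all of `data`, or give up after `timeout` seconds.

    Returns True if everything was handed to the kernel, False if the peer
    stopped draining the connection and the deadline expired.
    """
    sock.settimeout(timeout)  # applies to the sendall() below
    try:
        sock.sendall(data)
        return True
    except socket.timeout:
        return False


if __name__ == "__main__":
    # A local socket pair stands in for the server/client connection.
    server, client = socket.socketpair()
    # The "client" never reads, so the buffers fill up and the send times out.
    ok = send_with_deadline(server, b"x" * (16 << 20), timeout=0.5)
    print("transfer completed:", ok)
    server.close()
    client.close()
```

When the helper returns False the caller can close the connection and release whatever the transfer was holding, rather than waiting for the kernel's own retransmission limits to give up.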

Please can you log an issue over at https://github.com/canonical/lxd/issues

Thanks!

Opened https://github.com/canonical/lxd/issues/12611

I can confirm that stop and rm also stall, but stop -f and rm -f both work.


I tried with rm -f, that explains it :slight_smile:

I opened a draft PR at https://github.com/canonical/lxd/pull/12702