When I logged into my server today to upgrade software in an LXD container, I noticed that, while the containers were running fine, I couldn’t interact with `lxc` anymore. Commands like `lxc list` or `lxc snapshot` return:

```
Error: LXD unix socket not accessible: Get "http://unix.socket/1.0": EOF
```
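Judging by the error message, the `lxc` client is essentially doing that GET over the unix socket, so the API can also be probed directly with curl (socket path assuming the default snap install):

```
# query the LXD REST API directly over the unix socket (default snap path)
sudo curl --unix-socket /var/snap/lxd/common/lxd/unix.socket http://lxd/1.0
```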
Given that the machine had some updates waiting, I rebooted it, hoping this would also solve the issue, but it didn’t.
It looks like something is wrong with the database. When I run `snap start lxd` and `tail -f /var/snap/lxd/common/lxd/logs/lxd.log`, I get the following:

```
time="2024-12-13T17:07:17+01:00" level=warning msg=" - Couldn't find the CGroup network priority controller, per-instance network priority will be ignored. Please use per-device limits.priority instead"
time="2024-12-13T17:07:19+01:00" level=error msg="Invalid configuration key: Unknown key" key=storage.lvm_fstype
time="2024-12-13T17:07:19+01:00" level=error msg="Invalid configuration key: Unknown key" key=storage.lvm_mount_options
time="2024-12-13T17:07:19+01:00" level=error msg="Invalid configuration key: Unknown key" key=storage.lvm_thinpool_name
time="2024-12-13T17:07:19+01:00" level=error msg="Invalid configuration key: Unknown key" key=storage.lvm_volume_size
time="2024-12-13T17:07:29+01:00" level=warning msg="Failed to rollback transaction after error (Failed to fetch from \"instance_snapshot_config\" table: Failed to fetch from \"instance_snapshot_config\" table: context deadline exceeded): sql: transaction has already been committed or rolled back"
time="2024-12-13T17:07:29+01:00" level=warning msg="Transaction timed out. Retrying once" err="Failed to fetch from \"instance_snapshot_config\" table: Failed to fetch from \"instance_snapshot_config\" table: context deadline exceeded" member=1
time="2024-12-13T17:07:29+01:00" level=warning msg="Dqlite proxy failed" err="first: remote -> local: local error: tls: bad record MAC second: local -> remote: local error: tls: bad record MAC" local="192.168.10.20:8443" name=dqlite remote="192.168.10.20:57876"
time="2024-12-13T17:07:39+01:00" level=warning msg="Failed to rollback transaction after error (Failed to fetch from \"instance_snapshot_config\" table: Failed to fetch from \"instance_snapshot_config\" table: context deadline exceeded): sql: transaction has already been committed or rolled back"
time="2024-12-13T17:07:39+01:00" level=error msg="Failed to start the daemon" err="Failed applying patch \"instance_remove_volatile_last_state_ip_addresses\": Failed removing volatile.*.last_state.ip_addresses config keys: Failed to fetch from \"instance_snapshot_config\" table: Failed to fetch from \"instance_snapshot_config\" table: context deadline exceeded"
time="2024-12-13T17:07:39+01:00" level=warning msg="Could not handover member's responsibilities" err="Failed to transfer leadership: No online voter found"
time="2024-12-13T17:07:39+01:00" level=warning msg="Failed to get current cluster members" err="Failed to begin transaction: failed to create dqlite connection: no available dqlite leader server found"
time="2024-12-13T17:07:39+01:00" level=warning msg="Loading local instances from disk as database is not available" err="Failed to begin transaction: failed to create dqlite connection: no available dqlite leader server found"
time="2024-12-13T17:07:39+01:00" level=warning msg="Failed loading instance" backup_file=/var/snap/lxd/common/lxd/containers/lxc-monitord.log/backup.yaml err="Failed parsing instance backup file from \"/var/snap/lxd/common/lxd/containers/lxc-monitord.log/backup.yaml\": open /var/snap/lxd/common/lxd/containers/lxc-monitord.log/backup.yaml: not a directory" instance=lxc-monitord.log project=default
time="2024-12-13T17:07:39+01:00" level=warning msg="Failed getting raft nodes" err="Failed to begin transaction: sql: database is closed"
time="2024-12-13T17:07:39+01:00" level=warning msg="Failed to get current cluster members" err="Failed to begin transaction: sql: database is closed"
time="2024-12-13T17:07:39+01:00" level=warning msg="Dqlite last entry" index=0 term=0
tail: /var/snap/lxd/common/lxd/logs/lxd.log: file truncated
```

after which the messages repeat until I run `snap stop lxd`. I guess it’s constantly retrying to start LXD.
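The restart loop should also show up in the systemd journal for the snap daemon (assuming the usual unit name for the LXD snap):

```
# follow the LXD snap daemon's journal to watch the restart loop
sudo journalctl -u snap.lxd.daemon -f
```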
Looking at other forum posts containing `context deadline exceeded`, I see server load being mentioned as a possible cause. However, this machine does nothing else at the moment. It’s supposed to run containers, but those won’t start anymore.
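For what it’s worth, load can be ruled out with a quick check like this (the pool name is a placeholder):

```
# check system load and ZFS pool I/O; replace "tank" with the actual pool name
uptime
zpool iostat tank 5 3
```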
I’m wondering if this has something to do with the number of LXD snapshots I’ve got (searching for the socket error led me to this question). Looking at the number of ZFS snapshots, there should be about 2112 snapshots across 13 containers. Not sure if that’s too many snapshots?
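For reference, the count comes from something along these lines:

```
# total number of ZFS snapshots on this machine
zfs list -H -t snapshot -o name | wc -l

# break the count down per dataset (one dataset per container)
zfs list -H -t snapshot -o name | cut -d@ -f1 | sort | uniq -c | sort -rn
```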
Based on the ZFS snapshot names, the last snapshots were taken on 2024-12-09, which seems to coincide with the release of LXD 6.2. I guess something went wrong during the upgrade of the snap.
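The dates can be cross-checked against the `creation` property instead of relying on the snapshot names:

```
# list ZFS snapshots sorted by creation time, newest last
zfs list -H -t snapshot -o name,creation -s creation | tail -n 5
```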
Looking more closely at the database, I see that (at a quick glance) most of the 8946 segment files have a single entry, like this: `0000000020049545-0000000020049545`. According to this message, that is not a good thing.
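If I read the segment file names as a start and end index (my interpretation, so take it with a grain of salt), the single-entry segments can be counted like this:

```
cd /var/snap/lxd/common/lxd/database/global

# total number of closed segment files
ls | grep -Ec '^[0-9]{16}-[0-9]{16}$'

# segments whose start and end index are equal, i.e. containing a single entry
ls | grep -E '^[0-9]{16}-[0-9]{16}$' | awk -F- '$1 == $2' | wc -l
```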
Aside from the segment files, I have the following files in `/var/snap/lxd/common/lxd/database/global`:

```
-rw------- 1 root root 7376896 dec 13 17:08 db.bin
-rw------- 1 root root 3876952 dec 13 17:08 db.bin-wal
-rw------- 1 root root 0 jul 15 14:04 dqlite-lock
-rw------- 1 root root 32 jul 12 2023 metadata1
-rw------- 1 root root 32 jul 12 2023 metadata2
-rw------- 1 root root 2479838 dec 13 02:46 snapshot-4730-20047527-2200903410
-rw------- 1 root root 72 dec 13 02:46 snapshot-4730-20047527-2200903410.meta
-rw------- 1 root root 2479712 dec 13 09:36 snapshot-4730-20048551-2225539002
-rw------- 1 root root 72 dec 13 09:36 snapshot-4730-20048551-2225539002.meta
```

Any suggestions?
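In the meantime, before attempting any recovery, it seems wise to keep a copy of the whole database directory while LXD is stopped, along these lines:

```
sudo snap stop lxd
# keep a dated copy of the dqlite files so any experiments can be rolled back
sudo cp -a /var/snap/lxd/common/lxd/database /root/lxd-database-backup-$(date +%F)
```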