LXD can't restart: probably a Dqlite database problem, maybe related to snapshots (LXD 6.2)

When I logged into my server today to upgrade software in an LXD container, I noticed that, while the containers were running fine, I couldn’t interact with lxc anymore. Commands like lxc list or lxc snapshot return

Error: LXD unix socket not accessible: Get "http://unix.socket/1.0": EOF

Given that the machine had some updates waiting, I rebooted it, hoping this would also solve the issue, but it didn’t.

It looks like something is wrong with the database. When I run snap start lxd and do a tail -f /var/snap/lxd/common/lxd/logs/lxd.log, I get the following:

time="2024-12-13T17:07:17+01:00" level=warning msg=" - Couldn't find the CGroup network priority controller, per-instance network priority will be ignored. Please use per-device limits.priority instead"
time="2024-12-13T17:07:19+01:00" level=error msg="Invalid configuration key: Unknown key" key=storage.lvm_fstype
time="2024-12-13T17:07:19+01:00" level=error msg="Invalid configuration key: Unknown key" key=storage.lvm_mount_options
time="2024-12-13T17:07:19+01:00" level=error msg="Invalid configuration key: Unknown key" key=storage.lvm_thinpool_name
time="2024-12-13T17:07:19+01:00" level=error msg="Invalid configuration key: Unknown key" key=storage.lvm_volume_size
time="2024-12-13T17:07:29+01:00" level=warning msg="Failed to rollback transaction after error (Failed to fetch from \"instance_snapshot_config\" table: Failed to fetch from \"instance_snapshot_config\" table: context deadline exceeded): sql: transaction has already been committed or rolled back"
time="2024-12-13T17:07:29+01:00" level=warning msg="Transaction timed out. Retrying once" err="Failed to fetch from \"instance_snapshot_config\" table: Failed to fetch from \"instance_snapshot_config\" table: context deadline exceeded" member=1
time="2024-12-13T17:07:29+01:00" level=warning msg="Dqlite proxy failed" err="first: remote -> local: local error: tls: bad record MAC second: local -> remote: local error: tls: bad record MAC" local="192.168.10.20:8443" name=dqlite remote="192.168.10.20:57876"
time="2024-12-13T17:07:39+01:00" level=warning msg="Failed to rollback transaction after error (Failed to fetch from \"instance_snapshot_config\" table: Failed to fetch from \"instance_snapshot_config\" table: context deadline exceeded): sql: transaction has already been committed or rolled back"
time="2024-12-13T17:07:39+01:00" level=error msg="Failed to start the daemon" err="Failed applying patch \"instance_remove_volatile_last_state_ip_addresses\": Failed removing volatile.*.last_state.ip_addresses config keys: Failed to fetch from \"instance_snapshot_config\" table: Failed to fetch from \"instance_snapshot_config\" table: context deadline exceeded"
time="2024-12-13T17:07:39+01:00" level=warning msg="Could not handover member's responsibilities" err="Failed to transfer leadership: No online voter found"
time="2024-12-13T17:07:39+01:00" level=warning msg="Failed to get current cluster members" err="Failed to begin transaction: failed to create dqlite connection: no available dqlite leader server found"
time="2024-12-13T17:07:39+01:00" level=warning msg="Loading local instances from disk as database is not available" err="Failed to begin transaction: failed to create dqlite connection: no available dqlite leader server found"
time="2024-12-13T17:07:39+01:00" level=warning msg="Failed loading instance" backup_file=/var/snap/lxd/common/lxd/containers/lxc-monitord.log/backup.yaml err="Failed parsing instance backup file from \"/var/snap/lxd/common/lxd/containers/lxc-monitord.log/backup.yaml\": open /var/snap/lxd/common/lxd/containers/lxc-monitord.log/backup.yaml: not a directory" instance=lxc-monitord.log project=default
time="2024-12-13T17:07:39+01:00" level=warning msg="Failed getting raft nodes" err="Failed to begin transaction: sql: database is closed"
time="2024-12-13T17:07:39+01:00" level=warning msg="Failed to get current cluster members" err="Failed to begin transaction: sql: database is closed"
time="2024-12-13T17:07:39+01:00" level=warning msg="Dqlite last entry" index=0 term=0
tail: /var/snap/lxd/common/lxd/logs/lxd.log: file truncated

after which the messages are repeated until I run snap stop lxd. I guess it’s constantly retrying to start LXD.

Looking at other forum posts containing context deadline exceeded I see server load being mentioned. However, this machine does nothing else at the moment. It’s supposed to run containers, but those won’t start anymore.

I’m wondering if this has something to do with the number of LXD snapshots I’ve got (because the because searching for the socket error led me to this question). Looking at the number of ZFS snapshots, there should be about 2112 snapshots across 13 containers. Not sure if that’s too many snapshots?

Based on the ZFS snapshot names, the last snapshots were taken on 2024-12-09, which seems to coincide with the release of LXD 6.2. I guess something went wrong during the upgrade of the snap.

Looking more closely at the database, I see that (at a quick glance) most of the 8946 segment files have a single entry, like this: 0000000020049545-0000000020049545. According to this message that is not a good thing.

Aside from the segment files, I have the following files in /var/snap/lxd/common/lxd/database/global:

-rw------- 1 root root 7376896 dec 13 17:08 db.bin
-rw------- 1 root root 3876952 dec 13 17:08 db.bin-wal
-rw------- 1 root root       0 jul 15 14:04 dqlite-lock
-rw------- 1 root root      32 jul 12  2023 metadata1
-rw------- 1 root root      32 jul 12  2023 metadata2
-rw------- 1 root root 2479838 dec 13 02:46 snapshot-4730-20047527-2200903410
-rw------- 1 root root      72 dec 13 02:46 snapshot-4730-20047527-2200903410.meta
-rw------- 1 root root 2479712 dec 13 09:36 snapshot-4730-20048551-2225539002
-rw------- 1 root root      72 dec 13 09:36 snapshot-4730-20048551-2225539002.meta

Any suggestions?

Continuing trying to fix this…

When I run lxd --debug --group lxd I see:

INFO   [2024-12-15T16:24:06+01:00] Initializing global database                                                                                                                                                                                          
INFO   [2024-12-15T16:24:06+01:00] Connecting to global database                                                                                                                                                                                         
INFO   [2024-12-15T16:24:06+01:00] Dqlite connected outbound                     local="192.168.10.20:47326" name=dqlite remote="192.168.10.20:8443"                                                                                                                                                                                                                                                                                                                                                              
DEBUG  [2024-12-15T16:24:06+01:00] Matched trusted cert                          fingerprint=d3b32001958236b0c81fe5ae606ff9b5fb0a04c7106f4efc248b0732b36a70ae subject="CN=root@jerom2,O=linuxcontainers.org"
INFO   [2024-12-15T16:24:06+01:00] Dqlite proxy started                          local="192.168.10.20:8443" name=dqlite remote="192.168.10.20:47326"
DEBUG  [2024-12-15T16:24:06+01:00] Dqlite: attempt 1: server 192.168.10.20:8443: connected on fallback path                                                                                                                                              
INFO   [2024-12-15T16:24:06+01:00] Connected to global database                                                                                                                                                                                          
INFO   [2024-12-15T16:24:06+01:00] Initialized global database                                                                                       

The IP address 192.168.10.20 is correct (it’s the IP address of the bridge I created for LXD). I’m not sure why it says connected on fallback path, but at least it can connect to the global database.

A bit further down, it seems that the trouble starts:

DEBUG  [2024-12-15T16:24:06+01:00] Checking for patches                          stage=3
INFO   [2024-12-15T16:24:06+01:00] Applying patch                                name=` stage=3
DEBUG  [2024-12-15T16:24:16+01:00] Dqlite: network connection lost: write tcp 192.168.10.20:47326->192.168.10.20:8443: i/o timeout 
DEBUG  [2024-12-15T16:24:16+01:00] Database error                                err="sql: transaction has already been committed or rolled back"
WARNING[2024-12-15T16:24:16+01:00] Failed to rollback transaction after error (Failed to fetch from "instance_snapshot_config" table: Failed to fetch from "instance_snapshot_config" table: context deadline exceeded): sql: transaction has already been committed or rolled back 
WARNING[2024-12-15T16:24:16+01:00] Transaction timed out. Retrying once          err="Failed to fetch from \"instance_snapshot_config\" table: Failed to fetch from \"instance_snapshot_config\" table: context deadline exceeded" member=1
DEBUG  [2024-12-15T16:24:16+01:00] Dqlite: network connection lost: write tcp 192.168.10.20:47326->192.168.10.20:8443: i/o timeout 

Applying the instance_remove_volatile_last_state_ip_addresses patch (whatever that is) apparently fails. And then the connection to the database is lost. I’m not sure what is cause and what is consequence here. Is the fact that the patch can’t be applied the cause if a database crash, or is the fact that the database is somehow dead the reason why the patch can’t be applied? :confused:

If the timeout is a problem (from the timestamps it looks like it is set to 10 seconds), is it possible to increase it somewhere?

After a whole bunch of those instance_remove_volatile_last_state_ip_addresses messages, all with the same timestamp, I get the following:

WARNING[2024-12-15T16:24:16+01:00] Dqlite proxy failed                           err="first: remote -> local: local error: tls: bad record MAC" local="192.168.10.20:8443" name=dqlite remote="192.168.10.20:47326"                                                                                                                                                                                                                                                                                               
INFO   [2024-12-15T16:24:16+01:00] Dqlite proxy stopped                          local="192.168.10.20:8443" name=dqlite remote="192.168.10.20:47326"                                                                                                                                                                                                                                                                                                                                                              
INFO   [2024-12-15T16:24:16+01:00] Dqlite connected outbound                     local="192.168.10.20:49312" name=dqlite remote="192.168.10.20:8443"                                                                                                                                                                                                                                                                                                                                                              
DEBUG  [2024-12-15T16:24:16+01:00] Matched trusted cert                          fingerprint=d3b32001958236b0c81fe5ae606ff9b5fb0a04c7106f4efc248b0732b36a70ae subject="CN=root@jerom2,O=linuxcontainers.org"                                                                                                                                                                                                                                                                                                      
INFO   [2024-12-15T16:24:16+01:00] Dqlite proxy started                          local="192.168.10.20:8443" name=dqlite remote="192.168.10.20:49312"                                                                                                                                                                                                                                                                                                                                                              
DEBUG  [2024-12-15T16:24:16+01:00] Dqlite: attempt 1: server 192.168.10.20:8443: connected on fast path                                                                                                                                                                                                                                                                                                                                                                                                           
DEBUG  [2024-12-15T16:24:26+01:00] Dqlite: network connection lost: write tcp 192.168.10.20:49312->192.168.10.20:8443: i/o timeout                                                                                                                                                                                                                                                                                                                                                                                
DEBUG  [2024-12-15T16:24:26+01:00] Database error                                err="sql: transaction has already been committed or rolled back"                                                                                                                                                                                                                                                                                                                                                                 
WARNING[2024-12-15T16:24:26+01:00] Failed to rollback transaction after error (Failed to fetch from "instance_snapshot_config" table: Failed to fetch from "instance_snapshot_config" table: context deadline exceeded): sql: transaction has already been committed or rolled back                                                                                                                                                                                                                               
DEBUG  [2024-12-15T16:24:26+01:00] Dqlite: network connection lost: write tcp 192.168.10.20:49312->192.168.10.20:8443: i/o timeout                                                                                                                                                                                                                                                                                                                                                                                
DEBUG  [2024-12-15T16:24:26+01:00] Database error                                err="Failed to fetch from \"instance_snapshot_config\" table: Failed to fetch from \"instance_snapshot_config\" table: context deadline exceeded"                                                                                                                                                                                                                                                                                
DEBUG  [2024-12-15T16:24:26+01:00] Dqlite: network connection lost: write tcp 192.168.10.20:49312->192.168.10.20:8443: i/o timeout                                                                                                                                                                                                                                                                                                                                                                                
ERROR  [2024-12-15T16:24:26+01:00] Failed to start the daemon                    err="Failed applying patch \"instance_remove_volatile_last_state_ip_addresses\": Failed removing volatile.*.last_state.ip_addresses config keys: Failed to fetch from \"instance_snapshot_config\" table: Failed to fetch from \"instance_snapshot_config\" table: context deadline exceeded"                                                                                                                                    
INFO   [2024-12-15T16:24:26+01:00] Starting shutdown sequence                    signal=interrupt                                                                                                                                                                                                                                                                                                                                                                                                                 
DEBUG  [2024-12-15T16:24:26+01:00] Dqlite: network connection lost: write tcp 192.168.10.20:49312->192.168.10.20:8443: i/o timeout                                                                                                                                                                                                                                                                                                                                                                                
DEBUG  [2024-12-15T16:24:26+01:00] Dqlite: network connection lost: write tcp 192.168.10.20:49312->192.168.10.20:8443: i/o timeout                                                                                                                                                                                                                                                                                                                                                                                
DEBUG  [2024-12-15T16:24:26+01:00] Dqlite: network connection lost: write tcp 192.168.10.20:49312->192.168.10.20:8443: i/o timeout                                                                                                                                                                                                                                                                                                                                                                                
INFO   [2024-12-15T16:24:26+01:00] Transferring leadership                       address="192.168.10.20:8443"                                                                                                                                                                                                                                                                                                                                                                                                     
DEBUG  [2024-12-15T16:24:26+01:00] Dqlite: network connection lost: write tcp 192.168.10.20:49312->192.168.10.20:8443: i/o timeout                                                                                                                                                                                                                                                                                                                                                                                
WARNING[2024-12-15T16:24:26+01:00] Could not handover member's responsibilities  err="Failed to transfer leadership: No online voter found"                                                                                                                                                                                                                                                                                                                                                                       
DEBUG  [2024-12-15T16:24:26+01:00] Cancel ongoing or future gRPC connection attempts                                                                                                                                                                                                                                                                                                                                                                                                                              
WARNING[2024-12-15T16:24:26+01:00] Dqlite proxy failed                           err="first: remote -> local: local error: tls: bad record MAC" local="192.168.10.20:8443" name=dqlite remote="192.168.10.20:49312"                                                                                                                                                                                                                                                                                               
INFO   [2024-12-15T16:24:26+01:00] Dqlite proxy stopped                          local="192.168.10.20:8443" name=dqlite remote="192.168.10.20:49312"                                                                                                                                                                                                                                                                                                                                                              
WARNING[2024-12-15T16:24:26+01:00] Transaction timed out. Retrying once          err="Failed to begin transaction: context deadline exceeded" member=1                                                                                                                                                                                                                                                                                                                                                            
DEBUG  [2024-12-15T16:24:26+01:00] Database error                                err="Failed to begin transaction: failed to create dqlite connection: no available dqlite leader server found"                                                                                                                                                                                                                                                                                                                   
WARNING[2024-12-15T16:24:26+01:00] Loading local instances from disk as database is not available  err="Failed to begin transaction: failed to create dqlite connection: no available dqlite leader server found"                                                                                                                                                                                                                                                                                                 
WARNING[2024-12-15T16:24:26+01:00] Failed loading instance                       backup_file=/var/snap/lxd/common/lxd/containers/lxc-monitord.log/backup.yaml err="Failed parsing instance backup file from \"/var/snap/lxd/common/lxd/containers/lxc-monitord.log/backup.yaml\": open /var/snap/lxd/common/lxd/containers/lxc-monitord.log/backup.yaml: not a directory" instance=lxc-monitord.log project=default   

So it seems to try again, but fails again too. That last line about lxc-monitord.log is also strange: on my system (which, admittedly, had a bunch of Ubuntu updates over the years), /var/snap/lxd/common/lxd/containers/lxc-monitord.log is indeed not a directory, but a file:

# ls -l /var/snap/lxd/common/lxd/containers/lxc-monitord.log
-rw-r--r-- 1 root root 50512 sep 30  2018 /var/snap/lxd/common/lxd/containers/lxc-monitord.log

After this, lxd starts shutting down:

DEBUG  [2024-12-15T16:24:26+01:00] Cancel ongoing or future gRPC connection attempts                                                                                                                                                                                                                                                                                                                                                                                                                              
DEBUG  [2024-12-15T16:24:26+01:00] Cancel ongoing or future gRPC connection attempts                                                                                                                                                                                                                                                                                                                                                                                                                              
DEBUG  [2024-12-15T16:24:26+01:00] Database error                                err="Failed to begin transaction: sql: database is closed"                                                                                                                                                                                                                                                                                                                                                                       
WARNING[2024-12-15T16:24:26+01:00] Failed to get current cluster members         err="Failed to begin transaction: sql: database is closed"                                                                                                                                                                                                                                                                                                                                                                       
INFO   [2024-12-15T16:24:26+01:00] Closing the database                                                                                                                                                                                                                                                                                                                                                                                                                                                           
INFO   [2024-12-15T16:24:26+01:00] Stop database gateway                                                                                                                                                                                                                                                                                                                                                                                                                                                          
WARNING[2024-12-15T16:24:26+01:00] Dqlite last entry                             index=0 term=0                                                                                                                                                                                                                                                                                                                                                                                                                   
INFO   [2024-12-15T16:24:26+01:00] Closing socket                                socket="192.168.10.20:8443" type="REST API TCP socket"                                                                                                                                                                                                                                                                                                                                                                           
INFO   [2024-12-15T16:24:26+01:00] Closing socket                                socket=/var/snap/lxd/common/lxd/unix.socket type="REST API Unix socket"                                                                                                                                                                                                                                                                                                                                                          
INFO   [2024-12-15T16:24:26+01:00] Closing socket                                socket=/var/snap/lxd/common/lxd/devlxd/sock type="devlxd socket"                                                                                                                                                                                                                                                                                                                                                                 
INFO   [2024-12-15T16:24:26+01:00] Unmounting temporary filesystems             
INFO   [2024-12-15T16:24:26+01:00] Done unmounting temporary filesystems        
INFO   [2024-12-15T16:24:26+01:00] Daemon stopped                               
Error: Failed applying patch "instance_remove_volatile_last_state_ip_addresses": Failed removing volatile.*.last_state.ip_addresses config keys: Failed to fetch from "instance_snapshot_config" table: Failed to fetch from "instance_snapshot_config" table: context deadline exceeded

Please can you send me the contents of /var/snap/lxd/common/lxd/database and ill take a look.

Is this a standalone server or a cluster?

Did you try reverting the snap to the previous revision?

It may indeed be related to the amount of snapshots if the db operations are taking a long time.

You could also try switching to the LTS for now

snap refresh lxd --channel=5.21/stable

As that does not yet have that patch.

It’s a standalone server. Actually, it started out as a standalone server (in ~2018?). Then early 2022 I created a cluster to which I added this machine. Somewhere in summer 2023 I removed it from the cluster because I needed the other nodes for some other test stuff.

Thanks for the suggestion! I’ll try to downgrade to 5.21 and see if that works.

1 Like

Indeed, this works! :pray:. lxc works again and all containers are back online.

I’ll see if I can remove a bunch of the snapshots.

1 Like

If you could send me your database dir i can try and reproduce this issue and resolve it with a big fix, as eventually this same patch will maje irs way into the LTS series too.

1 Like

Thanks @tomp! I have just sent you an e-mail. If you need any other information, please let me know.

I’ll stick to 5.21/stable for now, but I could probably test something if needed. This is a home server, so a bit of downtime is OK :sweat_smile:.

1 Like

Thanks!

I’ve got this PR in the works that should fix it.

Once its merged I’ll apply it to the 6/candidate branch and then you can try refreshing to that and see if it works. Then you can switch to 6/stable going forward if you would like (this is preferable to latest/stable, see LXD 6.2 has been released).

Thanks

Thanks for the quick turnaround time! I’ll keep an eye on the patch and will try to switch to 6/candidate when it’s included there.

Hi,

Please can you try 6/candidate channel now,

sudo snap refresh lxd --channel=6/candidate

Thanks
Tom

I have tried the upgrade to 6/candidate: 6.2-bde4d03 and that went well :+1:.

I noticed that I currently have (only?) 1268 snapshots. Not sure how this got pruned that much, though (from 2112 before). I had only removed a handful manually. However, most snapshots have a 14 day lifespan, so maybe that accounts for the difference (although it sounds a bit much).

In any case, the patch seems to work. Thanks a lot for your time :pray:!

1 Like

Cool thanks, we’ll push that out to 6/stable in the new year.

1 Like

LXD 6.2-bde4d03 includes the fix for the volatile.%.last_state.ip_addresses patch and is now in latest/stable and 6/stable.

1 Like