LXD does not start: "Failed loading instance config: instance_id"

Today, out of the blue, I found that my LXD isn’t working anymore. It continuously restarts with this log:

time="2024-05-11T17:29:12+02:00" level=warning msg=" - Couldn't find the CGroup network priority controller, per-instance network priority will be ignored. Please use per-device limits.priority instead"
time="2024-05-11T17:29:14+02:00" level=error msg="Failed initializing network" err="Failed starting: Failed loading instance config: instance_id" network=lxdbr0 project=default
time="2024-05-11T17:29:14+02:00" level=error msg="Failed to start the daemon" err="Failed loading local instances: Failed loading instance config: instance_id"
time="2024-05-11T17:29:14+02:00" level=warning msg="Loading local instances from disk as database is not available" err="Failed loading instance config: instance_id"

Further:

# lxd --debug --group lxd
DEBUG  [2024-05-11T17:35:07+02:00] Connecting to a local LXD over a Unix socket 
DEBUG  [2024-05-11T17:35:07+02:00] Sending request to LXD                        etag= method=GET url="http://unix.socket/1.0"
INFO   [2024-05-11T17:35:07+02:00] LXD is starting                               mode=normal path=/var/snap/lxd/common/lxd version=5.21.1
INFO   [2024-05-11T17:35:07+02:00] Kernel uid/gid map:                          
INFO   [2024-05-11T17:35:07+02:00]  - u 0 0 4294967295                          
INFO   [2024-05-11T17:35:07+02:00]  - g 0 0 4294967295                          
INFO   [2024-05-11T17:35:07+02:00] Configured LXD uid/gid map:                  
INFO   [2024-05-11T17:35:07+02:00]  - u 0 1000000 1000000000                    
INFO   [2024-05-11T17:35:07+02:00]  - g 0 1000000 1000000000                    
INFO   [2024-05-11T17:35:07+02:00] Kernel features:                             
INFO   [2024-05-11T17:35:07+02:00]  - closing multiple file descriptors efficiently: yes 
INFO   [2024-05-11T17:35:07+02:00]  - netnsid-based network retrieval: yes      
INFO   [2024-05-11T17:35:07+02:00]  - pidfds: yes                               
INFO   [2024-05-11T17:35:07+02:00]  - core scheduling: yes                      
INFO   [2024-05-11T17:35:07+02:00]  - uevent injection: yes                     
INFO   [2024-05-11T17:35:07+02:00]  - seccomp listener: yes                     
INFO   [2024-05-11T17:35:07+02:00]  - seccomp listener continue syscalls: yes   
INFO   [2024-05-11T17:35:07+02:00]  - seccomp listener add file descriptors: yes 
INFO   [2024-05-11T17:35:07+02:00]  - attach to namespaces via pidfds: yes      
INFO   [2024-05-11T17:35:07+02:00]  - safe native terminal allocation : yes     
INFO   [2024-05-11T17:35:07+02:00]  - unprivileged file capabilities: yes       
INFO   [2024-05-11T17:35:07+02:00]  - cgroup layout: cgroup2                    
WARNING[2024-05-11T17:35:07+02:00]  - Couldn't find the CGroup hugetlb controller, hugepage limits will be ignored 
WARNING[2024-05-11T17:35:07+02:00]  - Couldn't find the CGroup network priority controller, per-instance network priority will be ignored. Please use per-device limits.priority instead 
INFO   [2024-05-11T17:35:07+02:00]  - idmapped mounts kernel support: yes       
INFO   [2024-05-11T17:35:07+02:00] Instance type operational                     driver=lxc features="map[]" type=container
ERROR  [2024-05-11T17:35:08+02:00] Unable to run feature checks during QEMU initialization: Unable to locate the file for firmware "OVMF_CODE.4MB.fd" 
WARNING[2024-05-11T17:35:08+02:00] Instance type not operational                 driver=qemu err="QEMU failed to run feature checks" type=virtual-machine
INFO   [2024-05-11T17:35:08+02:00] Initializing local database                  
DEBUG  [2024-05-11T17:35:08+02:00] Refreshing identity cache with local trusted certificates 
INFO   [2024-05-11T17:35:08+02:00] Set client certificate to server certificate  fingerprint=f8e5967a4e24816664bcf6cbd25fea069c4b248cd38497967c4324bd10902424
DEBUG  [2024-05-11T17:35:08+02:00] Initializing database gateway                
INFO   [2024-05-11T17:35:08+02:00] Starting database node                        id=1 local=1 role=voter
INFO   [2024-05-11T17:35:08+02:00] Loading daemon configuration                 
DEBUG  [2024-05-11T17:35:08+02:00] Connecting to a local LXD over a Unix socket 
DEBUG  [2024-05-11T17:35:08+02:00] Sending request to LXD                        etag= method=GET url="http://unix.socket/1.0"
DEBUG  [2024-05-11T17:35:08+02:00] Detected stale unix socket, deleting         
INFO   [2024-05-11T17:35:08+02:00] Binding socket                                socket=/var/snap/lxd/common/lxd/unix.socket type="REST API Unix socket"
INFO   [2024-05-11T17:35:08+02:00] Binding socket                                socket=/var/snap/lxd/common/lxd/devlxd/sock type="devlxd socket"
INFO   [2024-05-11T17:35:08+02:00] Initializing global database                 
INFO   [2024-05-11T17:35:08+02:00] Connecting to global database                
DEBUG  [2024-05-11T17:35:08+02:00] Dqlite: attempt 1: server 1: connected       
INFO   [2024-05-11T17:35:08+02:00] Connected to global database                 
INFO   [2024-05-11T17:35:08+02:00] Initialized global database                  
DEBUG  [2024-05-11T17:35:08+02:00] Firewall detected "xtables" incompatibility: Backend command "iptables" is an nftables shim 
INFO   [2024-05-11T17:35:08+02:00] Firewall loaded driver                        driver=nftables
INFO   [2024-05-11T17:35:08+02:00] Initializing storage pools                   
DEBUG  [2024-05-11T17:35:08+02:00] Initializing storage pool                     pool=zfs
DEBUG  [2024-05-11T17:35:08+02:00] Mount started                                 driver=zfs pool=zfs
DEBUG  [2024-05-11T17:35:09+02:00] Mount finished                                driver=zfs pool=zfs
INFO   [2024-05-11T17:35:09+02:00] Initialized storage pool                      pool=zfs
INFO   [2024-05-11T17:35:09+02:00] All storage pools initialized                
INFO   [2024-05-11T17:35:09+02:00] Initializing daemon storage mounts           
INFO   [2024-05-11T17:35:09+02:00] Initializing networks                        
DEBUG  [2024-05-11T17:35:09+02:00] Start                                         driver=bridge network=lxdbr0 project=default
DEBUG  [2024-05-11T17:35:09+02:00] Setting up network                            driver=bridge network=lxdbr0 project=default
DEBUG  [2024-05-11T17:35:09+02:00] Stable MAC generated                          driver=bridge hwAddr="00:16:3e:b8:58:56" network=lxdbr0 project=default seed=f8e5967a4e24816664bcf6cbd25fea069c4b248cd38497967c4324bd10902424.0.1
DEBUG  [2024-05-11T17:35:09+02:00] Clearing firewall                             driver=bridge network=lxdbr0 project=default
DEBUG  [2024-05-11T17:35:09+02:00] Database error                                err="Failed loading instance config: instance_id"
ERROR  [2024-05-11T17:35:09+02:00] Failed initializing network                   err="Failed starting: Failed loading instance config: instance_id" network=lxdbr0 project=default
DEBUG  [2024-05-11T17:35:09+02:00] New operation                                 class=task description="Pruning leftover image files" operation=6b129ab7-0d58-418d-9704-925ca9942873 project=
DEBUG  [2024-05-11T17:35:09+02:00] Acquiring image task lock                    
DEBUG  [2024-05-11T17:35:09+02:00] Acquired image task lock                     
INFO   [2024-05-11T17:35:09+02:00] Cleaning up leftover image files             
DEBUG  [2024-05-11T17:35:09+02:00] Started operation                             class=task description="Pruning leftover image files" operation=6b129ab7-0d58-418d-9704-925ca9942873 project=
DEBUG  [2024-05-11T17:35:09+02:00] Success for operation                         class=task description="Pruning leftover image files" operation=6b129ab7-0d58-418d-9704-925ca9942873 project=
INFO   [2024-05-11T17:35:09+02:00] Done cleaning up leftover image files        
INFO   [2024-05-11T17:35:09+02:00] Starting device monitor                      
INFO   [2024-05-11T17:35:09+02:00] Initialized filesystem monitor                driver=fanotify path=/dev
DEBUG  [2024-05-11T17:35:09+02:00] Database error                                err="Failed loading instance config: instance_id"
ERROR  [2024-05-11T17:35:09+02:00] Failed to start the daemon                    err="Failed loading local instances: Failed loading instance config: instance_id"
INFO   [2024-05-11T17:35:09+02:00] Starting shutdown sequence                    signal=interrupt
DEBUG  [2024-05-11T17:35:09+02:00] Database error                                err="Failed loading instance config: instance_id"
WARNING[2024-05-11T17:35:09+02:00] Loading local instances from disk as database is not available  err="Failed loading instance config: instance_id"
DEBUG  [2024-05-11T17:35:09+02:00] Cancel ongoing or future gRPC connection attempts 
DEBUG  [2024-05-11T17:35:09+02:00] Cancel ongoing or future gRPC connection attempts 
INFO   [2024-05-11T17:35:09+02:00] Closing the database                         
INFO   [2024-05-11T17:35:09+02:00] Stop database gateway                        
INFO   [2024-05-11T17:35:09+02:00] Closing socket                                socket=/var/snap/lxd/common/lxd/unix.socket type="REST API Unix socket"
INFO   [2024-05-11T17:35:09+02:00] Closing socket                                socket=/var/snap/lxd/common/lxd/devlxd/sock type="devlxd socket"
INFO   [2024-05-11T17:35:09+02:00] Unmounting temporary filesystems             
INFO   [2024-05-11T17:35:09+02:00] Done unmounting temporary filesystems        
INFO   [2024-05-11T17:35:09+02:00] Daemon stopped                               
Error: Failed loading local instances: Failed loading instance config: instance_id

I tried upgrading from Ubuntu 22.04 LTS to Ubuntu 24.04 LTS (which I am now running) in the hope that the problem would go away, but it did not.

I further tried to restore “/var/snap/lxd/common/lxd/database/global/” from an older snapshot, to no avail.
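Concretely, the restore attempt looked roughly like this (a sketch; the snapshot name is a placeholder, and it assumes the snap data directory sits on a ZFS dataset mounted at /var/snap/lxd/common):

```
snap stop lxd
# pull the database directory out of a read-only ZFS snapshot mount
# (snapshot name is hypothetical)
rm -rf /var/snap/lxd/common/lxd/database/global
cp -a /var/snap/lxd/common/.zfs/snapshot/autosnap_2024-05-10/lxd/database/global \
      /var/snap/lxd/common/lxd/database/
snap start lxd
```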

I am unsure how to proceed from here; I can’t find anything on “Failed loading instance config”.
I do have some ZFS snapshots from both the containers and the snap directory.
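For completeness, this is roughly how to enumerate them (the dataset names below are placeholders, not my real ones):

```
# -r recurses into child datasets, e.g. one per container
zfs list -r -t snapshot -o name,creation -s creation tank/lxd
zfs list -r -t snapshot -o name,creation -s creation tank/snapdata
```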

Disclosure: Crosspost with https://new.reddit.com/r/LXD/comments/1cpl333/lxd_does_not_start_failed_loading_instance_config/ as I would really like to get my home server running again …

If a solution comes up on the other side, I will include it here.

Here is `journalctl -f` during the start:

Mai 11 21:04:35 ae35 systemd[1]: Started snap.lxd.lxd-6600db9c-b3cd-4a17-b7d5-7af87361cac7.scope.
Mai 11 21:04:37 ae35 NetworkManager[3634]: <info>  [1715454277.7207] manager: (lxdbr0): new Bridge device (/org/freedesktop/NetworkManager/Devices/394)
Mai 11 21:04:37 ae35 avahi-daemon[3498]: Joining mDNS multicast group on interface lxdbr0.IPv4 with address 10.103.164.1.
Mai 11 21:04:37 ae35 avahi-daemon[3498]: New relevant interface lxdbr0.IPv4 for mDNS.
Mai 11 21:04:37 ae35 NetworkManager[3634]: <info>  [1715454277.7423] device (lxdbr0): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'external')
Mai 11 21:04:37 ae35 avahi-daemon[3498]: Registering new address record for 10.103.164.1 on lxdbr0.IPv4.
Mai 11 21:04:37 ae35 NetworkManager[3634]: <info>  [1715454277.7450] device (lxdbr0): state change: unavailable -> disconnected (reason 'connection-assumed', sys-iface-state: 'external')
Mai 11 21:04:37 ae35 NetworkManager[3634]: <info>  [1715454277.7454] device (lxdbr0): Activation: starting connection 'lxdbr0' (68e34de5-7421-4ce8-8e8b-ef2e47ee3d29)
Mai 11 21:04:37 ae35 NetworkManager[3634]: <info>  [1715454277.7455] device (lxdbr0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'external')
Mai 11 21:04:37 ae35 NetworkManager[3634]: <info>  [1715454277.7456] device (lxdbr0): state change: prepare -> config (reason 'none', sys-iface-state: 'external')
Mai 11 21:04:37 ae35 NetworkManager[3634]: <info>  [1715454277.7457] device (lxdbr0): state change: config -> ip-config (reason 'none', sys-iface-state: 'external')
Mai 11 21:04:37 ae35 NetworkManager[3634]: <info>  [1715454277.7458] device (lxdbr0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'external')
Mai 11 21:04:37 ae35 NetworkManager[3634]: <info>  [1715454277.7470] device (lxdbr0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Mai 11 21:04:37 ae35 NetworkManager[3634]: <info>  [1715454277.7471] device (lxdbr0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Mai 11 21:04:37 ae35 NetworkManager[3634]: <info>  [1715454277.7473] device (lxdbr0): Activation: successful, device activated.
Mai 11 21:04:37 ae35 avahi-daemon[3498]: Joining mDNS multicast group on interface lxdbr0.IPv6 with address fd42:9d74:9aa4:702a::1.
Mai 11 21:04:37 ae35 avahi-daemon[3498]: New relevant interface lxdbr0.IPv6 for mDNS.
Mai 11 21:04:37 ae35 avahi-daemon[3498]: Registering new address record for fd42:9d74:9aa4:702a::1 on lxdbr0.*.
Mai 11 21:04:37 ae35 kernel: audit: type=1400 audit(1715454277.758:571): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="lxd_dnsmasq-lxdbr0_</var/snap/lxd/common/lxd>" pid=197993 comm="apparmor_parser"
Mai 11 21:04:37 ae35 avahi-daemon[3498]: Interface lxdbr0.IPv6 no longer relevant for mDNS.
Mai 11 21:04:37 ae35 avahi-daemon[3498]: Leaving mDNS multicast group on interface lxdbr0.IPv6 with address fd42:9d74:9aa4:702a::1.
Mai 11 21:04:37 ae35 avahi-daemon[3498]: Interface lxdbr0.IPv4 no longer relevant for mDNS.
Mai 11 21:04:37 ae35 avahi-daemon[3498]: Leaving mDNS multicast group on interface lxdbr0.IPv4 with address 10.103.164.1.
Mai 11 21:04:37 ae35 avahi-daemon[3498]: Withdrawing address record for fd42:9d74:9aa4:702a::1 on lxdbr0.
Mai 11 21:04:37 ae35 avahi-daemon[3498]: Withdrawing address record for 10.103.164.1 on lxdbr0.
Mai 11 21:04:37 ae35 NetworkManager[3634]: <info>  [1715454277.9010] device (lxdbr0): state change: activated -> unmanaged (reason 'unmanaged', sys-iface-state: 'removed')
Mai 11 21:04:38 ae35 systemd[1]: snap.lxd.lxd-6600db9c-b3cd-4a17-b7d5-7af87361cac7.scope: Deactivated successfully.
Mai 11 21:04:38 ae35 systemd[1]: snap.lxd.lxd-6600db9c-b3cd-4a17-b7d5-7af87361cac7.scope: Consumed 2.402s CPU time.
Mai 11 21:04:44 ae35 systemd[1]: zsysd.service: Deactivated successfully.
Mai 11 21:04:44 ae35 systemd[1]: zsysd.service: Consumed 28.830s CPU time.

Could this be an AppArmor issue?
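If it were, I would expect denials in the kernel log; a quick way to check (the timestamp is just an example):

```
# look for AppArmor denials around the failed start
journalctl -k --since "2024-05-11 21:00" | grep -i apparmor
dmesg | grep -i 'apparmor="DENIED"'
```

The only AppArmor message in the journal above is a profile_replace with “same as current profile, skipping”, which is a status line rather than a denial.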

Hi

This sounds similar to https://discuss.linuxcontainers.org/t/cannot-list-instances-failed-to-fetch-instances-id/9377

Can you stop LXD using `snap stop lxd`?

And then tar up

/var/snap/lxd/common/lxd/database/global

and send it to me via a private message, please.
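Something like this should do it (the archive name is just an example):

```
snap stop lxd
tar -czf /tmp/lxd-global-db.tar.gz -C /var/snap/lxd/common/lxd/database global
```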

Since I needed to get the system running again, I opted for the nuclear option and reinstalled the lxd snap (removing it without the `--purge` switch).

Maybe somebody can review whether I have created any future problems for myself.

  1. Removed the lxd snap.
  2. Installed the lxd snap.
  3. Ran `lxd init`.
  4. Did NOT create a storage pool.
  5. DID create a bridge. This worked. In contrast to before, I supplied an explicit IPv4 subnet instead of “auto”; no idea if that mattered. I have seen issues where IP detection failed here.
  6. Ran `lxd recover`.
  7. Was told that I had snapshot inconsistencies. Maybe I accidentally deleted ZFS snapshots in the past? Maybe sanoid/syncoid caused issues?
  8. Looked into backup.yaml and tried to recreate the expected @snapN snapshots, but still got the error.
  9. Removed all snapshots and the two snapshot sections from the backup.yaml files.
  10. Successfully recovered the storage pool; the containers run now. (The rough command sequence is sketched after this list.)
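For anyone hitting the same problem, the sequence was roughly the following (pool, dataset, and snapshot names are placeholders; yours will differ):

```
snap remove lxd      # no --purge, so snapd keeps an automatic data snapshot
snap install lxd
lxd init             # answered "no" to creating a storage pool,
                     # "yes" to a new bridge with an explicit IPv4 subnet

# reconcile the ZFS snapshots with what each backup.yaml expects;
# in the end I simply dropped them (names hypothetical)
zfs destroy tank/lxd/containers/mycontainer@snap0
# ...and removed the two snapshot sections from each container's backup.yaml

lxd recover          # then pointed it at the existing ZFS pool
```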

Sounds good, glad you got it back up and running.
Using the `lxd recover` tool was a good approach.

It would be great if `lxd recover` had some healing power. I think it wouldn’t be hard to give it an option to adjust the containers to match the actual snapshot configuration.

We plan on improving the `lxd recover` experience; see canonical/lxd issues #13463 (“`lxd recover` should better handle containers with `security.idmap.isolated=true`”) and #13464 (“`lxd recover` loses the expiration dates of snapshots”).