Random "Failed to activate volume: Failed to locate zvol for ..."

Hello all!

I figure I’m doing something wrong (or incomplete) but I’m not certain what’s going on. Figured I might as well ask.

I have the LXD snap setup and when I did the init, I used an existing zpool (mirrored). So the default pool itself is ZFS on partitions on a couple of drives.

The error is occurring with my BOSH CPI. It was working, but I was having issues with the image being copied every time (never was cached). Finally realized (I think) it was due to the fact I specified the size of the root disk. When I stopped doing that, the image copy was instantaneous (and I never saw the QCOW2 file being copied). This is not immediately pertinent to the question, but it could be that it was hiding this issue (by slowing VM creation down).

Posting the larger logs from the BOSH create going on. In this particular stage, all the packages are being compiled, and 5 VMs are being spun up. 4 of them are fine and the package gets compiled. The 5th one (uncertain which one since it’s not clear) never gets created – after the VM gets created (the state is stopped) an ephemeral disk of 10GiB is created. 4 of those disks gets created, but the 5th one boshdev_vol-e-62116b9e-92a6-41ad-4059-65d083399ed7 fails with the Failed to activate volume: Failed to locate zvol error:

Task 9 | 17:51:19 | Preparing deployment: Preparing deployment (00:00:02)
Task 9 | 17:51:21 | Preparing deployment: Rendering templates (00:00:01)
Task 9 | 17:51:22 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 9 | 17:51:22 | Compiling packages: golang-1-linux/2721afc7ec762ad6364f4910407eb0dda5541f1288c9fccae9c75d33d7e07aff
Task 9 | 17:51:22 | Compiling packages: postgres-yq-4/dc26c48bc847408082bf497930f6ba6775935da6c143b2016bdfe3b8ba5b554c
Task 9 | 17:51:22 | Compiling packages: postgres-13/4534645edb722f4f365c67e374063bdd59c210af9eba5f961b9ad55ab1db84d1
Task 9 | 17:51:22 | Compiling packages: postgres-15/3fe5c61e6b6779ce8a2c127047e2572c200f765afce0df76510aebce15aff61b
Task 9 | 17:51:22 | Compiling packages: postgres-16/eefdc0f0bb8341f74113e36ba4604b46ca67cf54a82e57cb901470be38ba46e1
Task 9 | 17:51:55 | Compiling packages: golang-1-linux/2721afc7ec762ad6364f4910407eb0dda5541f1288c9fccae9c75d33d7e07aff (00:00:33)
                  L Error: CPI error 'Bosh::Clouds::CloudError' with message 'Create ephemeral disk: Failed to activate volume: Failed to locate zvol for "lxd-storage/custom/boshdev_vol-e-62116b9e-92a6-41ad-4059-65d083399ed7": context deadline exceeded' in 'create_vm' CPI method (CPI request ID: 'cpi-901775')
Task 9 | 17:52:16 | Compiling packages: postgres-yq-4/dc26c48bc847408082bf497930f6ba6775935da6c143b2016bdfe3b8ba5b554c (00:00:54)
Task 9 | 17:56:29 | Compiling packages: postgres-13/4534645edb722f4f365c67e374063bdd59c210af9eba5f961b9ad55ab1db84d1 (00:05:07)
Task 9 | 17:56:42 | Compiling packages: postgres-15/3fe5c61e6b6779ce8a2c127047e2572c200f765afce0df76510aebce15aff61b (00:05:20)
Task 9 | 17:56:53 | Compiling packages: postgres-16/eefdc0f0bb8341f74113e36ba4604b46ca67cf54a82e57cb901470be38ba46e1 (00:05:31)
Task 9 | 17:56:56 | Error: CPI error 'Bosh::Clouds::CloudError' with message 'Create ephemeral disk: Failed to activate volume: Failed to locate zvol for "lxd-storage/custom/boshdev_vol-e-62116b9e-92a6-41ad-4059-65d083399ed7": context deadline exceeded' in 'create_vm' CPI method (CPI request ID: 'cpi-901775')

What’s confusing here is that it works for 4 of 5 disks. Fortunately, the Go code is pretty much just LXD API calls:

func (c CPI) createDisk(size int, name string) error {
	storageVolumeRequest := api.StorageVolumesPost{
		Name:        name,
		Type:        "custom",
		ContentType: "block",
		StorageVolumePut: api.StorageVolumePut{
			Config: map[string]string{
				"size": fmt.Sprintf("%dMiB", size),
			},
		},
	}

	return c.client.CreateStoragePoolVolume(c.config.Server.StoragePool, storageVolumeRequest)
}

Where, in the problem case, size=10240 and name=vol-e-62116b9e-92a6-41ad-4059-65d083399ed7 (e=ephemeral and that is a uuid at the end). … just like the other four.

I can’t find any errors in the Zpool:

$ zpool status
  pool: lxd-storage
 state: ONLINE
config:

	NAME         STATE     READ WRITE CKSUM
	lxd-storage  ONLINE       0     0     0
	  mirror-0   ONLINE       0     0     0
	    sdb1     ONLINE       0     0     0
	    sdc1     ONLINE       0     0     0

errors: No known data errors

When I was watching load, it was not very high (2, I think?).

I found nothing in dmesg. Uncertain where the ZFS logs are (if there are anything specific).

Any thoughts on what I should be looking at? I am assuming this isn’t really LXD.

Thanks!
-Rob

Hello,

Which LXD version (and snap revision) is being used? This has been addressed quite recently, where udev rules were not applied quickly enough, causing the device to not be detected. Now, it tries to locate the device periodically up to 30 seconds. Therefore, I’m wondering whether this is happening on LXD prior the fix, or is an edge case that was not covered.

I think it’s current?

$ sudo snap list
Name     Version         Rev    Tracking       Publisher   Notes
core20   20240416        2318   latest/stable  canonical✓  base
core22   20240823        1612   latest/stable  canonical✓  base
juju-db  4.4.24          178    4.4/stable     juju-qa     -
lxd      5.21.2-2f4ba6b  30131  5.21/stable    canonical✓  -
snapd    2.63            21759  latest/stable  canonical✓  snapd

(Not using juju-db – I just can’t get it to go away. lol!)

Just in case this is useful. I create an ISO for the configuration. Sometimes that doesn’t work either. Assuming it’s the same thing.

Task 4 | 18:44:02 | Compiling packages: golang-1-linux/2721afc7ec762ad6364f4910407eb0dda5541f1288c9fccae9c75d33d7e07aff
Task 4 | 18:44:02 | Compiling packages: postgres-13/4534645edb722f4f365c67e374063bdd59c210af9eba5f961b9ad55ab1db84d1
Task 4 | 18:44:02 | Compiling packages: postgres-16/eefdc0f0bb8341f74113e36ba4604b46ca67cf54a82e57cb901470be38ba46e1
Task 4 | 18:44:02 | Compiling packages: postgres-15/3fe5c61e6b6779ce8a2c127047e2572c200f765afce0df76510aebce15aff61b
Task 4 | 18:44:02 | Compiling packages: postgres-yq-4/dc26c48bc847408082bf497930f6ba6775935da6c143b2016bdfe3b8ba5b554c
Task 4 | 18:44:08 | Compiling packages: postgres-16/eefdc0f0bb8341f74113e36ba4604b46ca67cf54a82e57cb901470be38ba46e1 (00:00:06)
                  L Error: CPI error 'Bosh::Clouds::CloudError' with message 'Write AgentEnv: writeAgentFileToVm(vm-80b298c2-4ef3-499e-61bf-03e12c9480fd) - Create: Wait: Failed creating custom volume from ISO: Failed creating volume: open /dev/zvol/lxd-storage/custom/boshdev_vol-c-cf32bd71-72c4-4e9a-5733-7a089cc34fca.iso: no such device or address' in 'create_vm' CPI method (CPI request ID: 'cpi-990906')
Task 4 | 18:44:36 | Compiling packages: golang-1-linux/2721afc7ec762ad6364f4910407eb0dda5541f1288c9fccae9c75d33d7e07aff (00:00:34)
                  L Error: CPI error 'Bosh::Clouds::CloudError' with message 'Create ephemeral disk: Failed to activate volume: Failed to locate zvol for "lxd-storage/custom/boshdev_vol-e-718f9e86-61a3-4d93-7e8d-65df0dc16916": context deadline exceeded' in 'create_vm' CPI method (CPI request ID: 'cpi-437471')
Task 4 | 18:44:58 | Compiling packages: postgres-yq-4/dc26c48bc847408082bf497930f6ba6775935da6c143b2016bdfe3b8ba5b554c (00:00:56)
Task 4 | 18:49:10 | Compiling packages: postgres-13/4534645edb722f4f365c67e374063bdd59c210af9eba5f961b9ad55ab1db84d1 (00:05:08)
Task 4 | 18:49:31 | Compiling packages: postgres-15/3fe5c61e6b6779ce8a2c127047e2572c200f765afce0df76510aebce15aff61b (00:05:29)
Task 4 | 18:49:33 | Error: CPI error 'Bosh::Clouds::CloudError' with message 'Write AgentEnv: writeAgentFileToVm(vm-80b298c2-4ef3-499e-61bf-03e12c9480fd) - Create: Wait: Failed creating custom volume from ISO: Failed creating volume: open /dev/zvol/lxd-storage/custom/boshdev_vol-c-cf32bd71-72c4-4e9a-5733-7a089cc34fca.iso: no such device or address' in 'create_vm' CPI method (CPI request ID: 'cpi-990906')

For easier reading, those errors are:

  • Failed creating custom volume from ISO: Failed creating volume: open /dev/zvol/lxd-storage/custom/boshdev_vol-c-cf32bd71-72c4-4e9a-5733-7a089cc34fca.iso: no such device or address
  • Failed to activate volume: Failed to locate zvol for “lxd-storage/custom/boshdev_vol-e-718f9e86-61a3-4d93-7e8d-65df0dc16916”: context deadline exceeded

That’s coming from this snippet:

	buf := bytes.NewBuffer(diskImage)
	err = wait(c.client.CreateStoragePoolVolumeFromISO(c.config.Server.StoragePool, lxdclient.StoragePoolVolumeBackupArgs{
		Name:       diskName,
		BackupFile: buf,
	}))
	if err != nil {
		return bosherr.WrapErrorf(err, "writeAgentFileToVm(%s) - Create", vmCID.AsString())
	}

If I have 5 VMs being created with 10GiB ephemeral disk each, could I just be doing too much I/O (50GiB) in a short amount of time? I was thinking maybe I write a wrapper around it do retry after a short duration. I don’t have much control over what is being done, but I can try to prevent problematic conditions.
-Rob

Sadly, I get it at the LXC command line as well:

$ lxc storage volume create default testingdisk1 --type=block size=10240MiB
Error: Failed to activate volume: Failed to locate zvol for "lxd-storage/custom/boshdev_testingdisk1": context deadline exceeded

This is my desktop, also installed from the snap:

$ sudo snap list lxd
Name  Version         Rev    Tracking     Publisher   Notes
lxd   5.21.2-2f4ba6b  30131  5.21/stable  canonical✓  -
$ lxc version
Client version: 5.21.2 LTS
Server version: 5.21.2 LTS

Obviously I have something tied up in knots. I’ll try rebooting and see if that helps (not a resolution but at least indicates it’s transitory).
-Rob

After the reboot, I also realized I may want to update all my Go libraries. There were updates for the LXD API, so everything should be current.

I was able to deploy the BOSH VM (hasn’t been an issue), Postgres (1 VM with 1 error that went away on a 2nd try), Concourse (4 VMs with no errors), and finally Cloud Foundry (15 VMs with no errors and no timeouts). So really good!

For the record, the error I did get was related to the ISO configuration file (note that this is done for every single VM, so it was definitely transitory):

  • Error: CPI error ‘Bosh::Clouds::CloudError’ with message ‘Write AgentEnv: writeAgentFileToVm(vm-deadc240-6a3b-4725-5488-0d48f291f948) - Create: Wait: Failed creating custom volume from ISO: Failed creating volume: open /dev/zvol/lxd-storage/custom/boshdev_vol-c-e15a5b20-ea0b-4ad9-7a62-f9d9a3301838.iso: no such device or address’ in ‘create_vm’ CPI method (CPI request ID: ‘cpi-211464’)

Pulling off all the wrapper text, that gets us to:

  • Create: Wait: Failed creating custom volume from ISO: Failed creating volume: open /dev/zvol/lxd-storage/custom/boshdev_vol-c-e15a5b20-ea0b-4ad9-7a62-f9d9a3301838.iso: no such device or address (related code is here)

… which I think is the same thing. Uncertain if there is anything to do, since it’s so transitory. Wondering if there is anything I should look at or any logs? I can tear down the dev environment and rebuild without any issues (and wander away due to how long it takes!).

The repo is here, in case someone wants to poke around: GitHub - a2geek/bosh-lxd-cpi-release: A BOSH CPI release for LXD
The Go code is in src/ and the actual LXD interactions will be in src/cpi/ … mostly named after the BOSH API’s that are implemented - the obvious action related code (create_vm, create_disk, etc).

Thanks!
-Rob