ZFS block storage volume volblocksize not set, but only on one cluster member?

I have a weird case where the blocksize is not set when creating a container. I have a cluster with a “docker” storage pool defined as:

~$ ssh lxd12 lxc storage show docker
config:
volume.block.filesystem: ext4
volume.size: 15GiB
volume.zfs.block_mode: "true"
volume.zfs.blocksize: 128KiB
volume.zfs.use_refquota: "true"
description: ""
name: docker
driver: zfs
used_by:

status: Created
locations:
- lxd12
- lxd13

So far so good. If I create a container on lxd12, I get:

~$ lxc launch u22 test1 --target lxd12 --storage docker
Creating test1
Starting test1
~$ ssh lxd12 zfs get volblocksize lxd12/docker/containers/test1
NAME                           PROPERTY      VALUE     SOURCE
lxd12/docker/containers/test1  volblocksize  128K      -

All good, but if I do the same on lxd13, I get:

~$ lxc launch u22 test1 --target lxd13 --storage docker
Creating test1
Starting test1
~$ ssh lxd13 zfs get volblocksize lxd13/docker/containers/test1
NAME                           PROPERTY      VALUE     SOURCE
lxd13/docker/containers/test1  volblocksize  8K        default

Any ideas why volblocksize is not being set, but only on this one cluster member?

Hm, my primary motivation here was the hope that with a larger block size I would be able to achieve better compression ratios. But even with a 128K volblocksize, compression ratios are way worse than with regular datasets, so even if I figure this out it won’t make much of a difference.

Still, it would be great to figure out why this single cluster member is not setting the configured blocksize, whereas the six others are perfectly fine.

@dinmusic is this describing the same thing as https://github.com/canonical/lxd/pull/12089#issuecomment-1727627850 ?

@Aleks

Is the docker storage pool fresh, or was it used before for any other containers?

It could be that an optimized image was created on the other cluster member before zfs.blocksize was set. This would cause the older image (with the old block size) to be used.
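If it helps to confirm that, you could check the block size of the cached image volumes directly. This is just a sketch and assumes the usual LXD layout of <zpool>/<pool>/images on each member:

~$ ssh lxd12 zfs get -r volblocksize lxd12/docker/images
~$ ssh lxd13 zfs get -r volblocksize lxd13/docker/images

If the cached image on lxd13 still reports the 8K default, the container volume is simply cloned from that stale optimized image and keeps its block size.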

Edit:
Can you try creating one container with an image that you have not used previously?
If an optimized image does not exist for the given image, the block size should be set correctly.

For example:

lxc launch images:alpine/edge/cloud test123 --target lxd13 --storage docker
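To see which images already have an optimized copy cached on that member (and would therefore hit the stale-image path), you could compare the cached datasets against the image store; the dataset path below is an assumption based on the usual <zpool>/<pool>/images layout:

~$ lxc image list
~$ ssh lxd13 zfs list -r lxd13/docker/images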

Indeed

$ lxc launch images:alpine/edge/cloud test123 --target lxd13 --storage docker
Creating test123
Starting test123
$ ssh lxd13 zfs get volblocksize lxd13/docker/containers/test123
NAME                             PROPERTY      VALUE     SOURCE
lxd13/docker/containers/test123  volblocksize  128K      -

How do I regenerate the optimized image?

@dinmusic LXD should detect that and do it on demand, right? Is that the bug you mentioned?

@tomp Yes, this is the bug we encountered with the initial configuration. When deciding whether a new optimized image should be generated, LXD checks whether the block mode and filesystem of the new image (taken from the storage pool’s default settings) match those of the existing optimized image. Therefore, an existing optimized image will be reused as long as the block mode and filesystem match.

I tried fixing this by introducing a check for zfs.blocksize. However, if only the block size differs, the image is moved to deleted images (because existing instances depend on it) and then immediately restored from deleted images, because the fingerprints of the “deleted” and “new” images match.

Currently, the only solution I see is to either extend image fingerprints with the block size (similar to our approach for different filesystem types) or to avoid using an optimized image if the block size of the new image differs from the block size of the existing optimized image.

@Aleks unfortunately, I cannot think of any elegant workaround right now. To regenerate the optimized image with the incorrect block size, you’d need to first remove all instances that depend on this image and then remove the image itself. After that, the new image created when starting an instance should have the correct block size.
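A rough sketch of that workaround, assuming test1 is the only instance on lxd13 still using the image (the fingerprint is a placeholder, take the real one from lxc image list):

~$ lxc delete test1 --force
~$ lxc image delete <fingerprint>
~$ lxc launch u22 test1 --target lxd13 --storage docker

The relaunch should then create a fresh optimized image that picks up the configured 128K block size.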

Hm, how does an instance depend on an image? Isn’t the image used only once, at creation time?

I have just recreated the image so it got another fingerprint, and the blocksize is correct now. No idea how to do it if you want to keep the image exactly the same; I guess exporting it, removing it locally, and reimporting should suffice.
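For the export/reimport route, the mechanics would roughly be as follows (the fingerprint, file name, and alias are placeholders; whether this actually regenerates the optimized volume depends on the fingerprint behaviour described below):

~$ lxc image export <fingerprint> ./image-backup    # may produce one unified tarball or a metadata + rootfs pair
~$ lxc image delete <fingerprint>
~$ lxc image import ./image-backup.tar.gz --alias reimported    # pass both files here if the export was split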

An optimized image is effectively a read-only “snapshot” of the image that is cloned when a new instance is created (if it is suitable for the new instance).

A clone (the instance’s volume) is a writable volume created from a snapshot. The clone initially shares all its data blocks with the snapshot it was created from and consumes virtually no extra space. As changes are made to the clone, it starts to diverge from the snapshot, and new blocks are allocated exclusively for the clone.

Now, an optimized image contains references to blocks that the clone might still be using. If you were to delete the snapshot, you’d potentially remove blocks that the clone hasn’t overwritten yet and thus still relies on. This is why ZFS prevents the deletion of snapshots with dependent clones.

To demonstrate:

$ lxc storage create demo zfs
$ lxc launch ubuntu:22.04 c1 --storage demo --no-profiles
$ zfs list -r demo
NAME                                                                           USED  AVAIL     REFER  MOUNTPOINT
demo/containers/c1                                                            8.10M  28.0G      633M  legacy
demo/images/be57f822968b4f2831627e74590f887d5945cc7426361780fb3958327a6706be   631M  28.0G      631M  legacy
...

Note that container c1 uses only ~8MB, but references ~600MB of data.
We can try to remove this image, but ZFS won’t allow that because the image has dependent clones:

$ zfs destroy -r demo/images/be57f822968b4f2831627e74590f887d5945cc7426361780fb3958327a6706be
cannot destroy 'demo/images/be57f822968b4f2831627e74590f887d5945cc7426361780fb3958327a6706be': filesystem has dependent clones
use '-R' to destroy the following datasets:
demo/containers/c1
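The same relationship can be reproduced with plain ZFS, outside of LXD (the pool and dataset names below are made up for illustration):

$ zfs snapshot tank/images/base@readonly
$ zfs clone tank/images/base@readonly tank/containers/c1
$ zfs destroy tank/images/base@readonly    # refused: the snapshot has a dependent clone

As long as tank/containers/c1 exists, the snapshot it was cloned from cannot be destroyed without -R.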

OK, that’s good to know, but it doesn’t really matter much as long as you stick to high-level lxc commands and don’t mess around with zfs directly. The image dataset will be moved to deleted/ and happily stay there until the last container depending on it dies or goes away.

For example:

~$ lxc launch 9257ffbcb498 test123 --target lxd13 --storage docker
Creating test123
Starting test123                           
~$ lxc image delete 9257ffbcb498
~$ ssh lxd13
[root@lxd13 ~]# zfs list |fgrep test123
lxd13/docker/containers/test123                                                                     166M   578G      983M  -
[root@lxd13 ~]# zfs list |fgrep 9257ffbcb498
lxd13/docker/deleted/images/9257ffbcb498400dc7a5060f75c7159446bb6f4e40afba87e8976322e9e4a8a0_ext4   825M   578G      825M  -

~$ lxc stop test123
~$ lxc mv test123 --target lxd12
~$ ssh lxd12
[root@lxd12 ~]# zfs list | fgrep test123
lxd12/docker/containers/test123                                                             999M   573G      999M  -
[root@lxd12 ~]# zfs list | fgrep 9257ffbcb498

Correct.

And because the removed image remains in deleted/images until the last instance depending on it is deleted, creating a new instance from an image with the same fingerprint will cause the removed image to be restored from deleted/images back to images.
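To see this at the ZFS level, a sequence like the following should show the dataset moving back (the instance names are placeholders, and the fingerprint is whatever u22 resolves to):

~$ lxc launch u22 keep1 --target lxd13 --storage docker
~$ lxc image delete <u22 fingerprint>
~$ ssh lxd13 zfs list -r lxd13/docker/deleted/images
~$ lxc launch u22 keep2 --target lxd13 --storage docker
~$ ssh lxd13 zfs list -r lxd13/docker/images

After the second launch, the cached dataset should be back under images/ instead of deleted/images/, still with its old block size.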

The calculated fingerprint is unique for each available image. However, for the same image, the fingerprint differs only when the block mode or filesystem is changed. As a result, when zfs.blocksize is changed, the fingerprint remains the same, causing the old image to be used (instead of generating a new one).

Therefore, until we solve this bug, the only way to truly regenerate an existing optimized image is to remove all instances that depend on it and remove the image itself.
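If it helps with that workaround, one way to find which instance volumes still depend on a given optimized image is to inspect the ZFS origin property of the container volumes (a plain ZFS check, not an LXD feature):

~$ ssh lxd13 zfs list -r -o name,origin lxd13/docker/containers

Any volume whose origin points at a snapshot under images/ (or deleted/images/) is still holding on to that cached image.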

https://github.com/canonical/lxd/issues/12317