NOHZ tick-stop error (ZFS and Ubuntu Fan on Noble 24.04)

I just purchased a couple of used Lenovo P510s to experiment with LXD. Unfortunately, once I create a cluster and launch a couple of containers, the workstations become utterly unusable. I was able to capture output from dmesg, and these are (I think) the problem logs (the “cut here” notation helps!). Hoping somebody can point me in the right direction.

Latest BIOS from Lenovo (05/24/2022, IIRC), latest Ubuntu (24.04.1), and latest LXD snap (5.21.2). Kernel is 6.8.0-48. (The workstation locked up at that point, so I had maybe 30 seconds, tops.)

Vanilla Ubuntu ran all day with no issues. The snap install and ZFS didn’t seem to cause problems either, but I didn’t let it sit long. I think it was either the cluster join or launching the containers that triggered the bad behavior.

[26689.256997] NOHZ tick-stop error: local softirq work is pending, handler #200!!!
[26689.277888] NOHZ tick-stop error: local softirq work is pending, handler #200!!!
[26689.325701] NOHZ tick-stop error: local softirq work is pending, handler #200!!!
[26689.353495] ------------[ cut here ]------------
[26689.353499] Voluntary context switch within RCU read-side critical section!
[26689.353504] WARNING: CPU: 11 PID: 6392 at kernel/rcu/tree_plugin.h:320 rcu_note_context_switch+0x2ce/0x2f0
[26689.353511] Modules linked in: veth nft_masq nft_chain_nat vxlan ip6_udp_tunnel udp_tunnel dummy bridge stp llc nvme_tcp nvme_keyring nvme_fabrics nvme_core nvme_auth ebtable_filter ebtables ip6table_raw ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_raw iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter nf_tables vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock zfs(PO) spl(O) qrtr cfg80211 binfmt_misc intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common sb_edac nouveau snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp coretemp drm_gpuvm drm_exec snd_hda_intel gpu_sched snd_intel_dspcfg snd_intel_sdw_acpi drm_ttm_helper kvm_intel snd_hda_codec ttm snd_hda_core drm_display_helper snd_hwdep kvm cec snd_pcm ee1004 snd_timer rc_core nls_iso8859_1 irqbypass think_lmi i2c_algo_bit snd i2c_i801 rtsx_usb_ms mei_me rapl memstick firmware_attributes_class
[26689.353560]  intel_wmi_thunderbolt wmi_bmof intel_cstate mxm_wmi video mei intel_pch_thermal i2c_smbus soundcore lpc_ich input_leds mac_hid serio_raw sch_fq_codel dm_multipath msr efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 uas usb_storage rtsx_usb_sdmmc rtsx_usb crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 e1000e ahci xhci_pci libahci pata_acpi xhci_pci_renesas wmi aesni_intel crypto_simd cryptd
[26689.353593] CPU: 11 PID: 6392 Comm: systemd-resolve Tainted: P           O       6.8.0-48-generic #48-Ubuntu
[26689.353595] Hardware name: LENOVO 30B4S16S00/102F, BIOS S00KT73A 05/24/2022
[26689.353596] RIP: 0010:rcu_note_context_switch+0x2ce/0x2f0
[26689.353599] Code: fe ff ff ba 02 00 00 00 be 01 00 00 00 e8 fa d0 fe ff e9 6b fe ff ff 48 c7 c7 18 99 06 ac c6 05 ad 8e 61 02 01 e8 b2 12 f2 ff <0f> 0b e9 96 fd ff ff 0f 0b e9 36 ff ff ff 0f 0b e9 18 ff ff ff 66
[26689.353600] RSP: 0018:ffffa8f342ac7af0 EFLAGS: 00010046
[26689.353602] RAX: 0000000000000000 RBX: ffff8ecebfdb5a00 RCX: 0000000000000000
[26689.353604] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[26689.353605] RBP: ffffa8f342ac7b10 R08: 0000000000000000 R09: 0000000000000000
[26689.353605] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[26689.353606] R13: ffff8ebf80d48000 R14: 0000000000000000 R15: ffff8ebf8f4b1d80
[26689.353607] FS:  00007054bbc85940(0000) GS:ffff8ecebfd80000(0000) knlGS:0000000000000000
[26689.353609] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[26689.353610] CR2: 0000619469f7cd90 CR3: 00000001a8048003 CR4: 00000000003706f0
[26689.353611] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[26689.353612] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[26689.353613] Call Trace:
[26689.353615]  <TASK>
[26689.353618]  ? show_regs+0x6d/0x80
[26689.353621]  ? __warn+0x89/0x160
[26689.353626]  ? rcu_note_context_switch+0x2ce/0x2f0
[26689.353628]  ? report_bug+0x17e/0x1b0
[26689.353632]  ? handle_bug+0x51/0xa0
[26689.353635]  ? exc_invalid_op+0x18/0x80
[26689.353637]  ? asm_exc_invalid_op+0x1b/0x20
[26689.353641]  ? rcu_note_context_switch+0x2ce/0x2f0
[26689.353643]  __schedule+0x81/0x6b0
[26689.353648]  schedule+0x33/0x110
[26689.353650]  schedule_hrtimeout_range_clock+0x13a/0x150
[26689.353654]  schedule_hrtimeout_range+0x13/0x30
[26689.353657]  ep_poll+0x342/0x390
[26689.353663]  ? __pfx_ep_autoremove_wake_function+0x10/0x10
[26689.353666]  do_epoll_wait+0xdb/0x100
[26689.353668]  __x64_sys_epoll_wait+0x6f/0x110
[26689.353670]  x64_sys_call+0x18af/0x25c0
[26689.353673]  do_syscall_64+0x7f/0x180
[26689.353678]  ? __seccomp_filter+0x368/0x570
[26689.353682]  ? __task_pid_nr_ns+0x6c/0xc0
[26689.353686]  ? syscall_exit_to_user_mode+0x86/0x260
[26689.353689]  ? do_syscall_64+0x8c/0x180
[26689.353692]  ? syscall_exit_to_user_mode+0x86/0x260
[26689.353695]  ? do_syscall_64+0x8c/0x180
[26689.353697]  ? irqentry_exit_to_user_mode+0x7b/0x260
[26689.353698]  ? irqentry_exit+0x43/0x50
[26689.353699]  ? exc_page_fault+0x94/0x1b0
[26689.353702]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[26689.353704] RIP: 0033:0x7054bc25a007
[26689.353721] Code: 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb be 0f 1f 44 00 00 f3 0f 1e fa 80 3d 45 10 0e 00 00 41 89 ca 74 10 b8 e8 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 55 48 89 e5 48 83 ec 20 89 55 f8 48 89
[26689.353722] RSP: 002b:00007ffc9b18fba8 EFLAGS: 00000202 ORIG_RAX: 00000000000000e8
[26689.353725] RAX: ffffffffffffffda RBX: 000000000000002a RCX: 00007054bc25a007
[26689.353726] RDX: 000000000000002a RSI: 000061946ad25a50 RDI: 0000000000000004
[26689.353727] RBP: 00007ffc9b18fcc0 R08: 0000000000000000 R09: 0000000000000004
[26689.353728] R10: 00000000ffffffff R11: 0000000000000202 R12: 000061946ad25a50
[26689.353729] R13: 0000000000000015 R14: 000061946ad9c360 R15: ffffffffffffffff
[26689.353730]  </TASK>
[26689.353731] ---[ end trace 0000000000000000 ]---

Hi there

Can you please try without ZFS, and also without using Ubuntu Fan mode?

There are known issues with ZFS and Ubuntu Fan in the Ubuntu 6.8 kernel.
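For reference, a minimal sketch of what “without ZFS and without Fan” could look like when setting up storage and networking by hand. The pool name `local` and bridge name `lxdbr0` are illustrative, not taken from this thread:

```shell
# Use a btrfs-backed storage pool instead of ZFS
# (the pool name "local" is hypothetical).
lxc storage create local btrfs

# Create a plain managed bridge; not setting bridge.mode=fan
# avoids Ubuntu Fan entirely.
lxc network create lxdbr0

# Point the default profile at the new pool and bridge.
lxc profile device add default root disk path=/ pool=local
lxc profile device add default eth0 nic network=lxdbr0 name=eth0
```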

@amikhalitsyn does this look more like the ZFS or Ubuntu Fan issue?

It looks like the Ubuntu Fan one: Bug #2064176 “LXD fan bridge causes blocked tasks” (linux package, Ubuntu).

Update: you need kernel >= 6.8.0-50.50 to have this issue fixed.

But don’t use ZFS in production with Ubuntu Noble. It’s extremely unstable.

See:
https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/2077926

Thanks for the quick reply. I’ll try with btrfs and no Fan (I have no clue whether I created one or not).

Follow up questions:

  • Would I be better off using Ubuntu 22.04, or even jumping ahead to 24.10? This is a home project (I have admittedly weird hobbies :wink: ), so I’m not tied to anything specific.
  • I wanted to note that the storage drivers documentation seems to strongly suggest ZFS.

Thanks for the tickets to watch as well!
-Rob

I’m pretty sure that 90% of users of this forum (including myself) share your hobby :wink:

Yes, it would be better, but only if you don’t use the HWE kernel (which is the same kernel shipped with Ubuntu Noble). This isn’t an issue with Ubuntu Noble itself, but a kernel+ZFS incompatibility. Since Ubuntu 22.04 uses the older 5.15 kernel, it doesn’t have this problem and works reliably with ZFS.

Thank you. I’ll try that as well!

Yes, Ubuntu 22.04 Fan mode works fine too.

Yes, we do suggest using ZFS, but that is with the expectation of a kernel that has a stable ZFS implementation (which Ubuntu 22.04 has).

Cool. That seems to be the route I’m moving towards. A plain bridge doesn’t communicate between the servers; Fan does. I’ll likely want to confine the default network though… 16 million IP addresses is a bit unwieldy! (240.0.0.0/8)
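If you want to keep Fan but pin down its address space, the relevant knobs should be the `fan.*` options on the managed network. A sketch, assuming the network is named `lxdfan0` and the hosts sit on `192.168.1.0/24` (both names are assumptions; check `lxc network show` for your actual settings, and note that Fan’s address mapping carves a per-host block out of the overlay, so the overlay can’t be made arbitrarily small):

```shell
# Hypothetical: recreate the fan bridge with explicit subnets.
# fan.overlay_subnet defaults to 240.0.0.0/8; fan.underlay_subnet
# defaults to the subnet of the host's default gateway.
lxc network create lxdfan0 bridge.mode=fan \
    fan.overlay_subnet=240.0.0.0/8 \
    fan.underlay_subnet=192.168.1.0/24
```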

I don’t have enough machines/network ports to get MicroOVN to work automagically (nor am I network savvy enough to have the patience to figure it out). :slight_smile:
-Rob

Thought I would offer an update.

I found that ZFS isn’t super stable/reliable on Ubuntu 22.04 either (although it seems pretty good in 24.04). So I switched over to btrfs and turned off quotas (based on the LXD docs), since what I’m working with is VMs. As far as funky disk behavior goes, this seems stable; I haven’t seen anything resembling the “unable to find zvol” errors I was getting.
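In case it helps anyone following along: LXD’s btrfs driver implements quotas with btrfs qgroups, and turning them off happens at the filesystem level. A sketch, assuming a snap install with a pool named `default` (the path is hypothetical; adjust to wherever your pool is actually mounted):

```shell
# Disable btrfs quota (qgroup) accounting on the pool's filesystem.
# Path assumes the LXD snap's default storage-pool mountpoint.
sudo btrfs quota disable /var/snap/lxd/common/lxd/storage-pools/default
```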

What I’m now finding is that the Ubuntu Fan seems to have a bunch of “hiccups”. For instance:

Error: Action Failed get_task: Task e0c8dfc6-a50e-4ad9-474e-f412627cded1 result: Preparing apply spec: Preparing package nats-v2-migrate: Fetching package blob: Getting blob from inner blobstore: Getting blob from inner blobstore: Shelling out to bosh-blobstore-dav cli: Running command: ‘bosh-blobstore-dav -c /var/vcap/bosh/etc/blobstore-dav.json get a630f86c-4678-4f3e-94de-2774ea4ca362 /var/vcap/data/tmp/bosh-blobstore-externalBlobstore-Get3073628143’, stdout: ‘Error running app - read tcp 240.4.0.20:48154->240.4.0.4:25250: read: connection reset by peer’, stderr: ‘’: exit status 1

I get random connection-reset-by-peer messages. Note that on my single-host LXD setup (ZFS, Ubuntu 24.04, just a network bridge), I don’t think I’ve ever seen that occur.

My only assumption is that this is something with the fan. Thoughts?

Current versions (this is a two-node cluster where hydra2 is identical):

$ uname -a
Linux hydra1 5.15.0-126-generic #136-Ubuntu SMP Wed Nov 6 10:38:22 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.5 LTS
Release:	22.04
Codename:	jammy
$ lxc version
Client version: 5.21.2 LTS
Server version: 5.21.2 LTS

Thanks!
-Rob

There have been issues with Ubuntu Fan in Noble, but I understand they have been fixed now.

I suggest making sure you have an updated kernel, or trying Ubuntu 22.04 to see if the issue is resolved there.

I think I figured it out. (This is all automated to some degree, so sometimes it’s a process of discovery…)

If I let the BOSH VM assign its own IP, the routing isn’t quite correct for the Ubuntu Fan. However, since it’s a managed network, I can configure everything to work via DHCP (I “sneakily” set ipv4.address to the expected value). That third route (starting with 240.4.0.1) didn’t exist before being sneaky.

bosh/0:~# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc mq state UP group default qlen 1000
    link/ether 00:16:3e:38:51:cb brd ff:ff:ff:ff:ff:ff
    altname enp5s0
    inet 240.4.0.4/8 metric 1024 brd 240.255.255.255 scope global dynamic eth0
       valid_lft 3085sec preferred_lft 3085sec
bosh/0:~# ip route
default via 240.4.0.1 dev eth0 proto dhcp src 240.4.0.4 metric 1024 
240.0.0.0/8 dev eth0 proto kernel scope link src 240.4.0.4 metric 1024 
240.4.0.1 dev eth0 proto dhcp scope link src 240.4.0.4 metric 1024 
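The “sneaky” ipv4.address trick above can presumably be done with a device override on the instance’s NIC, so the managed network’s DHCP always hands out the expected lease. A sketch, assuming the instance is named `bosh` and its NIC device is `eth0` (both names are assumptions):

```shell
# Pin the DHCP-assigned address on the fan network to the value
# the BOSH director expects (instance/device names are hypothetical).
lxc config device override bosh eth0 ipv4.address=240.4.0.4
lxc restart bosh
```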

Thanks!
-Rob
