Enable low latency features in the generic Ubuntu kernel for 24.04

arighi · February 6, 2024, 5:19pm

Overview

As part of an ongoing investigation into making low latency capabilities available in the generic kernel for 24.04 [1] [2], we have conducted more tests, focusing on the potential overhead of switching from CONFIG_HZ=250 to CONFIG_HZ=1000, while also including NO_HZ_FULL and RCU_LAZY in the analysis.

The primary goal is to enhance the versatility of the default Ubuntu kernel, by introducing additional boot-time and run-time options that will allow users to optimize their system for improved responsiveness, throughput, or power efficiency.

Quick recap of the features analyzed in this article (more details in [1]):

 - CONFIG_HZ=N: the scheduler tick fires N times a second, so higher priority tasks have a chance to get the CPU N times a second (in general, increasing N makes the system more responsive at the cost of throughput)
 - NO_HZ_FULL: shut down the tick on selected CPUs (configured at boot with nohz_full=cpus) while they have 0 or 1 runnable tasks
 - rcu_nocbs (boot time option): move RCU callbacks from softirq context to kthreads (reduces softirq execution)
 - RCU_LAZY (boot time option): batch RCU callbacks and flush them after a timed delay (avoids constantly executing RCU callbacks, saving power)
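To make the CONFIG_HZ trade-off concrete: the tick period is simply 1/HZ seconds. The following is a minimal Python sketch of that arithmetic; the function name `tick_period_ms` is made up for this illustration and is not part of any kernel tooling.

```python
def tick_period_ms(hz: int) -> float:
    """Return the interval between two scheduler ticks, in milliseconds."""
    if hz <= 0:
        raise ValueError("HZ must be positive")
    return 1000.0 / hz


# The two values compared in this post.
for hz in (250, 1000):
    print(f"CONFIG_HZ={hz}: one tick every {tick_period_ms(hz):g} ms")
```

With HZ=250 a CPU-bound task can run for up to 4 ms before the next tick gives the scheduler a chance to preempt it; with HZ=1000 that window shrinks to 1 ms, at the price of four times as many tick interrupts on a busy CPU.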

Test plan

The following tests have been conducted on a system with an Intel(R) Core™ i7-10510U CPU @ 1.80GHz (8 logical CPUs), 16GB of RAM and a 512GB KIOXIA KXG60ZNV512G NVMe drive, running the latest daily image of Ubuntu Noble.

The kernel is the latest linux-unstable 6.8.0-4 with and without the extra kernel config options mentioned in [1].

These are the different configurations that have been tested:

 - HZ=250: the Ubuntu generic kernel, as it is right now
 - HZ=1000: generic kernel with HZ=1000 and the “lowlatency” features not active
 - HZ=1000 (nohz_full): generic kernel with HZ=1000 and nohz_full=all (tickless CPUs)
 - HZ=1000 (lazy_RCU): generic kernel with HZ=1000, rcu_nocbs=all and lazy RCU enabled

All the tests have been executed with most services stopped and the cpufreq governor set to performance, to minimize noise and interference.

These tests focus only on a few specific workloads, since more generic benchmarks are already well covered by the Phoronix benchmark results [2].

In theory we expect better responsiveness with HZ=1000, with slightly reduced throughput compared to HZ=250, which could be compensated for by enabling nohz_full.

In terms of power consumption we expect slightly better energy savings with HZ=1000 (CPUs can react more promptly and go idle sooner) and even better savings with rcu_nocbs + lazy RCU enabled, at the cost of some performance.

Results

The first set of tests focuses on power consumption, both when the system is idle and when it is busy (results measured with turbostat, averaged over a period of 5 minutes).

Idle system

 - HZ=250:              2.02W
 - HZ=1000:             2.03W
 - HZ=1000 (nohz_full): 2.26W
 - HZ=1000 (lazy_RCU):  2.02W

Note the extra power consumption in the nohz_full case when the system is idle: at least one CPU keeps ticking and never goes idle, for timekeeping reasons.

For all the other cases, results are pretty much uniform, since the CPUs are just staying idle (not ticking), therefore HZ is irrelevant.

Busy system (doing I/O with fio)

 - HZ=250:              28.29W
 - HZ=1000:             26.84W
 - HZ=1000 (nohz_full): 26.83W
 - HZ=1000 (lazy_RCU):  25.44W

Results when the system is busy seem to support the theory that increasing HZ can actually help to reduce power consumption: HZ=1000 draws ~5% less than HZ=250 (that seems a lot, honestly, and there might be some measurement error to consider). Moreover, lazy RCU saves a further ~5% on top of HZ=1000, for a total saving of ~10% compared to the current generic kernel. That looks very interesting in the laptop/mobile scenario, but even large cloud environments could benefit from it, assuming they can afford the performance penalty.
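The percentages quoted for the busy system can be re-derived from the turbostat averages listed above. This is a small Python sketch of that arithmetic; the helper name `saving_pct` is only illustrative.

```python
# Average power (W) from the busy-system results above.
busy_watts = {
    "HZ=250": 28.29,
    "HZ=1000": 26.84,
    "HZ=1000 (nohz_full)": 26.83,
    "HZ=1000 (lazy_RCU)": 25.44,
}


def saving_pct(baseline: float, value: float) -> float:
    """Percentage of power saved by `value` relative to `baseline`."""
    return (baseline - value) / baseline * 100.0


# HZ=1000 versus the current generic kernel.
hz_saving = saving_pct(busy_watts["HZ=250"], busy_watts["HZ=1000"])
# Lazy RCU on top of HZ=1000.
lazy_extra = saving_pct(busy_watts["HZ=1000"], busy_watts["HZ=1000 (lazy_RCU)"])
# Lazy RCU versus the current generic kernel.
total = saving_pct(busy_watts["HZ=250"], busy_watts["HZ=1000 (lazy_RCU)"])

print(f"HZ=1000 vs HZ=250:   {hz_saving:.1f}%")
print(f"lazy_RCU vs HZ=1000: {lazy_extra:.1f}%")
print(f"lazy_RCU vs HZ=250:  {total:.1f}%")
```

The three values come out at about 5.1%, 5.2% and 10.1%, which matches the rounded figures in the text.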

To measure the performance of a pure “CPU throughput” workload the stress-ng --matrix stressor has been used, measuring the bogo-ops/s.

NOTE: these results should be taken with a grain of salt. Each value is the average of 10 runs, but the standard deviation was pretty high, meaning that this test is very sensitive to small interference and the measurement error may be significant.

stress-ng matrix stressor (bogo-ops/s)

 - HZ=250:              17225.04 bogo-ops/s
 - HZ=1000:             16954.35 bogo-ops/s
 - HZ=1000 (nohz_full): 17502.80 bogo-ops/s
 - HZ=1000 (lazy_RCU):  16841.34 bogo-ops/s

In general this seems to confirm the theory that fewer ticks = better number crunching performance. The goal here, however, was to make sure that HZ=1000 does not introduce major performance regressions, and that seems to be the case: HZ=1000 is within ~2% of HZ=250.

Another metric worth considering is IOPS, in particular for WRITEs (mostly page cache activity, which means a lot of CPU work, locking and synchronization).

A WRITE-intensive I/O workload has been simulated using fio with multiple I/O sizes. Something interesting happened with nohz_full enabled: small I/O operations get a huge performance boost (+~39% compared to plain HZ=1000). Maybe small writes are particularly susceptible to tick interference?

fio (short writes/re-writes)

 - HZ=250:              5418 IOPS
 - HZ=1000:             5593 IOPS
 - HZ=1000 (nohz_full): 7767 IOPS
 - HZ=1000 (lazy_RCU):  5541 IOPS

With large I/O operations, nohz_full instead performs much worse (-~50% compared to plain HZ=1000; this may require further investigation):

fio (large writes/re-writes)

 - HZ=250:              2482 IOPS
 - HZ=1000:             2613 IOPS
 - HZ=1000 (nohz_full): 1334 IOPS
 - HZ=1000 (lazy_RCU):  2254 IOPS

There is also a reduced performance (as expected) with lazy RCU compared to plain HZ=1000: -~14% with large I/O ops, but less than 1% with small I/O ops, which is quite interesting.
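The relative changes quoted for the fio runs use the plain HZ=1000 kernel as the baseline. A short Python sketch of that comparison, using the numbers from the two lists above (the dictionary keys and the `change_pct` helper are just names chosen for this example):

```python
# fio WRITE results (IOPS) from the lists above.
small_writes = {"HZ=250": 5418, "HZ=1000": 5593, "nohz_full": 7767, "lazy_RCU": 5541}
large_writes = {"HZ=250": 2482, "HZ=1000": 2613, "nohz_full": 1334, "lazy_RCU": 2254}


def change_pct(results: dict, config: str, baseline: str = "HZ=1000") -> float:
    """Relative change of `config` versus `baseline`, in percent."""
    base = results[baseline]
    return (results[config] - base) / base * 100.0


for label, results in (("small", small_writes), ("large", large_writes)):
    for config in ("HZ=250", "nohz_full", "lazy_RCU"):
        print(f"{label} writes, {config}: {change_pct(results, config):+.1f}%")
```

Against that baseline, nohz_full comes out at roughly +38.9% for small writes and -48.9% for large writes, while lazy RCU costs about 0.9% and 13.7% respectively.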

Apart from the nohz_full oddity, everything else looks pretty smooth, with no big issue or concern to report and, as expected, a small improvement in I/O performance with HZ=1000 in general.

One last metric to measure is the overhead of the tick interrupt itself, to make sure we don’t add any major overhead by enabling the extra config options.

For this, the following simple bpftrace script has been used: it measures how much time each hrtimer_interrupt invocation takes and, when stopped, prints the time distribution of the invocations.

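#!/usr/bin/bpftrace

// Record the entry timestamp (ns) of each hrtimer_interrupt invocation,
// keyed by the thread that was running when the interrupt fired.
kprobe:hrtimer_interrupt {
	@start[tid] = nsecs;
}

// On return, add the elapsed time (ns) to a power-of-2 histogram.
// The filter skips returns whose entry was not recorded.
kretprobe:hrtimer_interrupt /@start[tid]/ {
	@elapsed = hist(nsecs - @start[tid]);
	delete(@start[tid]);
}

// @elapsed is printed automatically when the script is stopped (Ctrl-C).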

hrtimer_interrupt overhead (idle system, time per invocation in ns)

 - HZ=250:
 
[2K, 4K)               8 |                                                    |
[4K, 8K)              67 |@@                                                  | 
[8K, 16K)           1446 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

 - HZ=1000:

[2K, 4K)             322 |@@@@@@@@@@@                                         |
[4K, 8K)             179 |@@@@@@                                              |
[8K, 16K)           1423 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

 - HZ=1000 (nohz_full):

[4K, 8K)            4072 |@@@@                                                |
[8K, 16K)           1701 |@                                                   |
[16K, 32K)         52500 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)           543 |                                                    |

 - HZ=1000 (lazy_RCU):

[8K, 16K)            193 |@@@@@@@@@                                           |
[16K, 32K)          1045 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)           812 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@            |

hrtimer_interrupt overhead (busy system running fio, time per invocation in ns)

 - HZ=250:

[2K, 4K)          117924 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)             322 |                                                    |

 - HZ=1000:

[1K, 2K)           88307 |@@@@@@@@@@@@                                        |
[2K, 4K)          380695 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)             194 |                                                    |

 - HZ=1000 (nohz_full):

[512, 1K)            317 |                                                    |
[1K, 2K)             294 |                                                    |
[2K, 4K)           59568 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)             368 |                                                    |

 - HZ=1000 (lazy_RCU):

[2K, 4K)            3094 |                                                    |
[4K, 8K)          265763 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)         182341 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 |
[16K, 32K)          3422 |                                                    |

A few things stand out here:

 - The nohz_full case clearly shows more invocations when the system is idle (due to the timekeeping CPU that keeps ticking).
 - The tick handler seems to be more expensive in the lazy_RCU case, roughly 2x slower (maybe enabling lazy RCU adds some extra logic to the tick handler? This requires further investigation).
 - For all the other cases the time distribution seems to be pretty uniform.
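These observations can be checked mechanically against the histograms. The Python sketch below parses bpftrace hist() output and reports the total number of invocations and the busiest bucket. The helper names (`parse_hist`, `summarize`) are invented for this example, the parser only handles plain and "K" (x1024) bucket bounds, and comparing totals across configurations assumes that each trace ran for roughly the same amount of time, which the post does not state.

```python
import re

# Matches bucket lines such as "[8K, 16K)  1446 |@@@@  |" printed by
# bpftrace's hist(); only plain and "K" (x1024) bounds are handled here.
BUCKET_RE = re.compile(r"\[(\d+)(K?),\s*(\d+)(K?)\)\s+(\d+)\s*\|")


def parse_hist(text):
    """Return a list of (low_ns, high_ns, count) tuples, one per bucket."""
    buckets = []
    for line in text.splitlines():
        m = BUCKET_RE.search(line)
        if m is None:
            continue
        low = int(m.group(1)) * (1024 if m.group(2) else 1)
        high = int(m.group(3)) * (1024 if m.group(4) else 1)
        buckets.append((low, high, int(m.group(5))))
    return buckets


def summarize(text):
    """Return (total invocations, (low_ns, high_ns) of the busiest bucket)."""
    buckets = parse_hist(text)
    total = sum(count for _, _, count in buckets)
    low, high, _ = max(buckets, key=lambda b: b[2])
    return total, (low, high)


# Idle-system histograms copied from the results above (bars trimmed).
idle_hz250 = """
[2K, 4K)               8 |
[4K, 8K)              67 |@@
[8K, 16K)           1446 |@@@@
"""
idle_nohz_full = """
[4K, 8K)            4072 |@@@@
[8K, 16K)           1701 |@
[16K, 32K)         52500 |@@@@
[32K, 64K)           543 |
"""

for name, text in (("HZ=250", idle_hz250), ("HZ=1000 (nohz_full)", idle_nohz_full)):
    total, (low, high) = summarize(text)
    print(f"{name}: {total} invocations, busiest bucket [{low}, {high}) ns")
```

On the idle data this gives 1521 invocations for HZ=250 against 58816 for nohz_full. Feeding the busy lazy_RCU histogram to the same helper shows its busiest bucket one step higher than for the other busy configurations, which is where the "roughly 2x" estimate comes from, since each bucket spans a factor of two.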

Conclusion

Test results do not show any significant performance regression with HZ=1000 compared to HZ=250.

Enabling the extra config options also doesn’t seem to introduce significant performance regressions, and it would give users the flexibility to tune the system at boot time / run time to prioritize 1) throughput, 2) responsiveness, or 3) power consumption, making the generic Ubuntu kernel even more “generic”.

There might be some special corner cases where these changes can cause performance regressions, but for the majority of the cases they can provide real performance benefits and a much greater flexibility.

Therefore, it seems reasonable to consider including these changes in the next Ubuntu kernel for the 24.04 release.

References

[1] Launchpad Bug #2051342, “Enable lowlatency settings in the generic kernel” (linux package, Ubuntu)
[2] https://www.phoronix.com/news/Ubuntu-Generic-LL-Kernel

eeickmeyer · February 6, 2024, 6:44pm

We’ve been talking off and on for quite some time now about merging the features of the lowlatency kernel into the generic kernel, and as I’ve stated before, I’m all for it, as it would reduce the burden on the kernel team: you all would be maintaining one less kernel flavor. Despite what others may have said about me, I actually do care about you guys.

My biggest concern, of course, is the benefit for audio devices and the throughput for those. The goal, of course, is low latency on the audio devices that support it, which are usually professional audio interfaces such as those released by Steinberg, Presonus, Behringer, and other audio companies. Usually they are class-compliant devices, but sometimes they require a proprietary driver or DKMS kernel module, which understandably we wouldn’t support.

If these audio devices are PCI or PCIe devices, users have been known to get them down to 0 or 0.1 ms of round-trip latency with the lowlatency kernel, but usually higher latency with the generic kernel without incurring buffer overruns/underruns (xruns). A known limitation of USB devices is they have a minimum of 1ms of latency, though this may have changed with recent hardware. HDMI devices require a much, much higher latency due to a 4 MB required audio buffer, as I understand it.

One kernel that seems to achieve some very low latency that many users have suggested is the Liquorix Kernel (GitHub repo). Though we don’t support it on Ubuntu Studio, we have seen some positive results. I can’t make heads or tails of the config, but maybe there’s something useful in there.

Anyhow, just my thoughts. I hope this helps with something.

arighi · February 6, 2024, 6:56pm

Thanks @eeickmeyer , just to be clear, we are not planning to deprecate the lowlatency kernel for 24.04. For now the main goal of these changes is to make the generic kernel more flexible (aka more “generic”) and potentially make it usable also in certain “low latency” scenarios, considering that, with these changes applied, it would basically provide all the features that lowlatency provides.

I’ll take a look at the Liquorix kernel. If there’s something interesting, we may investigate further, conduct some tests and possibly include it in our kernel as well (either lowlatency or even generic).

eeickmeyer · February 6, 2024, 7:05pm

Excellent!

Right, I wouldn’t think this at all. If you were, I’d have expected a lot more heads-up because I’d have some work to do in livecd-rootfs. Like, a lot of work. That said, that was never a concern.

This is wonderful, and definitely a welcome endeavour! We have the ubuntustudio-installer package to make Ubuntu Studio’s features installable on any flavor regardless of desktop environment. Having all or most of the lowlatency kernel’s features in the generic kernel is synergic in this endeavor.

Thank you! There must be something here as it’s generally praised throughout the Linux Audio and Gaming communities. I tried it out myself and it honestly saved my ability to do my workshop in Riga while I was preparing for it when the lowlatency kernel had a regression in 23.10 prior to release. Honestly, I was impressed with its performance on my laptop and it didn’t noticeably sacrifice much power, so there must be something to it.

arighi · March 7, 2024, 10:16am

Update: this proposal has been approved and applied to the latest 6.8 generic kernel for Ubuntu 24.04.

https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/noble/commit/?h=master-next&id=2396118b8bc59c35fb50839e8f190922251c3fad

corradoventu · March 7, 2024, 3:47pm

Well! Now we need a manual or a wiki page to understand how to use these new features.

arighi · March 7, 2024, 3:57pm

Working on a new post about that.

It would also be nice to have a user-space component that could help to automatically set these options based on generic profiles selected by the user (e.g., throughput, lowlatency, powersave). Just an idea to think about.