Overview
As part of an ongoing investigation into making low latency capabilities available in the generic kernel for 24.04 [1] [2], we have conducted more tests, focusing on the potential overhead that could be introduced by switching from CONFIG_HZ=250 to CONFIG_HZ=1000, while also incorporating NO_HZ_FULL and RCU_LAZY into the analysis.
The primary goal is to enhance the versatility of the default Ubuntu kernel, by introducing additional boot-time and run-time options that will allow users to optimize their system for improved responsiveness, throughput, or power efficiency.
Quick recap of the features analyzed in this article (more details in [1]; an example kernel command line is shown after the list):
- CONFIG_HZ=N: higher priority tasks have a chance to get the CPU N times a second (in general, increasing N makes the system more responsive at the cost of throughput)
- NO_HZ_FULL: shut down ticks on certain target CPUs (configured at boot with nohz_full=cpus) when 0 or 1 task is running
- rcu_nocb (boot time option): move RCU callbacks from softirq context to kthreads (reducing softirq execution)
- RCU_LAZY (boot time option): batch RCU callbacks and flush them after a timed delay (reducing the constant execution of RCU callbacks, saving power)
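For illustration, a kernel command line along these lines enables the boot-time options above via GRUB (the CPU list is only an example that keeps CPU 0 as housekeeping CPU, not the exact setup used for these tests):

# /etc/default/grub: run CPUs 1-7 tickless and offload their RCU callbacks
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nohz_full=1-7 rcu_nocbs=1-7"
# regenerate the GRUB configuration and reboot to apply
sudo update-grub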
Test plan
The following tests have been conducted on an 8-core Intel(R) Core™ i7-10510U CPU @ 1.80GHz system with 16GB of RAM and a KIOXIA KXG60ZNV512G 512GB NVMe drive, running the latest daily image of Ubuntu Noble.
The kernel is the latest linux-unstable 6.8.0-4, built with and without the extra kernel config options mentioned in [1].
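As a quick sanity check, the relevant options of each kernel build can be verified from the config file shipped in /boot, for example:

# show the tick frequency and the related config options of the running kernel
grep -E 'CONFIG_HZ=|CONFIG_NO_HZ_FULL|CONFIG_RCU_NOCB_CPU|CONFIG_RCU_LAZY' /boot/config-$(uname -r)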
These are the different configurations that have been tested:
- HZ=250 (the Ubuntu generic kernel, as it is right now)
- HZ=1000 (generic kernel with HZ=1000 and the “lowlatency” features not active)
- HZ=1000 nohz_full (generic kernel with HZ=1000 and nohz_full=all, i.e. tickless CPUs)
- HZ=1000 lazy_rcu (generic kernel with HZ=1000, rcu_nocbs=all and lazy RCU enabled)
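To double check which configuration is actually in effect at runtime, the boot parameters and the list of tickless CPUs can be inspected as follows (purely illustrative, not part of the benchmarks):

# kernel boot parameters currently in effect
cat /proc/cmdline
# CPUs currently running in nohz_full (tickless) mode, empty when the feature is off
cat /sys/devices/system/cpu/nohz_full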
All the tests have been executed after stopping most of the services and setting the cpufreq governor to performance, to mitigate as much as possible any potential noise or interference with the tests.
These tests focus only on a few specific workloads, since more generic benchmarks have already been covered pretty well by the Phoronix benchmark results [2].
Theoretically we expect better responsiveness with HZ=1000, with a slightly reduced throughput with respect to HZ=250, which could be compensated for by enabling the nohz_full capability.
In terms of power consumption we should expect slightly improved energy saving with HZ=1000 (CPUs can react promptly to go idle) and even better energy saving with rcu_nocbs + lazy RCUs enabled, at the cost of a reduced level of performance.
Results
The first set of tests focuses on power consumption, both when the system is idle and when it is busy (results measured with turbostat: average over a period of 5 minutes).
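For reference, an invocation along these lines gives the average package power over a 5 minute window (the exact turbostat options used for the measurements are not reported here, so treat this as an assumption):

# report the average package power (PkgWatt) over a single 300s interval
sudo turbostat --quiet --Summary --show PkgWatt --interval 300 --num_iterations 1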
Idle system
- HZ=250: 2.02W
- HZ=1000: 2.03W
- HZ=1000 (nohz_full): 2.26W
- HZ=1000 (lazy_RCU): 2.02W
It is interesting to notice the extra power consumption in the nohz_full case when the system is idle: for timekeeping reasons, at least one CPU is always ticking and never goes fully idle.
For all the other cases the results are pretty much uniform, since the CPUs are just staying idle (not ticking), therefore HZ is irrelevant.
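The always-ticking timekeeping CPU can be observed directly by watching the local timer interrupt counters, for example (illustrative command, not part of the measurements):

# LOC = local timer interrupts per CPU: with nohz_full, on an idle system only
# the housekeeping CPU keeps incrementing at a steady rate
watch -n 1 'grep "LOC:" /proc/interrupts'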
Busy system (doing I/O with fio)
- HZ=250: 28.29W
- HZ=1000: 26.84W
- HZ=1000 (nohz_full): 26.83W
- HZ=1000 (lazy_RCU): 25.44W
Results when the system is busy seem to support the theory that increasing HZ can actually help reduce power consumption (~5%, which honestly seems like a lot, so some measurement error should be taken into account). Moreover, there is an additional ~5% energy saving with lazy RCUs enabled, for a potential total of ~10% power saving with respect to the current generic kernel. This looks very interesting for the laptop/mobile scenario (but it is worth mentioning that even large cloud environments could benefit from this, assuming they can afford the performance penalty).
To measure the performance of a pure “CPU throughput” workload the stress-ng --matrix stressor has been used, measuring bogo-ops/s.
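An invocation along these lines can be used (run time and number of workers here are assumptions, not necessarily the exact parameters behind the numbers below):

# one matrix worker per CPU for 60 seconds, reporting bogo-ops/s at the end
stress-ng --matrix 0 --timeout 60s --metrics-brief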
NOTE: we should take these results with a grain of salt: this is the average of 10 runs, but the standard deviation was pretty high, meaning that this test is really susceptible to small interferences and the measurement error might be relevant:
stress-ng matrix stressor (bogo-ops/s)
- HZ=250: 17225.04 bogo-ops/s
- HZ=1000: 16954.35 bogo-ops/s
- HZ=1000 (nohz_full): 17502.80 bogo-ops/s
- HZ=1000 (lazy_RCU): 16841.34 bogo-ops/s
In general this seems to confirm the theory that fewer ticks = better number crunching performance, but the goal here was to make sure that we don't have major performance regressions with HZ=1000, and that seems to be the case.
Another metric that is worth considering is IOPS, in particular WRITEs (mostly page cache activity, which means a lot of CPU, locking and synchronization).
A WRITE-intensive I/O workload has been simulated using fio with multiple I/O sizes. Something interesting happened with nohz_full enabled: it seems that small I/O operations get a huge performance boost (+~39%); maybe small writes are particularly susceptible to tick interference?
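A fio job along these lines can be used for the small-write case (block size, file size and run time are illustrative assumptions, not necessarily the exact parameters behind the numbers below):

# random 4k writes/re-writes on a single file, reporting IOPS
fio --name=small-writes --filename=/tmp/fio-test --rw=randwrite --bs=4k \
    --size=1G --ioengine=libaio --iodepth=16 --time_based --runtime=60 \
    --group_reporting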
fio (short writes/re-writes)
- HZ=250: 5418 iops/s
- HZ=1000: 5593 iops/s
- HZ=1000 (nohz_full): 7767 iops/s
- HZ=1000 (lazy_RCU): 5541 iops/s
With large I/O operations nohz_full instead seems to perform much worse (-~50%; this may require further investigation):
fio (large writes/re-writes)
- HZ=250: 2482 iops/s
- HZ=1000: 2613 iops/s
- HZ=1000 (nohz_full): 1334 iops/s
- HZ=1000 (lazy_RCU): 2254 iops/s
There is also reduced performance (as expected) with lazy RCU: -~13% with large I/O ops, but less than 1% with small I/O ops, which is quite interesting.
Apart from the nohz_full oddity everything else seems pretty smooth, with no big issues or concerns to report and, as expected, a little improvement with HZ=1000 in terms of I/O performance in general.
One last metric to measure is the overhead of the tick interrupt itself, to make sure we don’t add any major overhead by enabling the extra config options.
For this, the following simple bpftrace script has been used, which measures how much time each hrtimer_interrupt invocation takes and, when it is stopped, prints the time distribution of the different invocations.
#!/usr/bin/bpftrace
// record the timestamp when the tick handler starts
kprobe:hrtimer_interrupt {
    @start[tid] = nsecs;
}

// on return, add the elapsed time to a histogram (printed on exit)
kretprobe:hrtimer_interrupt /@start[tid]/ {
    @elapsed = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}
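The script can be saved to a file (e.g. tick_overhead.bt, the name is arbitrary) and run as follows, pressing Ctrl-C after the desired observation window to print the histograms:

sudo bpftrace tick_overhead.bt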
hrtimer_interrupt overhead (idle system)
- HZ=250:
[2K, 4K) 8 | |
[4K, 8K) 67 |@@ |
[8K, 16K) 1446 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
- HZ=1000:
[2K, 4K) 322 |@@@@@@@@@@@ |
[4K, 8K) 179 |@@@@@@ |
[8K, 16K) 1423 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
- HZ=1000 (nohz full):
[4K, 8K) 4072 |@@@@ |
[8K, 16K) 1701 |@ |
[16K, 32K) 52500 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K) 543 | |
- HZ=1000 (lazy RCU):
[8K, 16K) 193 |@@@@@@@@@ |
[16K, 32K) 1045 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K) 812 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
hrtimer_interrupt overhead (busy system, running fio)
- HZ=250:
[2K, 4K) 117924 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K) 322 | |
- HZ=1000:
[1K, 2K) 88307 |@@@@@@@@@@@@ |
[2K, 4K) 380695 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K) 194 | |
- HZ=1000 (nohz_full):
[512, 1K) 317 | |
[1K, 2K) 294 | |
[2K, 4K) 59568 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K) 368 | |
- HZ=1000 (lazy_RCU):
[2K, 4K) 3094 | |
[4K, 8K) 265763 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K) 182341 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[16K, 32K) 3422 | |
Something interesting to notice here:
- The nohz_full case clearly shows more invocations when the system is idle (due to the timekeeping CPU constantly ticking).
- The tick handler seems to be more expensive in the lazy_RCU case, roughly 2x slower (maybe enabling lazy RCUs adds some extra logic to the tick handler? This is something that requires further investigation).
- For all the other cases the time distribution seems to be pretty uniform.
Conclusion
Test results do not show any significant performance regression between HZ=1000 and HZ=250.
Enabling the extra config options also does not seem to introduce significant performance regressions, and it would give users the flexibility to tune the system at boot-time / run-time to prioritize 1) throughput, 2) responsiveness, or 3) power consumption, making the generic Ubuntu kernel even more “generic”.
There might be some special corner cases where these changes cause performance regressions, but for the majority of the cases they can provide real performance benefits and much greater flexibility.
Therefore, it seems reasonable to consider including these changes in the next Ubuntu kernel for the 24.04 release.
References
[1] Bug #2051342, “Enable lowlatency settings in the generic kernel”, Launchpad, linux package (Ubuntu)
[2] https://www.phoronix.com/news/Ubuntu-Generic-LL-Kernel