Now that we have isolated the major difference down to systemd.unified_cgroup_hierarchy=0, the average degradation on 24.04 for the 40-pair ping-pong test is about 16.5%.
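For readers unfamiliar with the workload, here is a minimal sketch of what one ping-pong pair might look like: two processes bouncing a one-byte token over a pair of pipes, so the scheduler switches between them on every pass. The actual 40-pair harness is not posted in this thread, so the loop count and timing details below are illustrative assumptions, not the real test.

/* Sketch of a single ping-pong pair. Illustrative only; the real 40-pair
 * test harness used in this thread is not shown here. */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#include <time.h>

#define LOOPS 1000000L          /* placeholder iteration count */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    int ab[2], ba[2];           /* parent->child and child->parent pipes */
    char token = 'x';

    if (pipe(ab) || pipe(ba)) { perror("pipe"); return 1; }

    if (fork() == 0) {          /* child: echo the token back until EOF */
        for (;;) {
            if (read(ab[0], &token, 1) != 1) _exit(0);
            if (write(ba[1], &token, 1) != 1) _exit(0);
        }
    }

    double start = now_sec();
    for (long i = 0; i < LOOPS; i++) {   /* parent: send token, wait for echo */
        if (write(ab[1], &token, 1) != 1 || read(ba[0], &token, 1) != 1) {
            perror("round trip");
            return 1;
        }
    }
    double elapsed = now_sec() - start;

    printf("%.3f usec per round trip\n", elapsed / LOOPS * 1e6);
    close(ab[1]);               /* child sees EOF on its next read and exits */
    wait(NULL);
    return 0;
}

With 40 such pairs running concurrently, the result is dominated by wakeup and context-switch behaviour, which is why a scheduler change can show up so strongly in this test.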
It seems mainline kernel 6.5 does not suffer from the main performance degradation issue of this thread, but mainline kernel 6.6-rc1 does. That gives us the starting points for a kernel bisection.
I did this latest work on a Debian installation, and my build environment is on my 20.04 server. If I am going to bisect the kernel, then I'd want to set up the ability to compile kernels on the 24.04 server and go from there. It would take me a while (about a week).
Wait, wait, wait... in 6.6 we had the switch from CFS to EEVDF! Basically, 6.6 has a totally different scheduler. So even if the settings are identical it might do something completely different, and the cgroup hierarchy can definitely affect the overall performance.
Very interesting. I only have a couple of steps left in the kernel bisection and did notice this:
Bisecting: 3 revisions left to test after this (roughly 2 steps)
[147f3efaa24182a21706bca15eab2f3f4630b5fe] sched/fair: Implement an EEVDF-like scheduling policy
Anyway, at this point I'll finish the bisection.
EDIT: Indeed, the scheduler change is the issue for this workflow:
doug@s19:~/kernel/linux$ git bisect good
147f3efaa24182a21706bca15eab2f3f4630b5fe is the first bad commit
commit 147f3efaa24182a21706bca15eab2f3f4630b5fe
Author: Peter Zijlstra <peterz@infradead.org>
Date: Wed May 31 13:58:44 2023 +0200
sched/fair: Implement an EEVDF-like scheduling policy
Where CFS is currently a WFQ based scheduler with only a single knob,
the weight. The addition of a second, latency oriented parameter,
makes something like WF2Q or EEVDF based a much better fit.
Specifically, EEVDF does EDF like scheduling in the left half of the
tree -- those entities that are owed service. Except because this is a
virtual time scheduler, the deadlines are in virtual time as well,
which is what allows over-subscription.
EEVDF has two parameters:
- weight, or time-slope: which is mapped to nice just as before
- request size, or slice length: which is used to compute
the virtual deadline as: vd_i = ve_i + r_i/w_i
Basically, by setting a smaller slice, the deadline will be earlier
and the task will be more eligible and ran earlier.
Tick driven preemption is driven by request/slice completion; while
wakeup preemption is driven by the deadline.
Because the tree is now effectively an interval tree, and the
selection is no longer 'leftmost', over-scheduling is less of a
problem.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.931005524@infradead.org
include/linux/sched.h | 4 +
kernel/sched/core.c | 1 +
kernel/sched/debug.c | 6 +-
kernel/sched/fair.c | 338 +++++++++++++++++++++++++++++++++++++++++-------
kernel/sched/features.h | 3 +
kernel/sched/sched.h | 4 +-
6 files changed, 308 insertions(+), 48 deletions(-)
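To make the quoted deadline formula concrete, here is a toy calculation of vd_i = ve_i + r_i/w_i for two tasks of equal weight that request different slices. The weight and slice values are purely illustrative and are not kernel defaults.

/* Illustrative only: plugs made-up numbers into the vd_i = ve_i + r_i/w_i
 * formula from the commit message above. Values are not kernel defaults. */
#include <stdio.h>

int main(void)
{
    double ve = 0.0;          /* both tasks eligible at the same virtual time */
    double weight = 1024.0;   /* same weight for both tasks (e.g. nice 0) */
    double slice_a = 3.0e6;   /* task A requests a 3 ms slice (in ns) */
    double slice_b = 0.75e6;  /* task B requests a 0.75 ms slice (in ns) */

    double vd_a = ve + slice_a / weight;
    double vd_b = ve + slice_b / weight;

    printf("vd_a = %.1f, vd_b = %.1f\n", vd_a, vd_b);
    /* vd_b < vd_a: the smaller requested slice yields the earlier virtual
     * deadline, so that task is preferred at the next pick. */
    return 0;
}

The task requesting the smaller slice gets the earlier virtual deadline and so runs sooner, which is the latency-oriented second parameter the commit message says CFS did not have.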
While the scheduler change seems to be the original cause of the performance change, there is also the effect of systemd.unified_cgroup_hierarchy=0. After working on this intensely through January and February, I haven't been able to get back to it in March. I still hope to return to this at some point.