24.04 considerably slower than 20.04 or 22.04 for some high system percentage usage cases

Okay, now we are getting somewhere. This is the grub command line:

GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp systemd.unified_cgroup_hierarchy=0 cgroup_disable=memory cgroup_disable=pressure cgroup_no_v1=all msr.allow_writes=on cpuidle.governor=teo"

and now on 24.04 I get:

Samples: 80  ; Ave: 15.00202  ; Var:  0.20312  ; S Dev:  0.45069 ; Min: 14.05610 ; Max: 15.70310 ; Range:  1.64700 ; Comp to ref:   1.56%

Instead of the roughly 20% worse result reported in the earlier post.

It really seems to be cgroup-related then. Have you also tried the same boot options on 20.04?

The new “Master” reference average, used in an earlier post, was measured on 20.04 on an internal NVMe drive with this grub command line:

GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp cgroup_disable=memory cgroup_disable=pressure cgroup_no_v1=all msr.allow_writes=on cpuidle.governor=teo"

That average over a few runs was 14.77114333 uSec per loop. The test is the 40 ping-pong pairs one, with 30,000,000 loops per pair.
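The actual test program isn't shown in this thread; as a rough illustration of what one ping-pong pair does, here is a minimal Python sketch (the real test presumably uses a compiled program, and runs 40 such pairs with 30,000,000 loops each):

```python
import os
import time

def ping_pong_pair(loops):
    """One ping-pong pair: parent and child bounce a byte over two pipes.
    Each loop forces two context switches, so the per-loop time is
    dominated by scheduler wakeup latency."""
    p2c_r, p2c_w = os.pipe()   # parent -> child
    c2p_r, c2p_w = os.pipe()   # child -> parent
    pid = os.fork()
    if pid == 0:               # child: echo every byte back
        for _ in range(loops):
            os.read(p2c_r, 1)
            os.write(c2p_w, b"x")
        os._exit(0)
    start = time.perf_counter()
    for _ in range(loops):
        os.write(p2c_w, b"x")
        os.read(c2p_r, 1)
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    return elapsed / loops * 1e6   # microseconds per loop

if __name__ == "__main__":
    # far fewer loops than the real test, just to keep the sketch quick
    print(f"{ping_pong_pair(100_000):.5f} uSec per loop")
```

Python's interpreter overhead inflates the absolute numbers compared to a C test, but the pattern of forced context switches is the same.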

These are the test results using this grub command line:

GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp systemd.unified_cgroup_hierarchy=0 cgroup_disable=memory cgroup_disable=pressure cgroup_no_v1=all msr.allow_writes=on cpuidle.governor=teo"
24.04 on internal nvme drive (4 test runs, with result 1 being a repeat from the earlier post):
Samples: 80  ; Ave: 15.00202  ; Var:  0.20312  ; S Dev:  0.45069 ; Min: 14.05610 ; Max: 15.70310 ; Range:  1.64700 ; Comp to ref:   1.56%
Samples: 80  ; Ave: 15.13712  ; Var:  0.19680  ; S Dev:  0.44362 ; Min: 14.00730 ; Max: 15.73140 ; Range:  1.72410 ; Comp to ref:   2.48%
Samples: 80  ; Ave: 14.97392  ; Var:  0.23093  ; S Dev:  0.48056 ; Min: 13.89290 ; Max: 15.70420 ; Range:  1.81130 ; Comp to ref:   1.37%
Samples: 80  ; Ave: 15.14079  ; Var:  0.17549  ; S Dev:  0.41892 ; Min: 14.10390 ; Max: 15.80830 ; Range:  1.70440 ; Comp to ref:   2.50%

20.04 on internal nvme drive (4 test runs):
Samples: 80  ; Ave: 14.83415  ; Var:  0.21711  ; S Dev:  0.46595 ; Min: 13.88660 ; Max: 15.56870 ; Range:  1.68210 ; Comp to ref:   0.43%
Samples: 80  ; Ave: 14.83119  ; Var:  0.21669  ; S Dev:  0.46550 ; Min: 13.82980 ; Max: 15.52110 ; Range:  1.69130 ; Comp to ref:   0.41%
Samples: 80  ; Ave: 15.00895  ; Var:  0.11571  ; S Dev:  0.34017 ; Min: 13.84090 ; Max: 15.60840 ; Range:  1.76750 ; Comp to ref:   1.61%
Samples: 80  ; Ave: 14.92387  ; Var:  0.15777  ; S Dev:  0.39720 ; Min: 14.04810 ; Max: 15.59750 ; Range:  1.54940 ; Comp to ref:   1.03%

For an earlier suggestion:

grep . /proc/sys/kernel/sched*

the results are exactly the same between 20.04 and 24.04:

/proc/sys/kernel/sched_autogroup_enabled:1
/proc/sys/kernel/sched_cfs_bandwidth_slice_us:5000
/proc/sys/kernel/sched_child_runs_first:0
/proc/sys/kernel/sched_deadline_period_max_us:4194304
/proc/sys/kernel/sched_deadline_period_min_us:100
/proc/sys/kernel/sched_energy_aware:1
/proc/sys/kernel/sched_rr_timeslice_ms:100
/proc/sys/kernel/sched_rt_period_us:1000000
/proc/sys/kernel/sched_rt_runtime_us:950000
/proc/sys/kernel/sched_schedstats:0
/proc/sys/kernel/sched_util_clamp_max:1024
/proc/sys/kernel/sched_util_clamp_min:1024
/proc/sys/kernel/sched_util_clamp_min_rt_default:1024

Now that we have isolated the major difference to systemd.unified_cgroup_hierarchy=0, the average degradation on 24.04 for the 40-pair ping-pong test, when booted without that option, is about 16.5%.

I do not know how to investigate further.

It seems mainline kernel 6.5 does not suffer from the main performance degradation issue of this thread, but mainline kernel 6.6-rc1 does. That defines the starting points for a kernel bisection.
I did this latest work on a Debian installation, and my build environment is on my 20.04 server. If I am going to bisect the kernel, I’d want to set up the ability to compile kernels on the 24.04 server and go from there. It would take me a while (about a week).
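For reference, the bisection mechanics look like the sketch below, run here against a throwaway repo so it is self-contained. On the real kernel tree the endpoints would be `git bisect good v6.5` and `git bisect bad v6.6-rc1`, with a build, boot, and benchmark cycle at every step; the commit numbers and the "regression at commit 4" rule are made up for the demo.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
# five commits; pretend the regression appeared in commit 4
for i in 1 2 3 4 5; do
    echo "$i" > state
    git add state
    git commit -qm "commit $i"
done
git bisect start HEAD HEAD~4 >/dev/null   # bad = commit 5, good = commit 1
while :; do
    # stand-in for "build, boot, run the 40-pair ping-pong test"
    if [ "$(cat state)" -ge 4 ]; then verdict=bad; else verdict=good; fi
    out=$(git bisect "$verdict")
    case "$out" in *"first bad commit"*) break ;; esac
done
bad_sha=$(printf '%s\n' "$out" | sed -n 's/ is the first bad commit.*//p' | head -n 1)
# prints the first bad commit's subject ("commit 4")
git show -s --format='first bad: %s' "$bad_sha"
```

With v6.5 good and v6.6-rc1 bad, git reports roughly a dozen build-and-test steps for the real tree.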

Wait wait wait… in 6.6 we had the switch from CFS to EEVDF! Basically 6.6 has a totally different scheduler. So, even if the settings are identical, it might behave completely differently, and the cgroup hierarchy can definitely affect overall performance.

Very interesting. I only have a couple of steps left in the kernel bisection and did notice this:

Bisecting: 3 revisions left to test after this (roughly 2 steps)
[147f3efaa24182a21706bca15eab2f3f4630b5fe] sched/fair: Implement an EEVDF-like scheduling policy

Anyway, at this point I’ll finish the bisection.

EDIT: Indeed, the scheduler change is the issue for this workflow:

doug@s19:~/kernel/linux$ git bisect good
147f3efaa24182a21706bca15eab2f3f4630b5fe is the first bad commit
commit 147f3efaa24182a21706bca15eab2f3f4630b5fe
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed May 31 13:58:44 2023 +0200

    sched/fair: Implement an EEVDF-like scheduling policy

    Where CFS is currently a WFQ based scheduler with only a single knob,
    the weight. The addition of a second, latency oriented parameter,
    makes something like WF2Q or EEVDF based a much better fit.

    Specifically, EEVDF does EDF like scheduling in the left half of the
    tree -- those entities that are owed service. Except because this is a
    virtual time scheduler, the deadlines are in virtual time as well,
    which is what allows over-subscription.

    EEVDF has two parameters:

     - weight, or time-slope: which is mapped to nice just as before

     - request size, or slice length: which is used to compute
       the virtual deadline as: vd_i = ve_i + r_i/w_i

    Basically, by setting a smaller slice, the deadline will be earlier
    and the task will be more eligible and ran earlier.

    Tick driven preemption is driven by request/slice completion; while
    wakeup preemption is driven by the deadline.

    Because the tree is now effectively an interval tree, and the
    selection is no longer 'leftmost', over-scheduling is less of a
    problem.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230531124603.931005524@infradead.org

 include/linux/sched.h   |   4 +
 kernel/sched/core.c     |   1 +
 kernel/sched/debug.c    |   6 +-
 kernel/sched/fair.c     | 338 +++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/features.h |   3 +
 kernel/sched/sched.h    |   4 +-
 6 files changed, 308 insertions(+), 48 deletions(-)
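The virtual-deadline formula in the commit message, vd_i = ve_i + r_i/w_i, can be illustrated with a toy calculation. The numbers here are hypothetical; the kernel uses fixed-point weights from the nice-to-weight table rather than plain floats:

```python
def virtual_deadline(ve, slice_ns, weight):
    """vd_i = ve_i + r_i / w_i: virtual eligible time plus the
    requested slice scaled down by the task's weight."""
    return ve + slice_ns / weight

# two equally weighted tasks, eligible at the same virtual time:
# the one requesting a shorter slice gets the earlier virtual deadline,
# so it is selected to run sooner
vd_short = virtual_deadline(ve=0, slice_ns=1_000_000, weight=1024)
vd_long  = virtual_deadline(ve=0, slice_ns=4_000_000, weight=1024)
assert vd_short < vd_long
```

That deadline ordering is the mechanism by which EEVDF trades throughput for latency per task, which is exactly the kind of behavior change that could show up in a context-switch-heavy ping-pong test.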