24.04 considerably slower than 20.04 or 22.04 for some high system percentage usage cases

Okay, now we are getting somewhere. This is the grub command line:

GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp systemd.unified_cgroup_hierarchy=0 cgroup_disable=memory cgroup_disable=pressure cgroup_no_v1=all msr.allow_writes=on cpuidle.governor=teo"
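(Aside, for anyone following along: this string goes in GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, and applying it is the standard Ubuntu procedure rather than anything specific to this thread:)

# after editing /etc/default/grub, regenerate the grub config and reboot
sudo update-grub
sudo reboot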

and now on 24.04 I get:

Samples: 80  ; Ave: 15.00202  ; Var:  0.20312  ; S Dev:  0.45069 ; Min: 14.05610 ; Max: 15.70310 ; Range:  1.64700 ; Comp to ref:   1.56%

That is, instead of the ~20% worse range reported in the earlier post.

It really seems to be cgroup related, then. Have you tried the same boot options on 20.04 as well?

The new "Master" reference average, used in an earlier post, was measured on 20.04 on an internal nvme drive with this grub command line:

GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp cgroup_disable=memory cgroup_disable=pressure cgroup_no_v1=all msr.allow_writes=on cpuidle.governor=teo"

That average over a few runs was 14.77114333 uSec per loop. The test is the one with 40 ping-pong pairs and 30,000,000 loops per pair.
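The actual test program is the author's own and is not shown here; for anyone wanting a roughly similar workload, perf's scheduler pipe benchmark is one possible stand-in (an approximation, not the program behind the numbers above, so results will not be directly comparable):

# roughly similar pipe ping-pong load: 40 concurrent pairs,
# 30,000,000 round trips each (each perf instance is one sender/receiver pair)
for i in $(seq 1 40); do
    perf bench sched pipe -l 30000000 &
done
wait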

These are the test results using this grub command line:

GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp systemd.unified_cgroup_hierarchy=0 cgroup_disable=memory cgroup_disable=pressure cgroup_no_v1=all msr.allow_writes=on cpuidle.governor=teo"
24.04 on internal nvme drive (4 test runs; result 1 is a repeat of the earlier post):
Samples: 80  ; Ave: 15.00202  ; Var:  0.20312  ; S Dev:  0.45069 ; Min: 14.05610 ; Max: 15.70310 ; Range:  1.64700 ; Comp to ref:   1.56%
Samples: 80  ; Ave: 15.13712  ; Var:  0.19680  ; S Dev:  0.44362 ; Min: 14.00730 ; Max: 15.73140 ; Range:  1.72410 ; Comp to ref:   2.48%
Samples: 80  ; Ave: 14.97392  ; Var:  0.23093  ; S Dev:  0.48056 ; Min: 13.89290 ; Max: 15.70420 ; Range:  1.81130 ; Comp to ref:   1.37%
Samples: 80  ; Ave: 15.14079  ; Var:  0.17549  ; S Dev:  0.41892 ; Min: 14.10390 ; Max: 15.80830 ; Range:  1.70440 ; Comp to ref:   2.50%

20.04 on internal nvme drive (4 test runs):
Samples: 80  ; Ave: 14.83415  ; Var:  0.21711  ; S Dev:  0.46595 ; Min: 13.88660 ; Max: 15.56870 ; Range:  1.68210 ; Comp to ref:   0.43%
Samples: 80  ; Ave: 14.83119  ; Var:  0.21669  ; S Dev:  0.46550 ; Min: 13.82980 ; Max: 15.52110 ; Range:  1.69130 ; Comp to ref:   0.41%
Samples: 80  ; Ave: 15.00895  ; Var:  0.11571  ; S Dev:  0.34017 ; Min: 13.84090 ; Max: 15.60840 ; Range:  1.76750 ; Comp to ref:   1.61%
Samples: 80  ; Ave: 14.92387  ; Var:  0.15777  ; S Dev:  0.39720 ; Min: 14.04810 ; Max: 15.59750 ; Range:  1.54940 ; Comp to ref:   1.03%

Regarding an earlier suggestion to run:

grep . /proc/sys/kernel/sched*

the results are exactly the same between 20.04 and 24.04:

/proc/sys/kernel/sched_autogroup_enabled:1
/proc/sys/kernel/sched_cfs_bandwidth_slice_us:5000
/proc/sys/kernel/sched_child_runs_first:0
/proc/sys/kernel/sched_deadline_period_max_us:4194304
/proc/sys/kernel/sched_deadline_period_min_us:100
/proc/sys/kernel/sched_energy_aware:1
/proc/sys/kernel/sched_rr_timeslice_ms:100
/proc/sys/kernel/sched_rt_period_us:1000000
/proc/sys/kernel/sched_rt_runtime_us:950000
/proc/sys/kernel/sched_schedstats:0
/proc/sys/kernel/sched_util_clamp_max:1024
/proc/sys/kernel/sched_util_clamp_min:1024
/proc/sys/kernel/sched_util_clamp_min_rt_default:1024

Now that we have isolated the major difference down to systemd.unified_cgroup_hierarchy=0, the average degradation on 24.04 for the 40-pair ping-pong test is about 16.5%.

I do not know how to investigate further.

It seems that mainline kernel 6.5 does not suffer from the main performance degradation discussed in this thread, but mainline kernel 6.6-rc1 does, so the starting points for a kernel bisection are defined.
I did this latest work on a Debian installation, and my build environment is on my 20.04 server. If I am going to bisect the kernel, I'd want to set up a kernel-compile environment on the 24.04 server and go from there. It would take me a while (about a week).
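For reference, a sketch of how that bisection can be driven, assuming a mainline clone in ~/kernel/linux (the path that shows up in the bisect output below) and the v6.5 / v6.6-rc1 endpoints noted above:

cd ~/kernel/linux
git bisect start
git bisect bad v6.6-rc1      # first version that shows the regression
git bisect good v6.5         # last version that does not
# build, boot, and run the 40-pair test on each commit git checks out,
# then mark it, repeating until the first bad commit is reported:
git bisect good              # or: git bisect bad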

Wait wait wait... in 6.6 we had the switch from CFS to EEVDF! Basically, 6.6 has a totally different scheduler. So even if the settings are identical, it might do something completely different, and the cgroup hierarchy can definitely affect overall performance.


Very interesting. I have only a couple of steps left in the kernel bisection, and I noticed this:

Bisecting: 3 revisions left to test after this (roughly 2 steps)
[147f3efaa24182a21706bca15eab2f3f4630b5fe] sched/fair: Implement an EEVDF-like scheduling policy

Anyway, at this point I'll finish the bisection.

EDIT: Indeed, the scheduler change is the issue for this workflow:

doug@s19:~/kernel/linux$ git bisect good
147f3efaa24182a21706bca15eab2f3f4630b5fe is the first bad commit
commit 147f3efaa24182a21706bca15eab2f3f4630b5fe
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed May 31 13:58:44 2023 +0200

    sched/fair: Implement an EEVDF-like scheduling policy

    Where CFS is currently a WFQ based scheduler with only a single knob,
    the weight. The addition of a second, latency oriented parameter,
    makes something like WF2Q or EEVDF based a much better fit.

    Specifically, EEVDF does EDF like scheduling in the left half of the
    tree -- those entities that are owed service. Except because this is a
    virtual time scheduler, the deadlines are in virtual time as well,
    which is what allows over-subscription.

    EEVDF has two parameters:

     - weight, or time-slope: which is mapped to nice just as before

     - request size, or slice length: which is used to compute
       the virtual deadline as: vd_i = ve_i + r_i/w_i

    Basically, by setting a smaller slice, the deadline will be earlier
    and the task will be more eligible and ran earlier.

    Tick driven preemption is driven by request/slice completion; while
    wakeup preemption is driven by the deadline.

    Because the tree is now effectively an interval tree, and the
    selection is no longer 'leftmost', over-scheduling is less of a
    problem.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230531124603.931005524@infradead.org

 include/linux/sched.h   |   4 +
 kernel/sched/core.c     |   1 +
 kernel/sched/debug.c    |   6 +-
 kernel/sched/fair.c     | 338 +++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/features.h |   3 +
 kernel/sched/sched.h    |   4 +-
 6 files changed, 308 insertions(+), 48 deletions(-)
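To read the quoted virtual-deadline formula with made-up numbers: two eligible tasks at the same virtual time ve = 100 ms with equal weight w = 1, one requesting a 3 ms slice and the other a 0.7 ms slice, get vd = 100 + 3/1 = 103 ms and vd = 100 + 0.7/1 = 100.7 ms respectively, so the shorter-slice task is picked first.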

While the scheduler change seems to be the original reason for the performance change, there is also the effect of systemd.unified_cgroup_hierarchy=0. After working on this intensely through January and February, I haven't been able to get back to it in March. I still hope to return to it at some point.

I did observe this:

NOT Present: systemd.unified_cgroup_hierarchy=0
(yes, seems backwards.)
doug@s19:~$ cat /proc/cgroups | column -t
#subsys_name  hierarchy  num_cgroups  enabled
cpuset        0          73           1
cpu           0          73           1
cpuacct       0          73           1
blkio         0          73           1
memory        0          73           1
devices       0          73           1
freezer       0          73           1
net_cls       0          73           1
perf_event    0          73           1
net_prio      0          73           1
hugetlb       0          73           1
pids          0          73           1
rdma          0          73           1
misc          0          73           1

and

systemd.unified_cgroup_hierarchy=0

doug@s19:~$ cat /proc/cgroups | column -t
#subsys_name  hierarchy  num_cgroups  enabled
cpuset        8          1            1
cpu           7          1            1
cpuacct       7          1            1
blkio         9          1            1
memory        2          72           1
devices       4          32           1
freezer       13         1            1
net_cls       6          1            1
perf_event    3          1            1
net_prio      6          1            1
hugetlb       5          1            1
pids          11         36           1
rdma          10         1            1
misc          12         1            1
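As an additional sanity check (not something posted above), a quick way to confirm which cgroup layout a given boot ended up with is to look at what is mounted at /sys/fs/cgroup:

# cgroup v2 (unified hierarchy) reports cgroup2fs here; the v1-style layout
# forced by systemd.unified_cgroup_hierarchy=0 shows a tmpfs with the
# individual controllers mounted underneath it
stat -fc %T /sys/fs/cgroup/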