Thanks for the suggested "how to"s. While I have used trace a lot, I have never used ftrace before. I could not find funcgraph-perf, but I did end up doing a git clone of perf-tools, which gave me funcgraph, which seems to do what is desired. As a side note, I prefer to use kernel utilities directly, and have been working towards eliminating the need for funcgraph.
When using trace, and now ftrace, and in an attempt to generate a lot less data, I use a different test: a single ping-pong pair with forced cross-core CPU affinity. Even so, and as Andrea had warned, there is a ton of trace data generated. I ended up using this command:
sudo ./funcgraph -d 0.001 -m5 schedule > ../../24-04-sched-02.txt
where I had previously allocated a ton of memory to the trace buffer:
echo 1500000 | sudo tee /sys/kernel/debug/tracing/buffer_size_kb
And I had also previously set a CPU mask to trace only one of the 2 CPUs used for the test:
root@s19:/sys/kernel/debug/tracing# echo 010 > tracing_cpumask
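For reference, the raw ftrace sequence I have been working towards instead of funcgraph looks roughly like this. It is only a sketch: it assumes tracefs is mounted under /sys/kernel/debug/tracing, and that funcgraph's -m5 and -d 0.001 map onto max_graph_depth and a short capture window:

cd /sys/kernel/debug/tracing
echo 1500000 > buffer_size_kb            # big ring buffer, as above
echo 010 > tracing_cpumask               # only trace the CPU under test
echo schedule > set_graph_function       # only graph calls beneath schedule()
echo 5 > max_graph_depth                 # roughly funcgraph -m5
echo function_graph > current_tracer
echo 1 > tracing_on; sleep 0.001; echo 0 > tracing_on    # roughly funcgraph -d 0.001
cat trace > ../../24-04-sched-02.txt
echo nop > current_tracer                # put things back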
Anyway, it was revealed that a call to schedule during read was the majority of the problem. Drilling down into schedule reveals that on 20.04 dequeue_entity is called once, whereas on 24.04 it is called 3 times. Also, put_prev_entity is called 3 times on 24.04 versus 1 or 2 times on 20.04.
Example listings follow:
20.04:
 4)               |    dequeue_entity() {
 4)               |      update_curr() {
 4)   0.097 us    |        __calc_delta.constprop.0();
 4)   0.089 us    |        update_min_vruntime();
 4)   0.439 us    |      }
 4)   0.100 us    |      __update_load_avg_se();
 4)   0.092 us    |      __update_load_avg_cfs_rq();
 4)   0.098 us    |      avg_vruntime();
 4)   0.092 us    |      __calc_delta.constprop.0();
 4)               |      update_cfs_group() {
 4)   0.100 us    |        reweight_entity();
 4)   0.277 us    |      }
 4)   0.092 us    |      update_min_vruntime();
 4)   1.855 us    |    }
 4)   0.090 us    |    hrtick_update();
 4)   4.026 us    |  }
24.04:
 4)               |    dequeue_entity() {
 4)               |      update_curr() {
 4)   0.097 us    |        __calc_delta.constprop.0();
 4)   0.091 us    |        update_min_vruntime();
 4)   0.448 us    |      }
 4)   0.101 us    |      __update_load_avg_se();
 4)   0.096 us    |      __update_load_avg_cfs_rq();
 4)   0.097 us    |      avg_vruntime();
 4)   0.094 us    |      __calc_delta.constprop.0();
 4)               |      update_cfs_group() {
 4)   0.115 us    |        reweight_entity();
 4)   0.296 us    |      }
 4)   0.092 us    |      update_min_vruntime();
 4)   1.890 us    |    }
 4)               |    dequeue_entity() {
 4)               |      update_curr() {
 4)   0.094 us    |        __calc_delta.constprop.0();
 4)   0.093 us    |        update_min_vruntime();
 4)   0.458 us    |      }
 4)   0.096 us    |      __update_load_avg_se();
 4)   0.096 us    |      __update_load_avg_cfs_rq();
 4)   0.101 us    |      avg_vruntime();
 4)   0.094 us    |      __calc_delta.constprop.0();
 4)               |      update_cfs_group() {
 4)   0.104 us    |        reweight_entity();
 4)   0.281 us    |      }
 4)   0.095 us    |      update_min_vruntime();
 4)   1.951 us    |    }
 4)               |    dequeue_entity() {
 4)               |      update_curr() {
 4)   0.096 us    |        __calc_delta.constprop.0();
 4)   0.092 us    |        update_min_vruntime();
 4)   0.461 us    |      }
 4)   0.096 us    |      __update_load_avg_se();
 4)   0.097 us    |      __update_load_avg_cfs_rq();
 4)   0.101 us    |      avg_vruntime();
 4)   0.096 us    |      __calc_delta.constprop.0();
 4)               |      update_cfs_group() {
 4)   0.117 us    |        reweight_entity();
 4)   0.298 us    |      }
 4)   0.089 us    |      update_min_vruntime();
 4)   1.914 us    |    }
 4)   0.092 us    |    hrtick_update();
 4)   8.152 us    |  }
20.04:
 4)               |    put_prev_task_fair() {
 4)               |      put_prev_entity() {
 4)   0.089 us    |        check_cfs_rq_runtime();
 4)   0.274 us    |      }
 4)               |      put_prev_entity() {
 4)   0.088 us    |        check_cfs_rq_runtime();
 4)   0.255 us    |      }
 4)   0.795 us    |    }
 4)               |    pick_next_task_idle() {
 4)               |      __update_idle_core() {
 4)   0.086 us    |        __rcu_read_lock();
 4)   0.094 us    |        __rcu_read_unlock();
 4)   0.420 us    |      }
 4)   0.591 us    |    }
 4)   2.516 us    |  }
24.04 (oh, it’s 4 calls this time):
 4)               |    put_prev_task_fair() {
 4)               |      put_prev_entity() {
 4)   0.094 us    |        check_cfs_rq_runtime();
 4)   0.269 us    |      }
 4)               |      put_prev_entity() {
 4)   0.093 us    |        check_cfs_rq_runtime();
 4)   0.263 us    |      }
 4)               |      put_prev_entity() {
 4)   0.090 us    |        check_cfs_rq_runtime();
 4)   0.259 us    |      }
 4)               |      put_prev_entity() {
 4)   0.092 us    |        check_cfs_rq_runtime();
 4)   0.263 us    |      }
 4)   1.477 us    |    }
 4)               |    pick_next_task_idle() {
 4)               |      __update_idle_core() {
 4)   0.090 us    |        __rcu_read_lock();
 4)   0.090 us    |        __rcu_read_unlock();
 4)   0.427 us    |      }
 4)   0.600 us    |    }
 4)   3.244 us    |  }
So, in the fair.c code:
/*
 * The dequeue_task method is called before nr_running is
 * decreased. We remove the task from the rbtree and
 * update the fair scheduling stats:
 */
static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &p->se;
        int task_sleep = flags & DEQUEUE_SLEEP;
        int idle_h_nr_running = task_has_idle_policy(p);
        bool was_sched_idle = sched_idle_rq(rq);

        util_est_dequeue(&rq->cfs, p);

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                dequeue_entity(cfs_rq, se, flags);

                cfs_rq->h_nr_running--;
                ...
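If I am reading it right, the repeated dequeue_entity() calls come from that for_each_sched_entity() loop, which with CONFIG_FAIR_GROUP_SCHED walks up the sched_entity parent chain, i.e. one dequeue_entity() per level of hierarchy the task sits in. From fair.c (quoting a recent tree, so it may differ slightly between the two kernels):

#ifdef CONFIG_FAIR_GROUP_SCHED
/* Walk up scheduling entities hierarchy */
#define for_each_sched_entity(se) \
                for (; se; se = se->parent)
#else
#define for_each_sched_entity(se) \
                for (; se; se = NULL)
#endif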
So it seems there are 2 more sched_entity levels (2 more for_each_sched_entity(se) iterations) on 24.04 versus 20.04. I don't know why. I haven't looked at put_prev_entity yet, but it is likely something similar.
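For what it's worth, put_prev_task_fair() walks the same hierarchy, which would fit the matching pattern in the listings above; again quoting a recent fair.c, so the exact shape may differ between the two kernels:

/*
 * Account for a descheduled task:
 */
static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
{
        struct sched_entity *se = &prev->se;
        struct cfs_rq *cfs_rq;

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                put_prev_entity(cfs_rq, se);
        }
}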