24.04 considerably slower than 20.04 or 22.04 for some high system percentage usage cases

The kernel compiled with the configuration of my previous post made no difference with respect to the throughput degradation issue between 20.04 and 24.04.

I also tried disabling AppArmor on both 20.04 and 24.04.

24.04 is installed on an external SSD via a SATA-to-USB adapter; 20.04 is installed on an internal NVMe drive. I tried mounting the 24.04 drive while running 20.04, and it made no difference. Note that there is no drive I/O during the test anyhow: the ping-pong token is passed via named pipes in /dev/shm.
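Conceptually, one ping-pong pair works like the following shell sketch (illustration only; the real test is a dedicated program, the file names and loop count here are made up, and a shell version is far too slow to benchmark with):

# one side bounces the token back; the other side times the round trips
$ mkfifo /dev/shm/ping /dev/shm/pong
$ ( for i in $(seq 10000); do read -r tok < /dev/shm/ping; echo "$tok" > /dev/shm/pong; done ) &
$ time for i in $(seq 10000); do echo x > /dev/shm/ping; read -r tok < /dev/shm/pong; done
$ wait; rm /dev/shm/ping /dev/shm/pong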

You just need to boot with “cgroup_disable=memory”; that is likely the option with the most impact on performance, and you don’t have to recompile the kernel. Also, in your config diff you haven’t disabled the memory cgroup subsystem.
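For example, on a stock Ubuntu install the usual way is roughly this (assuming GRUB):

# append cgroup_disable=memory to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then:
$ sudo update-grub
$ sudo reboot
# after the reboot, confirm the parameter took effect:
$ cat /proc/cmdline
$ sudo dmesg | grep -i cgroup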

You are aware that a USB-attached disk is significantly slower (due to USB I/O throughput limits, mechanical slowdown if it is a rotary disk, etc.) than a directly PCIe-attached NVMe, right?

You should really run both OSes from the same disk setup when trying to compare… do a 20.04 install on the USB disk and check the speed…

Thanks for the “how to”. The cgroup_disable=memory made no difference. I did confirm the directive worked:

2024-02-02T15:02:41.540483+00:00 s19 kernel: cgroup: Disabling memory control group subsystem

Thanks for chiming in. Yes, I am aware of the differences, but it was/is beyond my comprehension why it would matter, since there is no disk I/O involved in the test. Anyway, as we are running out of ideas, I’ll try the test a different way, using another NVMe drive. It’ll take me a few days.

Ah… I totally missed that one distro is on the internal NVMe and the other is on an external SSD over USB. You may not do I/O directly, but your libraries, binaries, etc. may be paged out, and if they page fault later on while you run the test, that would cause I/O and could definitely affect performance. Maybe keep track of the I/O activity while your test is running.
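Something as simple as the following in a second terminal during a run should show whether any block I/O or swap-in is happening (iostat comes from the sysstat package):

$ vmstat 5
$ iostat -x 5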

There is very little I/O during these tests. Here is some vmstat monitoring during test runs:

doug@s19:~/idle/perf/results/20-24-compare/no-hwp/teo/pp40$ cat io-example.txt
2024.02.03 shows pretty much no I/O during the test:

OS: 20.04

doug@s19:~/c$ vmstat --unit M -n 15
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0  27257   2129   2173    0    0     6    22   16   70  0  2 98  0  0
 0  0      0  27257   2129   2173    0    0     0     1   55   77  0  0 100  0  0
 0  0      0  27257   2129   2173    0    0     0     0   55   89  0  0 100  0  0
 0  0      0  27257   2129   2173    0    0     0     0   53   73  0  0 100  0  0
42  0      0  27254   2129   2173    0    0     0     0 2911 4469020  8 87  4  0  0
46  0      0  27254   2129   2173    0    0     0     0 3055 4697276  9 91  0  0  0
40  0      0  27254   2129   2173    0    0     0     0 3066 4690203  9 91  0  0  0
42  0      0  27254   2129   2173    0    0     0     0 3014 4690198  8 92  0  0  0
40  0      0  27254   2129   2173    0    0     0     0 3039 4695277  9 91  0  0  0
40  0      0  27254   2129   2173    0    0     0     0 3041 4705429  8 92  0  0  0
42  0      0  27254   2129   2173    0    0     0     0 3015 4702999  9 91  0  0  0
43  0      0  27254   2129   2173    0    0     0     0 3046 4704605  8 92  0  0  0
46  0      0  27254   2129   2173    0    0     0     0 3074 4695933  9 91  0  0  0
45  0      0  27254   2129   2173    0    0     0     0 3015 4707250  9 91  0  0  0
46  0      0  27254   2129   2173    0    0     0     0 3039 4706756  9 91  0  0  0
42  0      0  27254   2129   2173    0    0     0     0 3047 4700811  9 91  0  0  0
42  0      0  27254   2129   2173    0    0     0     0 3014 4705889  8 92  0  0  0
46  0      0  27254   2129   2173    0    0     0     1 3041 4702034  9 91  0  0  0
44  0      0  27254   2129   2173    0    0     0     0 3043 4698406  9 91  0  0  0
42  0      0  27254   2129   2173    0    0     0     0 3022 4702671  9 91  0  0  0
41  0      0  27254   2129   2173    0    0     0     0 3039 4707017  9 91  0  0  0
42  0      0  27254   2129   2173    0    0     0     0 3042 4702490  8 92  0  0  0
45  0      0  27254   2129   2173    0    0     0     0 3016 4699933  9 91  0  0  0
42  0      0  27254   2129   2173    0    0     0     0 3057 4696263  9 91  0  0  0
41  0      0  27254   2129   2173    0    0     0     0 3042 4694493  9 91  0  0  0
44  0      0  27254   2129   2173    0    0     0     0 3015 4701169  9 91  0  0  0
41  0      0  27254   2129   2173    0    0     0     0 3040 4702579  8 92  0  0  0
45  0      0  27254   2129   2173    0    0     0     0 3049 4702101  9 91  0  0  0
42  0      0  27254   2129   2173    0    0     0     0 3015 4704366  9 91  0  0  0
41  0      0  27254   2129   2173    0    0     0     0 3042 4699218  8 92  0  0  0
44  0      0  27254   2129   2173    0    0     0     0 3041 4692234  9 91  0  0  0
43  0      0  27254   2129   2173    0    0     0     0 3022 4696061  9 91  0  0  0
42  0      0  27254   2129   2173    0    0     0     0 3042 4701076  9 91  0  0  0
45  0      0  27254   2129   2173    0    0     0     0 3042 4702727  9 91  0  0  0
39  0      0  27254   2129   2173    0    0     0     0 3017 4711968  9 91  0  0  0
35  0      0  27254   2129   2173    0    0     0     0 3150 4811568  9 91  0  0  0
24  0      0  27262   2129   2173    0    0     0     0 3293 5020920  8 92  0  0  0
 5  0      0  27269   2129   2173    0    0     0     0 97609 4852769  7 76 17  0  0
 0  0      0  27277   2129   2173    0    0     0     0 12045 1692706  1 12 87  0  0
 0  0      0  27277   2129   2173    0    0     0     0   56   88  0  0 100  0  0
 0  0      0  27277   2129   2173    0    0     0     0   21   38  0  0 100  0  0


OS: 24.04

doug@s19:/media/nvme/home/doug/idle/perf/results/20-24-compare/no-hwp/teo/pp40$ vmstat --unit M -n 15
procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu-------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st gu
 0  0      0  31235     75    356    0    0   874    56  669    0  0  0 100  0  0  0
 0  0      0  31235     75    356    0    0     0     0   49   73  0  0 100  0  0  0
 0  0      0  31235     75    356    0    0     0     4   56   85  0  0 100  0  0  0
 0  0      0  31235     75    356    0    0     0     0   21   35  0  0 100  0  0  0
41  0      0  31225     75    357    0    0    56    12 2827 3609278  7 85  8  0  0  0
41  0      0  31225     75    357    0    0     0     1 3043 3911709  7 93  0  0  0  0
41  0      0  31225     75    357    0    0     0     1 3014 3915416  7 93  0  0  0  0
47  0      0  31225     75    357    0    0     0     0 3039 3912586  7 93  0  0  0  0
45  0      0  31225     75    357    0    0     0     0 3048 3912526  7 93  0  0  0  0
44  0      0  31225     75    357    0    0     0     1 3014 3914000  7 93  0  0  0  0
44  0      0  31225     75    357    0    0     0     0 3040 3911553  8 92  0  0  0  0
40  0      0  31225     75    357    0    0     0     0 3038 3911487  7 93  0  0  0  0
40  0      0  31225     75    357    0    0     0     0 3173 3915021  7 93  0  0  0  0
46  0      0  31225     75    357    0    0     0     1 3913 3911987  6 94  0  0  0  0
42  0      0  31225     75    357    0    0     0     0 3515 3913231  7 93  0  0  0  0
43  0      0  31225     75    357    0    0     0     0 3075 3915327  8 92  0  0  0  0
40  0      0  31225     75    357    0    0     0     0 3039 3911330  7 93  0  0  0  0
44  0      0  31225     75    357    0    0     0     1 3050 3910741  7 93  0  0  0  0
43  0      0  31225     75    357    0    0     0     0 3016 3915956  7 93  0  0  0  0
43  0      0  31225     75    357    0    0     0     0 3037 3911342  7 93  0  0  0  0
43  0      0  31225     75    357    0    0     0     2 3044 3909436  8 92  0  0  0  0
44  0      0  31225     75    357    0    0     0     1 3023 3914635  7 93  0  0  0  0
42  0      0  31225     75    357    0    0     0     0 3042 3909317  7 93  0  0  0  0
44  0      0  31225     75    357    0    0     0     0 3038 3908718  7 93  0  0  0  0
42  0      0  31225     75    357    0    0     0     0 3014 3910827  7 93  0  0  0  0
46  0      0  31225     75    357    0    0     0     0 3049 3912835  7 93  0  0  0  0
40  0      0  31225     75    357    0    0     0     1 3044 3906396  8 92  0  0  0  0
43  0      0  31225     75    357    0    0     0     0 3012 3911503  8 92  0  0  0  0
42  0      0  31225     75    357    0    0     0     0 3041 3908481  7 93  0  0  0  0
40  0      0  31225     75    357    0    0     0     0 3048 3910543  7 93  0  0  0  0
43  0      0  31225     75    357    0    0     0     0 3014 3921357  7 93  0  0  0  0
43  0      0  31225     75    357    0    0     0     1 3041 3905356  7 93  0  0  0  0
40  0      0  31225     75    357    0    0     0     0 3040 3905252  8 92  0  0  0  0
46  0      0  31225     75    357    0    0     0     0 3022 3908858  8 92  0  0  0  0
46  0      0  31225     75    357    0    0     0     0 3040 3906519  7 93  0  0  0  0
42  0      0  31225     75    357    0    0     0     1 3040 3905895  8 92  0  0  0  0
42  0      0  31225     75    357    0    0     0     0 3015 3910260  7 93  0  0  0  0
40  0      0  31225     75    357    0    0     0     0 3048 3906123  7 93  0  0  0  0
42  0      0  31223     76    357    0    0    30    13 3070 3908430  7 93  0  0  0  0
42  0      0  31223     76    357    0    0     0     0 3013 3911228  7 93  0  0  0  0
34  0      0  31223     76    357    0    0     0     1 3047 3928800  7 93  0  0  0  0
31  0      0  31223     76    357    0    0     0     1 3054 4049397  7 93  0  0  0  0
30  0      0  31223     76    357    0    0     0     0 3019 4101428  7 93  0  0  0  0
18  0      0  31223     76    357    0    0     0     0 8242 4397009  7 93  0  0  0  0
 2  0      0  31223     76    357    0    0     0     0 41870 3875590  5 53 42  0  0  0
 0  0      0  31223     76    357    0    0     0     1 1696 1324425  1  9 91  0  0  0
 0  0      0  31223     76    357    0    0     0     0   49   74  0  0 100  0  0  0
 0  0      0  31223     76    357    0    0     0     1   50   74  0  0 100  0  0  0
 0  0      0  31223     76    357    0    0     0     0   46   69  0  0 100  0  0  0
 0  0      0  31223     76    357    0    0     0     0   57   90  0  0 100  0  0  0

OS: 20.04 via USB stick:

doug@s19:~$ vmstat --unit M -n 15
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0  31136     49    435    0    0    12     0  111  352  3 31 66  0  0
 0  0      0  31136     49    435    0    0     0     1   38   62  0  0 100  0  0
 0  0      0  31136     49    435    0    0     0     0   51   77  0  0 100  0  0
44  0      0  31125     49    435    0    0     0     0  101 27715  0  1 99  0  0
45  0      0  31125     49    435    0    0     0     0 3040 4568272  9 91  0  0  0
42  0      0  31125     49    435    0    0     0     0 3023 4576509  8 92  0  0  0
46  0      0  31125     49    435    0    0     0     0 3041 4568897  8 92  0  0  0
43  0      0  31125     49    435    0    0     0     0 3041 4572075  9 91  0  0  0
40  0      0  31125     49    435    0    0     0     0 3016 4575675  8 92  0  0  0
40  0      0  31125     49    435    0    0     0     0 3047 4568539  8 92  0  0  0
40  0      0  31125     49    435    0    0     0     0 3041 4572851  9 91  0  0  0
41  0      0  31125     49    435    0    0     0     0 3015 4578205  8 92  0  0  0
42  0      0  31125     49    435    0    0     0     0 3042 4574618  8 92  0  0  0
42  0      0  31125     49    435    0    0     0     0 3048 4575378  8 92  0  0  0
40  0      0  31125     49    435    0    0     0     0 3016 4570631  8 92  0  0  0
44  0      0  31125     49    435    0    0     0     0 3042 4576007  8 92  0  0  0
44  0      0  31125     49    435    0    0     0     0 3045 4575002  8 92  0  0  0
42  0      0  31125     49    435    0    0     0     1 3028 4578231  9 91  0  0  0
44  0      0  31125     49    435    0    0     0     0 3041 4570888  8 92  0  0  0
41  0      0  31125     49    435    0    0     0     0 3041 4570848  8 92  0  0  0
44  0      0  31125     49    435    0    0     0     0 3016 4573795  8 92  0  0  0
42  0      0  31125     49    435    0    0     0     0 3051 4572225  8 92  0  0  0
44  0      0  31118     49    435    0    0     0    20 3060 4565321  8 92  0  0  0
44  0      0  31118     49    435    0    0     0     4 3020 4574698  8 92  0  0  0
44  0      0  31118     49    435    0    0     0     0 3039 4570885  9 91  0  0  0
42  0      0  31118     49    435    0    0     0     0 3049 4567227  8 92  0  0  0
45  0      0  31118     49    435    0    0     0     0 3016 4575568  8 92  0  0  0
40  0      0  31118     49    435    0    0     0     0 3043 4573230  9 91  0  0  0
45  0      0  31118     49    435    0    0     0     0 3043 4574706  8 92  0  0  0
42  0      0  31118     49    435    0    0     0     0 3023 4569284  8 92  0  0  0
42  0      0  31118     49    435    0    0     0     0 3041 4571018  9 91  0  0  0
40  0      0  31118     49    435    0    0     0     0 3041 4571103  8 92  0  0  0
42  0      0  31118     49    435    0    0     0     0 3015 4571553  8 92  0  0  0
40  0      0  31118     49    435    0    0     0     0 3039 4569022  8 92  0  0  0
44  0      0  31118     49    435    0    0     0     0 3048 4574169  8 92  0  0  0
36  0      0  31118     49    435    0    0     0     0 3023 4598446  9 91  0  0  0
32  0      0  31118     49    435    0    0     0     0 3116 4728334  8 92  0  0  0
25  0      0  31118     49    435    0    0     0     0 3093 4830644  8 92  0  0  0
 4  0      0  31141     49    435    0    0     0     0 54657 4599783  6 68 26  0  0
 0  0      0  31141     49    435    0    0     0     0 1661 1143147  1  7 92  0  0
 0  0      0  31141     49    435    0    0     0     0   49   75  0  0 100  0  0
 0  0      0  31141     49    435    0    0     0     0   24   42  0  0 100  0  0

OS: 24.04 internal nvme drive

doug@s19:/media/nvme/home/doug/idle/perf/results/20-24-compare/no-hwp/teo/pp40$ vmstat --unit M -n 15
procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu-------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st gu
 1  0      0  31041    112    500    0    0    26     5 1105  898  2 24 74  0  0  0
 0  0      0  31040    112    500    0    0     0     3   51   62  0  0 100  0  0  0
 0  0      0  31039    112    500    0    0     0     1   67  103  0  0 100  0  0  0
 0  0      0  31039    112    500    0    0     0     2   49   66  0  0 100  0  0  0
 0  0      0  31039    112    500    0    0     0     6   18   27  0  0 100  0  0  0
 0  0      0  31039    112    500    0    0     0     3   49   66  0  0 100  0  0  0
 0  0      0  31039    112    500    0    0     0     0   53   79  0  0 100  0  0  0
43  0      0  31029    112    500    0    0     0     1 2341 3279040  6 71 23  0  0  0
41  0      0  31029    112    500    0    0     0     1 3039 4271810  8 92  0  0  0  0
42  0      0  31029    112    500    0    0     0     0 3039 4271056  8 92  0  0  0  0
42  0      0  31029    112    500    0    0     0     0 3021 4275941  8 92  0  0  0  0
43  0      0  31029    112    500    0    0     0     0 3040 4272622  8 92  0  0  0  0
43  0      0  31029    112    500    0    0     0     0 3041 4271104  8 92  0  0  0  0
42  0      0  31029    112    500    0    0     0     1 3013 4274997  8 92  0  0  0  0
41  0      0  31029    112    500    0    0     0     0 3051 4289365  8 92  0  0  0  0
42  0      0  31029    112    500    0    0     0     0 3040 4287240  8 92  0  0  0  0
43  0      0  31029    112    500    0    0     0     0 3013 4277850  8 92  0  0  0  0
42  0      0  31029    112    500    0    0     0     1 3039 4277966  8 92  0  0  0  0
42  0      0  31029    112    500    0    0     0     0 3039 4274327  8 92  0  0  0  0
46  0      0  31029    112    500    0    0     0     1 3022 4276595  8 92  0  0  0  0
40  0      0  31029    112    500    0    0     0     0 3040 4274199  8 92  0  0  0  0
43  0      0  31029    112    500    0    0     0     0 3039 4276948  8 92  0  0  0  0
45  0      0  31029    112    500    0    0     0     0 3016 4279941  8 92  0  0  0  0
40  0      0  31029    112    500    0    0     0     1 3046 4273451  8 92  0  0  0  0
42  0      0  31029    112    500    0    0     0     0 3039 4281308  8 92  0  0  0  0
42  0      0  31029    112    500    0    0     0     0 3013 4273600  8 92  0  0  0  0
44  0      0  31029    112    500    0    0     0     0 3040 4272102  8 92  0  0  0  0
43  0      0  31029    112    500    0    0     0     0 3047 4274355  8 92  0  0  0  0
40  0      0  31029    112    500    0    0     0     1 3013 4278047  8 92  0  0  0  0
45  0      0  31029    112    500    0    0     0     0 3042 4275542  8 92  0  0  0  0
45  0      0  31029    112    500    0    0     0     0 3039 4277716  8 92  0  0  0  0
41  0      0  31029    112    500    0    0     0     0 3021 4280635  8 92  0  0  0  0
44  0      0  31029    112    500    0    0     0     0 3040 4277454  8 92  0  0  0  0
45  0      0  31029    112    500    0    0     0     1 3039 4276449  8 92  0  0  0  0
42  0      0  31029    112    500    0    0     0     0 3014 4280708  8 92  0  0  0  0
45  0      0  31029    112    500    0    0     0     0 3047 4276838  8 92  0  0  0  0
42  0      0  31029    112    500    0    0     0     0 3039 4275901  8 92  0  0  0  0
48  0      0  31029    112    500    0    0     0     1 3015 4277549  8 92  0  0  0  0
48  0      0  31029    112    500    0    0     0     0 3039 4276942  8 92  0  0  0  0
42  0      0  31029    112    500    0    0     0     0 3047 4276270  8 92  0  0  0  0
38  0      0  31029    112    500    0    0     0     0 3016 4314929  8 92  0  0  0  0
32  0      0  31029    112    500    0    0     0     0 3042 4458087  7 93  0  0  0  0
27  0      0  31029    112    500    0    0     0     1 3044 4576553  7 93  0  0  0  0
15  0      0  31029    112    500    0    0     0     0 9537 4837276  7 93  0  0  0  0
 0  0      0  31029    112    500    0    0     0     0 36472 2771113  3 32 65  0  0  0
 0  0      0  31029    112    500    0    0     0     0   45   62  0  0 100  0  0  0
 0  0      0  31029    112    500    0    0     0     0   18   27  0  0 100  0  0  0
 0  0      0  31029    112    500    0    0     0     1   55   80  0  0 100  0  0  0
 0  0      0  31029    112    500    0    0     0     0   46   62  0  0 100  0  0  0

I did several tests for boot drive location: original 20.04 on the internal NVMe drive; 22.04 on an external NVMe via USB; 24.04 on an external SSD via USB; 22.04 on the internal NVMe; 20.04 on an external USB stick; 24.04 on the internal NVMe.
The test is 40 ping-pong pairs, no work per token stop, 30 million loops. The numbers are microseconds per loop; lower is better.
A subset of the comparison data, kernel 6.6.0-14:

More cgroup-related boot options in use: cgroup_disable=memory cgroup_disable=pressure cgroup_no_v1=all

20.04 nvme: individual runs that result in the "master" average (14.77114333):
1: Samples: 80  ; Ave: 14.79884  ; Var:  0.17892  ; S Dev:  0.42298 ; Min: 13.91090 ; Max: 15.44590 ; Range:  1.53500 ; Comp to ref:   0.19%
2: Samples: 80  ; Ave: 14.78155  ; Var:  0.18818  ; S Dev:  0.43380 ; Min: 13.70930 ; Max: 15.49300 ; Range:  1.78370 ; Comp to ref:   0.07%
3: Samples: 80  ; Ave: 14.73304  ; Var:  0.23261  ; S Dev:  0.48229 ; Min: 13.71890 ; Max: 15.50980 ; Range:  1.79090 ; Comp to ref:  -0.26%

20.04 New install on a USB stick:
1: Samples: 80  ; Ave: 15.10526  ; Var:  0.14925  ; S Dev:  0.38633 ; Min: 14.26160 ; Max: 15.69700 ; Range:  1.43540 ; Comp to ref:   2.26%
2: Samples: 80  ; Ave: 15.09465  ; Var:  0.19885  ; S Dev:  0.44593 ; Min: 14.03570 ; Max: 15.85620 ; Range:  1.82050 ; Comp to ref:   2.19%
3: Samples: 80  ; Ave: 15.10710  ; Var:  0.22474  ; S Dev:  0.47406 ; Min: 14.07080 ; Max: 15.75000 ; Range:  1.67920 ; Comp to ref:   2.27%
I had forgotten to disable the thermald service (I use the TCC offset method)
4: Samples: 80  ; Ave: 15.01319  ; Var:  0.22550  ; S Dev:  0.47487 ; Min: 13.97020 ; Max: 15.82430 ; Range:  1.85410 ; Comp to ref:   1.64%

22.04 nvme:
Samples: 80  ; Ave: 15.14745  ; Var:  0.18045  ; S Dev:  0.42479 ; Min: 14.22450 ; Max: 15.82330 ; Range:  1.59880 ; Comp to ref:   2.55%
Samples: 80  ; Ave: 15.06756  ; Var:  0.20224  ; S Dev:  0.44971 ; Min: 13.93580 ; Max: 15.77430 ; Range:  1.83850 ; Comp to ref:   2.01%
Samples: 80  ; Ave: 14.96509  ; Var:  0.30379  ; S Dev:  0.55117 ; Min: 13.78340 ; Max: 15.85540 ; Range:  2.07200 ; Comp to ref:   1.31%
Samples: 80  ; Ave: 15.01513  ; Var:  0.30084  ; S Dev:  0.54849 ; Min: 14.05970 ; Max: 15.91010 ; Range:  1.85040 ; Comp to ref:   1.65%

24.04 external SSD via USB:
Samples: 80  ; Ave: 17.82366  ; Var:  0.47915  ; S Dev:  0.69220 ; Min: 16.54780 ; Max: 19.01780 ; Range:  2.47000 ; Comp to ref:  20.67%
Samples: 80  ; Ave: 17.87829  ; Var:  0.32942  ; S Dev:  0.57395 ; Min: 16.90720 ; Max: 18.88390 ; Range:  1.97670 ; Comp to ref:  21.04%
Samples: 80  ; Ave: 17.83994  ; Var:  0.53958  ; S Dev:  0.73456 ; Min: 16.21230 ; Max: 18.90000 ; Range:  2.68770 ; Comp to ref:  20.78%
Samples: 80  ; Ave: 17.77135  ; Var:  0.51784  ; S Dev:  0.71961 ; Min: 16.34050 ; Max: 18.93030 ; Range:  2.58980 ; Comp to ref:  20.31%

24.04 nvme:
Samples: 80  ; Ave: 17.77688  ; Var:  0.63086  ; S Dev:  0.79427 ; Min: 15.66420 ; Max: 18.95460 ; Range:  3.29040 ; Comp to ref:  20.35%
Samples: 80  ; Ave: 17.87418  ; Var:  0.60490  ; S Dev:  0.77775 ; Min: 15.75700 ; Max: 18.97980 ; Range:  3.22280 ; Comp to ref:  21.01%
Samples: 80  ; Ave: 18.15530  ; Var:  0.23747  ; S Dev:  0.48731 ; Min: 16.56970 ; Max: 18.82590 ; Range:  2.25620 ; Comp to ref:  22.91%
Samples: 80  ; Ave: 17.67824  ; Var:  0.60164  ; S Dev:  0.77566 ; Min: 16.15170 ; Max: 18.92810 ; Range:  2.77640 ; Comp to ref:  19.68%
Samples: 80  ; Ave: 17.78472  ; Var:  0.49599  ; S Dev:  0.70426 ; Min: 16.34030 ; Max: 18.83070 ; Range:  2.49040 ; Comp to ref:  20.40%
Samples: 80  ; Ave: 17.50397  ; Var:  0.57074  ; S Dev:  0.75548 ; Min: 16.68300 ; Max: 18.76380 ; Range:  2.08080 ; Comp to ref:  18.50%
Samples: 80  ; Ave: 17.83500  ; Var:  0.51942  ; S Dev:  0.72071 ; Min: 16.43420 ; Max: 18.75670 ; Range:  2.32250 ; Comp to ref:  20.74%
Samples: 80  ; Ave: 17.86489  ; Var:  0.46474  ; S Dev:  0.68172 ; Min: 16.47570 ; Max: 18.82470 ; Range:  2.34900 ; Comp to ref:  20.94%

From what I see in the syscall trace and vmstat, the 20.04 case is doing more syscalls than the 24.04 case, confirmed by the bigger counters and also by the bigger user CPU usage %; therefore it performs better.

Now the point is: why is it doing more syscalls? Considering that the CPU time is either user or sys (no idle, no waiting for I/O), I would expect the individual sys_(write|read) invocations to show a larger time distribution in the slower case.

To make sure this is the case, I would look at this first (20.04 vs 24.04 while the test is running):

$ sudo funclatency-bpfcc "__x64_sys_(write|read)"

The time distribution should confirm that read/write syscalls are actually more expensive in one case vs the other.

If that’s the case, we can go deeper and try to figure out why they’re more expensive. For this we could use ftrace to check all the functions that sys_write|read are calling. For example, let’s focus on the writes first:

$ sudo funcgraph-perf -m5 __x64_sys_write

This is going to spit out a lot of data, so tune -m accordingly; maybe start with -m2, for example, and increase it to go deeper into the call graph. See if you notice anything different in the call graph, like one case entering a code path that the other doesn’t have. Do this for both reads and writes.

Then, if you notice different code paths, looking at the kernel function names we should be able to figure out what is causing the difference (e.g., a different user-space setting enforced by systemd, a different config in /etc, or something else).

Thanks for the suggested “how to”s. While I have used trace a lot, I have never used ftrace before.

I could not find funcgraph-perf, but I did end up doing a git clone of perf-tools, which gave me funcgraph, which seems to do what is desired. As a side note, I prefer to use the kernel utilities directly, and have been working towards eliminating the need for funcgraph.

When using trace, and now ftrace, and in an attempt to generate a lot less data, I used a different test: a single ping-pong pair with forced cross-core CPU affinity. Even so, and as Andrea had warned, there is a ton of trace data generated. I ended up using this command:

sudo ./funcgraph -d 0.001 -m5 schedule > ../../24-04-sched-02.txt

where I had previously allocated a ton of memory to the trace buffer:

echo 1500000 | sudo tee /sys/kernel/debug/tracing/buffer_size_kb

And I also previously set a CPU mask to trace only one of the 2 CPUs used for the test:

root@s19:/sys/kernel/debug/tracing# echo 010 > tracing_cpumask
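For reference, my understanding is that roughly the same capture can be driven with the tracefs files directly, something like the sketch below (not what I actually ran, since I used funcgraph, which does additional housekeeping; the output path is made up):

# as root, in /sys/kernel/debug/tracing:
echo 1500000 > buffer_size_kb
echo 010 > tracing_cpumask
echo schedule > set_graph_function
echo 5 > max_graph_depth             # the -m5 equivalent
echo function_graph > current_tracer
echo 1 > tracing_on
sleep 0.001                          # the -d 0.001 capture window
echo 0 > tracing_on
cat trace > /tmp/24-04-sched.txt
echo nop > current_tracer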

Anyway, it was revealed that a call to schedule during read was the majority of the problem. Drilling down into schedule reveals that on 20.04 dequeue_entity is called once, whereas on 24.04 it is called 3 times. Also, put_prev_entity is called 3 times on 24.04 versus 1 or 2 times on 20.04.

Example listings follow:

20.04:

  4)               |      dequeue_entity() {
  4)               |        update_curr() {
  4)   0.097 us    |          __calc_delta.constprop.0();
  4)   0.089 us    |          update_min_vruntime();
  4)   0.439 us    |        }
  4)   0.100 us    |        __update_load_avg_se();
  4)   0.092 us    |        __update_load_avg_cfs_rq();
  4)   0.098 us    |        avg_vruntime();
  4)   0.092 us    |        __calc_delta.constprop.0();
  4)               |        update_cfs_group() {
  4)   0.100 us    |          reweight_entity();
  4)   0.277 us    |        }
  4)   0.092 us    |        update_min_vruntime();
  4)   1.855 us    |      }
  4)   0.090 us    |      hrtick_update();
  4)   4.026 us    |    }

24.04:

  4)               |      dequeue_entity() {
  4)               |        update_curr() {
  4)   0.097 us    |          __calc_delta.constprop.0();
  4)   0.091 us    |          update_min_vruntime();
  4)   0.448 us    |        }
  4)   0.101 us    |        __update_load_avg_se();
  4)   0.096 us    |        __update_load_avg_cfs_rq();
  4)   0.097 us    |        avg_vruntime();
  4)   0.094 us    |        __calc_delta.constprop.0();
  4)               |        update_cfs_group() {
  4)   0.115 us    |          reweight_entity();
  4)   0.296 us    |        }
  4)   0.092 us    |        update_min_vruntime();
  4)   1.890 us    |      }
  4)               |      dequeue_entity() {
  4)               |        update_curr() {
  4)   0.094 us    |          __calc_delta.constprop.0();
  4)   0.093 us    |          update_min_vruntime();
  4)   0.458 us    |        }
  4)   0.096 us    |        __update_load_avg_se();
  4)   0.096 us    |        __update_load_avg_cfs_rq();
  4)   0.101 us    |        avg_vruntime();
  4)   0.094 us    |        __calc_delta.constprop.0();
  4)               |        update_cfs_group() {
  4)   0.104 us    |          reweight_entity();
  4)   0.281 us    |        }
  4)   0.095 us    |        update_min_vruntime();
  4)   1.951 us    |      }
  4)               |      dequeue_entity() {
  4)               |        update_curr() {
  4)   0.096 us    |          __calc_delta.constprop.0();
  4)   0.092 us    |          update_min_vruntime();
  4)   0.461 us    |        }
  4)   0.096 us    |        __update_load_avg_se();
  4)   0.097 us    |        __update_load_avg_cfs_rq();
  4)   0.101 us    |        avg_vruntime();
  4)   0.096 us    |        __calc_delta.constprop.0();
  4)               |        update_cfs_group() {
  4)   0.117 us    |          reweight_entity();
  4)   0.298 us    |        }
  4)   0.089 us    |        update_min_vruntime();
  4)   1.914 us    |      }
  4)   0.092 us    |      hrtick_update();
  4)   8.152 us    |    }

20.04:

  4)               |      put_prev_task_fair() {
  4)               |        put_prev_entity() {
  4)   0.089 us    |          check_cfs_rq_runtime();
  4)   0.274 us    |        }
  4)               |        put_prev_entity() {
  4)   0.088 us    |          check_cfs_rq_runtime();
  4)   0.255 us    |        }
  4)   0.795 us    |      }
  4)               |      pick_next_task_idle() {
  4)               |        __update_idle_core() {
  4)   0.086 us    |          __rcu_read_lock();
  4)   0.094 us    |          __rcu_read_unlock();
  4)   0.420 us    |        }
  4)   0.591 us    |      }
  4)   2.516 us    |    }

24.04 (oh, it’s 4 calls this time):

  4)               |      put_prev_task_fair() {
  4)               |        put_prev_entity() {
  4)   0.094 us    |          check_cfs_rq_runtime();
  4)   0.269 us    |        }
  4)               |        put_prev_entity() {
  4)   0.093 us    |          check_cfs_rq_runtime();
  4)   0.263 us    |        }
  4)               |        put_prev_entity() {
  4)   0.090 us    |          check_cfs_rq_runtime();
  4)   0.259 us    |        }
  4)               |        put_prev_entity() {
  4)   0.092 us    |          check_cfs_rq_runtime();
  4)   0.263 us    |        }
  4)   1.477 us    |      }
  4)               |      pick_next_task_idle() {
  4)               |        __update_idle_core() {
  4)   0.090 us    |          __rcu_read_lock();
  4)   0.090 us    |          __rcu_read_unlock();
  4)   0.427 us    |        }
  4)   0.600 us    |      }
  4)   3.244 us    |    }

So, in the fair.c code:

/*
 * The dequeue_task method is called before nr_running is
 * decreased. We remove the task from the rbtree and
 * update the fair scheduling stats:
 */
static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &p->se;
        int task_sleep = flags & DEQUEUE_SLEEP;
        int idle_h_nr_running = task_has_idle_policy(p);
        bool was_sched_idle = sched_idle_rq(rq);

        util_est_dequeue(&rq->cfs, p);

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                dequeue_entity(cfs_rq, se, flags);

                cfs_rq->h_nr_running--;
...

It seems there are 2 more sched_entity (se) items in the for_each_sched_entity() walk on 24.04 versus 20.04. I don’t know why. I didn’t look at put_prev_entity yet, but it is likely something similar.

Hm… are you sure that cgroups are disabled? What does cat /proc/self/cgroup look like in the shell session where you start your benchmark?

You may have more sched entities due to a deeper cgroup hierarchy…
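As a quick check (a sketch; adjust the paths to whatever /proc/self/cgroup reports on each system), every cgroup level that has the cpu controller enabled adds a scheduling entity, as far as I understand it:

$ cat /proc/self/cgroup
$ cat /sys/fs/cgroup/cgroup.subtree_control
$ cat /sys/fs/cgroup/user.slice/cgroup.subtree_control
$ cat /sys/fs/cgroup/user.slice/user-1000.slice/cgroup.subtree_control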

I do not know if all cgroup stuff is disabled or not. I do see this in kern.log:

Feb 10 14:17:58 s19 kernel: [    0.041129] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.6.0-14-generic root=UUID=0ac356c1-caa9-4c2e-8229-4408bd998dbd ro ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp cgroup_disable=memory cgroup_disable=pressure cgroup_no_v1=all msr.allow_writes=on cpuidle.governor=teo
Feb 10 14:17:58 s19 kernel: [    0.041246] cgroup: Disabling memory control group subsystem
Feb 10 14:17:58 s19 kernel: [    0.041255] cgroup: Disabling pressure control group feature
Feb 10 14:17:58 s19 kernel: [    0.156690] cgroup: Disabling cpuset control group subsystem in v1 mounts
Feb 10 14:17:58 s19 kernel: [    0.156695] cgroup: Disabling cpu control group subsystem in v1 mounts
Feb 10 14:17:58 s19 kernel: [    0.156697] cgroup: Disabling cpuacct control group subsystem in v1 mounts
Feb 10 14:17:58 s19 kernel: [    0.156700] cgroup: Disabling io control group subsystem in v1 mounts
Feb 10 14:17:58 s19 kernel: [    0.156714] cgroup: Disabling devices control group subsystem in v1 mounts
Feb 10 14:17:58 s19 kernel: [    0.156716] cgroup: Disabling freezer control group subsystem in v1 mounts
Feb 10 14:17:58 s19 kernel: [    0.156718] cgroup: Disabling net_cls control group subsystem in v1 mounts
Feb 10 14:17:58 s19 kernel: [    0.156722] cgroup: Disabling perf_event control group subsystem in v1 mounts
Feb 10 14:17:58 s19 kernel: [    0.156724] cgroup: Disabling net_prio control group subsystem in v1 mounts
Feb 10 14:17:58 s19 kernel: [    0.156730] cgroup: Disabling hugetlb control group subsystem in v1 mounts
Feb 10 14:17:58 s19 kernel: [    0.156732] cgroup: Disabling pids control group subsystem in v1 mounts
Feb 10 14:17:58 s19 kernel: [    0.156734] cgroup: Disabling rdma control group subsystem in v1 mounts
Feb 10 14:17:58 s19 kernel: [    0.156735] cgroup: Disabling misc control group subsystem in v1 mounts

For 20.04 I get:

$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-1.scope

For 24.04 I get:

$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-1.scope

I can observe differences in this stuff, but I don’t know what might be relevant and what would be a waste of time digging deeper into.

For example, the cgroup features are the same between 20.04 and 24.04:

$ cat /sys/kernel/cgroup/features
nsdelegate
favordynmods
memory_localevents
memory_recursiveprot

But memory_recursiveprot doesn’t actually seem to be used on 20.04 (edited from outputs saved from $ grep . /proc/self/*):

20-proc-self.txt:/proc/self/mountinfo:34 25 0:29 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime shared:9 - cgroup2 cgroup2 rw,nsdelegate
20-proc-self.txt:/proc/self/mounts:cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0
24-cgroup-features.txt:nsdelegate
24-proc-self.txt:/proc/self/mountinfo:35 24 0:30 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime shared:10 - cgroup2 cgroup2 rw,nsdelegate,memory_recursiveprot
24-proc-self.txt:/proc/self/mounts:cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0

Is it relevant? I don’t know. How do I disable it in 24.04 for a test? I don’t know.

I have also been looking into differences with udev stuff.

memory_recursiveprot seems to be set by systemd:
https://github.com/systemd/systemd/blob/main/NEWS#L5433

And it seems to hard-code this option in the mount options for cgroupv2:
https://github.com/systemd/systemd/blob/main/src/shared/mount-setup.c#L104

Maybe we can try to convince systemd to use cgroup-v1? IIRC you need to boot with systemd.unified_cgroup_hierarchy=0.

Another thing that may be worth checking is whether the scheduler settings are different, considering the different behavior when calling put_prev_entity():

grep . /proc/sys/kernel/sched*

Okay, now we are getting somewhere. This is the grub command line:

GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp systemd.unified_cgroup_hierarchy=0 cgroup_disable=memory cgroup_disable=pressure cgroup_no_v1=all msr.allow_writes=on cpuidle.governor=teo"

and now on 24.04 I get:

Samples: 80  ; Ave: 15.00202  ; Var:  0.20312  ; S Dev:  0.45069 ; Min: 14.05610 ; Max: 15.70310 ; Range:  1.64700 ; Comp to ref:   1.56%

Instead of the ~20% worse results from the earlier post.

It really seems to be cgroup related then. Have you tried the same boot options also on 20.04?

The new “master” reference average, used in an earlier post, was from 20.04 on the internal NVMe drive with this grub command line:

GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp cgroup_disable=memory cgroup_disable=pressure cgroup_no_v1=all msr.allow_writes=on cpuidle.governor=teo"

That average of a few runs was 14.77114333 microseconds per loop. The test is the 40 ping-pong pairs one, with 30,000,000 loops per pair.

These are the test results using this grub command line:

GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp systemd.unified_cgroup_hierarchy=0 cgroup_disable=memory cgroup_disable=pressure cgroup_no_v1=all msr.allow_writes=on cpuidle.governor=teo"
24.04 on internal nvme drive (4 test runs, with result 1 being a repeat of earlier post):
Samples: 80  ; Ave: 15.00202  ; Var:  0.20312  ; S Dev:  0.45069 ; Min: 14.05610 ; Max: 15.70310 ; Range:  1.64700 ; Comp to ref:   1.56%
Samples: 80  ; Ave: 15.13712  ; Var:  0.19680  ; S Dev:  0.44362 ; Min: 14.00730 ; Max: 15.73140 ; Range:  1.72410 ; Comp to ref:   2.48%
Samples: 80  ; Ave: 14.97392  ; Var:  0.23093  ; S Dev:  0.48056 ; Min: 13.89290 ; Max: 15.70420 ; Range:  1.81130 ; Comp to ref:   1.37%
Samples: 80  ; Ave: 15.14079  ; Var:  0.17549  ; S Dev:  0.41892 ; Min: 14.10390 ; Max: 15.80830 ; Range:  1.70440 ; Comp to ref:   2.50%

20.04 on internal nvme drive (4 test runs):
Samples: 80  ; Ave: 14.83415  ; Var:  0.21711  ; S Dev:  0.46595 ; Min: 13.88660 ; Max: 15.56870 ; Range:  1.68210 ; Comp to ref:   0.43%
Samples: 80  ; Ave: 14.83119  ; Var:  0.21669  ; S Dev:  0.46550 ; Min: 13.82980 ; Max: 15.52110 ; Range:  1.69130 ; Comp to ref:   0.41%
Samples: 80  ; Ave: 15.00895  ; Var:  0.11571  ; S Dev:  0.34017 ; Min: 13.84090 ; Max: 15.60840 ; Range:  1.76750 ; Comp to ref:   1.61%
Samples: 80  ; Ave: 14.92387  ; Var:  0.15777  ; S Dev:  0.39720 ; Min: 14.04810 ; Max: 15.59750 ; Range:  1.54940 ; Comp to ref:   1.03%

For an earlier suggestion:

grep . /proc/sys/kernel/sched*

the results are exactly the same between 20.04 and 24.04:

/proc/sys/kernel/sched_autogroup_enabled:1
/proc/sys/kernel/sched_cfs_bandwidth_slice_us:5000
/proc/sys/kernel/sched_child_runs_first:0
/proc/sys/kernel/sched_deadline_period_max_us:4194304
/proc/sys/kernel/sched_deadline_period_min_us:100
/proc/sys/kernel/sched_energy_aware:1
/proc/sys/kernel/sched_rr_timeslice_ms:100
/proc/sys/kernel/sched_rt_period_us:1000000
/proc/sys/kernel/sched_rt_runtime_us:950000
/proc/sys/kernel/sched_schedstats:0
/proc/sys/kernel/sched_util_clamp_max:1024
/proc/sys/kernel/sched_util_clamp_min:1024
/proc/sys/kernel/sched_util_clamp_min_rt_default:1024

Now that we have isolated the major difference down to systemd.unified_cgroup_hierarchy=0, the average degradation for the 40-pair ping-pong test on 24.04 is about 16.5%.

I do not know how to investigate further.

It seems mainline kernel 6.5 does not suffer from the main performance degradation issue of this thread, but mainline kernel 6.6-rc1 does. Therefore, starting points for a kernel bisection are defined.
I did this latest work on a Debian installation, and my build environment is on my 20.04 server. If I am going to bisect the kernel, then I’d want to set up kernel compile capability on the 24.04 server and go from there. It would take me a while (like a week).
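For the record, the bisection itself would be roughly this (a sketch, using the good/bad kernels identified above; each step needs a build, install, reboot and test run):

$ cd ~/kernel/linux
$ git bisect start
$ git bisect bad v6.6-rc1
$ git bisect good v6.5
# ... build, boot and test the checked-out kernel, then mark it:
$ git bisect good    # or: git bisect bad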

Wait wait wait… in 6.6 we had the switch from CFS to EEVDF! Basically 6.6 has a totally different scheduler. So, even if the settings are identical it might do something completely different and the cgroup hierarchy can definitely affect the overall performance.

Very interesting. I only have a couple of steps left in the kernel bisection and did notice this:

Bisecting: 3 revisions left to test after this (roughly 2 steps)
[147f3efaa24182a21706bca15eab2f3f4630b5fe] sched/fair: Implement an EEVDF-like scheduling policy

Anyway, at this point I’ll finish the bisection.

EDIT: Indeed, the scheduler change is the issue for this workflow:

doug@s19:~/kernel/linux$ git bisect good
147f3efaa24182a21706bca15eab2f3f4630b5fe is the first bad commit
commit 147f3efaa24182a21706bca15eab2f3f4630b5fe
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed May 31 13:58:44 2023 +0200

    sched/fair: Implement an EEVDF-like scheduling policy

    Where CFS is currently a WFQ based scheduler with only a single knob,
    the weight. The addition of a second, latency oriented parameter,
    makes something like WF2Q or EEVDF based a much better fit.

    Specifically, EEVDF does EDF like scheduling in the left half of the
    tree -- those entities that are owed service. Except because this is a
    virtual time scheduler, the deadlines are in virtual time as well,
    which is what allows over-subscription.

    EEVDF has two parameters:

     - weight, or time-slope: which is mapped to nice just as before

     - request size, or slice length: which is used to compute
       the virtual deadline as: vd_i = ve_i + r_i/w_i

    Basically, by setting a smaller slice, the deadline will be earlier
    and the task will be more eligible and ran earlier.

    Tick driven preemption is driven by request/slice completion; while
    wakeup preemption is driven by the deadline.

    Because the tree is now effectively an interval tree, and the
    selection is no longer 'leftmost', over-scheduling is less of a
    problem.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230531124603.931005524@infradead.org

 include/linux/sched.h   |   4 +
 kernel/sched/core.c     |   1 +
 kernel/sched/debug.c    |   6 +-
 kernel/sched/fair.c     | 338 +++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/features.h |   3 +
 kernel/sched/sched.h    |   4 +-
 6 files changed, 308 insertions(+), 48 deletions(-)

While the scheduler change seems to be the original reason for the performance change, there is also the effect of systemd.unified_cgroup_hierarchy=0. After working on this intensely through January and February, I haven’t been able to get back to it in March. I still hope to return to this at some point.

I did observe this:

Without systemd.unified_cgroup_hierarchy=0 (yes, it seems backwards):
doug@s19:~$ cat /proc/cgroups | column -t
#subsys_name  hierarchy  num_cgroups  enabled
cpuset        0          73           1
cpu           0          73           1
cpuacct       0          73           1
blkio         0          73           1
memory        0          73           1
devices       0          73           1
freezer       0          73           1
net_cls       0          73           1
perf_event    0          73           1
net_prio      0          73           1
hugetlb       0          73           1
pids          0          73           1
rdma          0          73           1
misc          0          73           1

and

with systemd.unified_cgroup_hierarchy=0:

doug@s19:~$ cat /proc/cgroups | column -t
#subsys_name  hierarchy  num_cgroups  enabled
cpuset        8          1            1
cpu           7          1            1
cpuacct       7          1            1
blkio         9          1            1
memory        2          72           1
devices       4          32           1
freezer       13         1            1
net_cls       6          1            1
perf_event    3          1            1
net_prio      6          1            1
hugetlb       5          1            1
pids          11         36           1
rdma          10         1            1
misc          12         1            1