24.04 considerably slower than 20.04 or 22.04 for some high system percentage usage cases

Okay, now we are getting somewhere. This is the grub command line:

GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp systemd.unified_cgroup_hierarchy=0 cgroup_disable=memory cgroup_disable=pressure cgroup_no_v1=all msr.allow_writes=on cpuidle.governor=teo"

and now on 24.04 I get:

Samples: 80  ; Ave: 15.00202  ; Var:  0.20312  ; S Dev:  0.45069 ; Min: 14.05610 ; Max: 15.70310 ; Range:  1.64700 ; Comp to ref:   1.56%

Instead of the ~20% worse range from the earlier post.

It really seems to be cgroup related then. Have you tried the same boot options also on 20.04?

The new “Master” reference average, used in an earlier post, was on 20.04 on an internal nvme drive with this grub command line:

GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp cgroup_disable=memory cgroup_disable=pressure cgroup_no_v1=all msr.allow_writes=on cpuidle.governor=teo"

That average of a few runs was 14.77114333 uSec per loop. The test is the one with 40 ping-pong pairs and 30,000,000 loops per pair.
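
(The “Comp to ref” column appears to be each run’s average expressed as a percentage difference from that reference; for example, (15.00202 - 14.77114333) / 14.77114333 ≈ +1.56%.)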

These are the test results using this grub command line:

GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp systemd.unified_cgroup_hierarchy=0 cgroup_disable=memory cgroup_disable=pressure cgroup_no_v1=all msr.allow_writes=on cpuidle.governor=teo"
24.04 on internal nvme drive (4 test runs; result 1 is a repeat from the earlier post):
Samples: 80  ; Ave: 15.00202  ; Var:  0.20312  ; S Dev:  0.45069 ; Min: 14.05610 ; Max: 15.70310 ; Range:  1.64700 ; Comp to ref:   1.56%
Samples: 80  ; Ave: 15.13712  ; Var:  0.19680  ; S Dev:  0.44362 ; Min: 14.00730 ; Max: 15.73140 ; Range:  1.72410 ; Comp to ref:   2.48%
Samples: 80  ; Ave: 14.97392  ; Var:  0.23093  ; S Dev:  0.48056 ; Min: 13.89290 ; Max: 15.70420 ; Range:  1.81130 ; Comp to ref:   1.37%
Samples: 80  ; Ave: 15.14079  ; Var:  0.17549  ; S Dev:  0.41892 ; Min: 14.10390 ; Max: 15.80830 ; Range:  1.70440 ; Comp to ref:   2.50%

20.04 on internal nvme drive (4 test runs):
Samples: 80  ; Ave: 14.83415  ; Var:  0.21711  ; S Dev:  0.46595 ; Min: 13.88660 ; Max: 15.56870 ; Range:  1.68210 ; Comp to ref:   0.43%
Samples: 80  ; Ave: 14.83119  ; Var:  0.21669  ; S Dev:  0.46550 ; Min: 13.82980 ; Max: 15.52110 ; Range:  1.69130 ; Comp to ref:   0.41%
Samples: 80  ; Ave: 15.00895  ; Var:  0.11571  ; S Dev:  0.34017 ; Min: 13.84090 ; Max: 15.60840 ; Range:  1.76750 ; Comp to ref:   1.61%
Samples: 80  ; Ave: 14.92387  ; Var:  0.15777  ; S Dev:  0.39720 ; Min: 14.04810 ; Max: 15.59750 ; Range:  1.54940 ; Comp to ref:   1.03%

For an earlier suggestion:

grep . /proc/sys/kernel/sched*

the results are exactly the same between 20.04 and 24.04:

/proc/sys/kernel/sched_autogroup_enabled:1
/proc/sys/kernel/sched_cfs_bandwidth_slice_us:5000
/proc/sys/kernel/sched_child_runs_first:0
/proc/sys/kernel/sched_deadline_period_max_us:4194304
/proc/sys/kernel/sched_deadline_period_min_us:100
/proc/sys/kernel/sched_energy_aware:1
/proc/sys/kernel/sched_rr_timeslice_ms:100
/proc/sys/kernel/sched_rt_period_us:1000000
/proc/sys/kernel/sched_rt_runtime_us:950000
/proc/sys/kernel/sched_schedstats:0
/proc/sys/kernel/sched_util_clamp_max:1024
/proc/sys/kernel/sched_util_clamp_min:1024
/proc/sys/kernel/sched_util_clamp_min_rt_default:1024

Now that we have isolated the major difference down to systemd.unified_cgroup_hierarchy=0, the average degradation for the 40-pair ping-pong test on 24.04 is about 16.5%.
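
A quick way to confirm which cgroup hierarchy is actually in use after a reboot:

stat -fc %T /sys/fs/cgroup/
# "cgroup2fs" = the pure v2 unified hierarchy (the 24.04 default);
# "tmpfs" = the v1/hybrid layout selected by systemd.unified_cgroup_hierarchy=0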

I do not know how to investigate further.

It seems mainline kernel 6.5 does not suffer from the main performance degradation issue of this thread, but mainline kernel 6.6-rc1 does. Therefore the starting points for a kernel bisection are defined.
I did this latest work on a Debian installation, and my build environment is on my 20.04 server. If I am going to bisect the kernel, then I’d want to set up kernel compile ability on the 24.04 server and go from there. It would take me a while (like a week).
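
For reference, the bisection itself goes something like this (a sketch, assuming a mainline clone and the v6.5 / v6.6-rc1 endpoints noted above):

cd ~/kernel/linux
git bisect start
git bisect bad v6.6-rc1     # first known bad kernel
git bisect good v6.5        # last known good kernel
# then, for each suggested commit: build, install, and boot it,
# run the 40-pair ping-pong test, and report the verdict with
# "git bisect good" or "git bisect bad" until the first bad commit is found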

Wait wait wait… in 6.6 we had the switch from CFS to EEVDF! Basically, 6.6 has a totally different scheduler. So even if the settings are identical, it might do something completely different, and the cgroup hierarchy can definitely affect the overall performance.


Very interesting. I only have a couple of steps left in the kernel bisection and did notice this:

Bisecting: 3 revisions left to test after this (roughly 2 steps)
[147f3efaa24182a21706bca15eab2f3f4630b5fe] sched/fair: Implement an EEVDF-like scheduling policy

Anyway, at this point I’ll finish the bisection.

EDIT: Indeed, the scheduler change is the issue for this workflow:

doug@s19:~/kernel/linux$ git bisect good
147f3efaa24182a21706bca15eab2f3f4630b5fe is the first bad commit
commit 147f3efaa24182a21706bca15eab2f3f4630b5fe
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed May 31 13:58:44 2023 +0200

    sched/fair: Implement an EEVDF-like scheduling policy

    Where CFS is currently a WFQ based scheduler with only a single knob,
    the weight. The addition of a second, latency oriented parameter,
    makes something like WF2Q or EEVDF based a much better fit.

    Specifically, EEVDF does EDF like scheduling in the left half of the
    tree -- those entities that are owed service. Except because this is a
    virtual time scheduler, the deadlines are in virtual time as well,
    which is what allows over-subscription.

    EEVDF has two parameters:

     - weight, or time-slope: which is mapped to nice just as before

     - request size, or slice length: which is used to compute
       the virtual deadline as: vd_i = ve_i + r_i/w_i

    Basically, by setting a smaller slice, the deadline will be earlier
    and the task will be more eligible and ran earlier.

    Tick driven preemption is driven by request/slice completion; while
    wakeup preemption is driven by the deadline.

    Because the tree is now effectively an interval tree, and the
    selection is no longer 'leftmost', over-scheduling is less of a
    problem.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230531124603.931005524@infradead.org

 include/linux/sched.h   |   4 +
 kernel/sched/core.c     |   1 +
 kernel/sched/debug.c    |   6 +-
 kernel/sched/fair.c     | 338 +++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/features.h |   3 +
 kernel/sched/sched.h    |   4 +-
 6 files changed, 308 insertions(+), 48 deletions(-)

While the scheduler change seems to be the original reason for the performance change, there is also the effect of systemd.unified_cgroup_hierarchy=0. After working on this intensely for January/February, I haven’t been able to get back to it in March. I still hope to get back to this at some point.

I did observe this:

NOT present: systemd.unified_cgroup_hierarchy=0
(Yes, this seems backwards: under the pure cgroup v2 unified hierarchy, the v1 hierarchy IDs in /proc/cgroups all read 0, because no controller is attached to a v1 hierarchy.)
doug@s19:~$ cat /proc/cgroups | column -t
#subsys_name  hierarchy  num_cgroups  enabled
cpuset        0          73           1
cpu           0          73           1
cpuacct       0          73           1
blkio         0          73           1
memory        0          73           1
devices       0          73           1
freezer       0          73           1
net_cls       0          73           1
perf_event    0          73           1
net_prio      0          73           1
hugetlb       0          73           1
pids          0          73           1
rdma          0          73           1
misc          0          73           1

and

systemd.unified_cgroup_hierarchy=0

doug@s19:~$ cat /proc/cgroups | column -t
#subsys_name  hierarchy  num_cgroups  enabled
cpuset        8          1            1
cpu           7          1            1
cpuacct       7          1            1
blkio         9          1            1
memory        2          72           1
devices       4          32           1
freezer       13         1            1
net_cls       6          1            1
perf_event    3          1            1
net_prio      6          1            1
hugetlb       5          1            1
pids          11         36           1
rdma          10         1            1
misc          12         1            1

Hello @dsmythies ! I would like to replicate the graph you kindly published myself. Where can I find the test C program used to create it? Thanks! :slight_smile:

Coincidentally, and after spending the last month working on something else in this same area, I am just creating a new graph to post here. I create the graph using a spreadsheet from the outputs of a program. The program is run from a script. I can post the stuff here later.


I never did get back to this, but was just working on something in this area of the kernel. The cgroups stuff is still really expensive for some types of workflow.

40 ping-pong pairs comparison. Kernel 6.13, 2025.01.24.
Average uSec per loop for 30 million loops, 40 pairs:

noauto-nodelay     16.4521   reference
noauto-nodelay-2   16.4241    -0.16%
noauto-delay       15.7742    -4.14%
noauto-delay-2     15.7481    -4.29%
auto-nodelay       24.4523   +48.78%
auto-nodelay-2     24.6121   +49.75%
auto-delay         19.9973   +21.75%
auto-delay-2       19.9500   +21.47%

Where:
noauto = “cgroup_disable=cpu noautogroup” on grub command line
nodelay = NO_DELAY_DEQUEUE (echo NO_DELAY_DEQUEUE | sudo tee /sys/kernel/debug/sched/features)
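
(Here auto and delay are the corresponding defaults, i.e. booting without those options.) A quick way to check which scheduler feature flag is currently active (assuming debugfs is mounted, as it is by default on Ubuntu):

sudo cat /sys/kernel/debug/sched/features | tr ' ' '\n' | grep DELAY_DEQUEUE
# prints DELAY_DEQUEUE (the default in recent kernels) or NO_DELAY_DEQUEUE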

Graphically: [graph image omitted]

Earlier @simonhf asked for the program and such.
Disclaimer: this stuff was never intended to be seen by others and it is a total hack job.

I keep all the C code and scripts and such in the same directory. The command to create the data for one of the graph lines is:

./ping-pong-many-parallel 0 30000000 40
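
(This assumes the pingpong binary has already been built in the same directory; the build step isn’t shown in this thread, but a plain compile along these lines should work:)

gcc -o pingpong pingpong.c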

And the CPU frequency scaling governor should be set to performance for this work.
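
One way to set that, assuming the usual cpufreq sysfs interface:

echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor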

The script source code is:

#! /bin/dash
#
# ping-pong-many-parallel Smythies 2024.01.23
#       assume the ping pong program is local.
#
# ping-pong-many-parallel Smythies 2022.10.23
#       update required to reflect changes to program
#
# ping-pong-many-parallel Smythies 2022.10.09
#       Launch parallel ping-pong pairs.

# because I always forget from last time
killall pingpong

# Create the named pipes, first removing any that already exist.

COUNTER=0
POINTER1=0
POINTER2=1
while [ $COUNTER -lt $3 ];
do
   if [ -p /dev/shm/pong$POINTER1 ]
   then
     rm /dev/shm/pong$POINTER1
   fi
   mkfifo /dev/shm/pong$POINTER1

   POINTER1=$(($POINTER1+1000))
   POINTER2=$(($POINTER2+1000))
   COUNTER=$(($COUNTER+1))
done

COUNTER=0
POINTER1=0
POINTER2=1
while [ $COUNTER -lt $3 ];
do
   ./pingpong /dev/shm/pong$POINTER1 /dev/shm/pong$POINTER2 $1 $2 &
   ./pingpong /dev/shm/pong$POINTER2 /dev/shm/pong$POINTER1 $1 $2 1 &

   POINTER1=$(($POINTER1+1000))
   POINTER2=$(($POINTER2+1000))
   COUNTER=$(($COUNTER+1))
done

And the C program source code is:

/******************************************************
/*
/* pingpong.c Smythies 2022.10.21
/*      Using stdin and stdout redirection for this
/*      program is a problem. The program doesn't start
/*      execution until there is something in the
/*      stdin redirected queue, so trying to start
/*      things via the last flag doesn't work.
/*      Try treating the incoming and outgoing named
/*      as files opened herein. This will also allow
/*      timeout management as a future edit.
/*
/* pingpong.c Smythies 2022.10.20
/*      Use the new "last" flag to also start the
/*      token passing.
/*
/* pingpong.c Smythies 2022.10.19
/*      If the delay between the last read of the
/*      first token and the write from the last place
/*      in the chain of stuff is large enough then the
/*      first instance of the program might have terminated
/*      and shutdown the read pipe, resulting in a SIGPIPE
/*      signal. With no handler it causes the program to
/*      terminate.
/*      Add an optional command line parameter to indicate if
/*      this instance of the program is the last one and
/*      therefore it should not attempt to pass along the
/*      last token.
/*
/* pingpong.c Smythies 2021.10.26
/*      Everything works great as long as the number
/*      of stops in the token passing ring is small
/*      enough. However, synchronization issues
/*      develop if the number of stops gets big enough.
/*      Introduce a synchronizing step, after which
/*      there should not be any EOF return codes.
/*
/* pingpong.c Smythies 2021.10.24
/*      Print loop number and error code upon error
/*      exit. Exit on 1st error. Was 3rd.
/*
/* pingpong.c Smythies 2021.10.23
/*      Change to using CLOCK_MONOTONIC_RAW instead of
/*      gettimeofday, as it doesn't have any
/*      adjustments.
/*      Change to nanoseconds.
/*
/* pingpong.c Smythies 2021.07.31
/*      Add write error check.
/*
/* pingpong.c Smythies 2021.07.24
/*      Exit after a few errors.
/*
/* pingpong.c Smythies 2021.07.23
/*      Add execution time.
/*
/* pingpong.c Smythies 2020.12.07
/*      Add an outer loop counter command line option.
/*      Make it optional, so as not to break my existing
/*      scripts.
/*
/* pingpong.c Smythies 2020.06.21
/*      The original code is from Alexander.
/*      (See: https://marc.info/?l=linux-kernel&m=159137588213540&w=2)
/*      But, it seems to get out of sync in my application.
/*      Start this history header.
/*      I can only think of some error return.
/*      Add some error checking, I guess.
/*
/******************************************************/

#include <sys/time.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <limits.h>
#include <errno.h>
#include <string.h>
//#include <signal.h>
//#include <sys/wait.h>
//#include <linux/unistd.h>

#define MAX_ERRORS 2
/* Arbitrary */
#define SYNC_LOOPS 3

unsigned long long stamp(void){
   struct timespec tv;

   clock_gettime(CLOCK_MONOTONIC_RAW,&tv);

   return (unsigned long long)tv.tv_sec * 1000000000 + tv.tv_nsec;
} /* endprocedure */

int main(int argc, char **argv){
   unsigned long long tend, tstart;
   long i, j, k, n, m;
   long eof_count = 0;
   int error_count = 0;
   int err, inf, outf, errvalue;
   int last = 0;
   char c = '\n';
   char *infile, *outfile;

//   fprintf(stderr, "begin...\n");

   switch(argc){
   case 4:
      infile = argv[1];
      outfile = argv[2];
      n = atol(argv[3]);
      m = LONG_MAX;
      break;
   case 5:
      infile = argv[1];
      outfile = argv[2];
      n = atol(argv[3]);
      m = atol(argv[4]);
      break;
   case 6:
      infile = argv[1];
      outfile = argv[2];
      n = atol(argv[3]);
      m = atol(argv[4]);
      last = atoi(argv[5]);
      break;
   default:
      printf("%s : Useage: pingpong infifo outfifo inner_loop [optional outer_loop [optional last flag]]\n", argv[0]);
      return -1;
   } /* endcase */

//   printf(" infile: %s  ; outfile: %s  ; %d\n", infile, outfile, last);

   if(last != 1){  // for all but the last, create the named pipe outfile
      err = mkfifo(outfile, 0666);
      if ((err != 0) && (errno != EEXIST)){ // file already exists is OK
         errvalue = errno;
         printf("Cannot create output fifo file: %s ; %d ; %s\n", outfile, err, strerror(errvalue));
         return -1;
      } /* endif */
   } else {   // for the last we open the write first, read should already be open.
      if ((outf = open(outfile, O_WRONLY))  == -1){
         errvalue = errno;
         printf("Cannot open last output fifo file: %s ; %d ; %s\n", outfile, outf, strerror(errvalue));
         return -1;
      } /* endif */
   } /* endif */

   if ((inf = open(infile, O_RDONLY)) == -1){
      errvalue = errno;
      printf("Cannot open input fifo file: %s ; %d ; %s\n", outfile, inf, strerror(errvalue));
      return -1;
   } /* endif */

   if(last != 1){  // for all but the last, now we open the write
//   if ((outf = open(outfile, O_WRONLY | O_NONBLOCK))  == -1){
      if ((outf = open(outfile, O_WRONLY))  == -1){
         errvalue = errno;
         printf("Cannot open not last output fifo file: %s ; %d ; %s\n", outfile, outf, strerror(errvalue));
         return -1;
      } /* endif */
   } /* endif */

   if(last == 1){  // the last chain initiates the token passing
//      usleep(999999);
      err = write(outf, &c, 1);
      if(err != 1){
         fprintf(stderr, "pingpong write error on startup, aborting. %d  %d  %d\n", last, err, outf);
         return -1;
      } /* endif */
   } /* endif */

//   printf("flag 4: inf: %d  ; outf: %d  ; %d \n", inf, outf, last);

/* make sure we are synchronized. EOF (0 return code) can occur until we are */

   j = SYNC_LOOPS;
   while(j > 0) {  // for SYNC_LOOP successful loops do:
      err = read(inf, &c, 1);
      if(err == 1){
         j--;        // don't decrement for EOF.
         for (i = n; i; i--){  // we also attempt to sync in time for later T start
            k = i;   // busy-work only; result intentionally unused
            k++;
         } /* endfor */
         err = write(outf, &c, 1);
         if(err != 1){ // and then pass the token along to the next pipeline step.
            fprintf(stderr, "pingpong sync step: write error or timeout to named pipe. (error code: %d ; loops left: %ld ; last: %d)\n", err, j, last);
            return -1;
         } /* endif */
      } else {
         if(err < 0){
            fprintf(stderr, "pingpong sync step: read error or timeout from named pipe. (error code: %d ; loops left: %ld ; last: %d)\n", err, j, last);
            return -1;
         } else {
            eof_count++;  // does the loop counter need to be reset??
         } /* endif */
      } /* endif */
   } /* endwhile */

//   printf(" infile: %s  ; outfile: %s  ; last: %d; eof_count %ld\n", infile, outfile, last, eof_count);

/* now we are synchronized, or so I claim. Get on with the real work. EOF is an error now.*/

   j = m;
   tstart = stamp(); /* only start the timer once synchronized */
   while(j > 0) {  // for outer_loop times do:
      err = read(inf, &c, 1);
      if(err == 1){
         for (i = n; i; i--){  // for each token, do a packet of work.
            k = i;   // busy-work only; result intentionally unused
            k++;
         } /* endfor */
         err = write(outf, &c, 1);
         if(err != 1){ // and then pass the token along to the next pipeline step.
            fprintf(stderr, "pingpong write error or timeout to named pipe. (error code: %d ; loops left: %ld ; EOFs: %ld ; last: %d)\n", err, j, eof_count, last);
            error_count++;
            if(error_count >= MAX_ERRORS) return -1;
         } /* endif */
      } else {
         error_count++;
         fprintf(stderr, "pingpong read error or timeout from named pipe. (error code: %d ; loops left: %ld ; EOFs: %ld ; last: %d)\n", err, j, eof_count, last);
         if(error_count >= MAX_ERRORS) return -1;
      } /* endif */
//      if(j <= 3) fprintf(stderr, "Loop: %ld ; EOFs: %ld\n", j, eof_count);
      j--;
   } /* endwhile */
   tend = stamp();  // the timed portion is done

/* Now we do one token pass to flush. The previous write pipe may have already been terminated, so EOF read response is O.K. */

   err = read(inf, &c, 1);
   if(err == 1){
      if(last != 1){  // last in the chain does not pass along the last token
         err = write(outf, &c, 1);
         if(err != 1){ // and then pass the token along to the next pipeline step.
            fprintf(stderr, "pingpong flush loop: write error or timeout to named pipe. (error code: %d ; EOFs: %ld ; last: %d)\n", err, eof_count, last);
         } /* endif */
      } /* endif */
   } else {
      fprintf(stderr, "pingpong flush loop: read error or timeout from named pipe. (error code: %d ; EOFs: %ld ; last: %d)\n", err, eof_count, last);
   } /* endif */

   fprintf(stderr,"%.4f usecs/loop. EOFs: %ld\n",(double)(tend-tstart)/((double) m * 1000.0), eof_count);
   close(outf);
   close(inf);
   return -1;
//   return 0;
} /* endprogram */

EDIT: The script and program are also posted over on the Ubuntu forums, along with more thorough how-to instructions.


You might consider moving those more thorough how-to instructions to somewhere here, as the forums are set for eventual archival, even if it’s just a copy-paste.


Perhaps consider the Tutorials section with all the relevant instructions and some test cases?


Thanks for sharing the source code @dsmythies ! Here is my replication attempt :slight_smile:

I had the following VMware VMs available to me (both running on the same underlying type of hardware) to try and reproduce Doug’s results:

  1. Ubuntu 24.04.1 LTS with 2 vCPUs
  2. Ubuntu 20.04.1 LTS with 2 vCPUs

With only 2 vCPUs, the instances took much longer than the 8 or 9 minutes to run that Doug reported, so I used 3M loops instead of 30M, which reduced the running time from a couple of hours to ~10 minutes or so.

$ dash ./ping-pong-many-parallel 0 3000000 40 # ubuntu 24
pingpong: no process found
237.7967 usecs/loop. EOFs: 0
237.7967 usecs/loop. EOFs: 0
238.4970 usecs/loop. EOFs: 0
...

$ dash ./ping-pong-many-parallel 0 3000000 40 # ubuntu 20
pingpong: no process found
190.8806 usecs/loop. EOFs: 0
190.8806 usecs/loop. EOFs: 0
191.0145 usecs/loop. EOFs: 0
...

Ubuntu 24 appears to be up to 26.2% slower than Ubuntu 20.

Why am I trying out Doug’s code? Because I recently noticed that Redis appears to be running ~12% slower on Ubuntu 24 than on Ubuntu 20 too. I’d love to discover a way to make Redis on Ubuntu 24 have similar performance to Redis on Ubuntu 20. Any ideas how?
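
(The exact Redis benchmark isn’t shown here; as an illustration, the stock redis-benchmark tool against a locally running redis-server gives a quick, repeatable number along these lines:)

redis-benchmark -t set,get -n 1000000 -q
# -q prints just the requests/sec summary for the SET and GET tests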

As I had also mentioned in our off-line email thread, please try booting with “cgroup_disable=cpu noautogroup” added to whatever else is already on the “GRUB_CMDLINE_LINUX_DEFAULT=” line in “/etc/default/grub”. See if that makes a difference to your Redis thing. You could also do the pingpong test with it to compare with the other tests.
In my tests (see post from a few days ago) it made a huge difference.
Here is an example of the grub line with my other stuff included:

GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp cgroup_disable=cpu noautogroup msr.allow_writes=on cpuidle.governor=teo"
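
After editing /etc/default/grub, the usual steps apply:

sudo update-grub     # regenerate the grub configuration
sudo reboot
cat /proc/cmdline    # confirm the new parameters actually took effect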


@simonhf : Additionally, you could use “systemd.unified_cgroup_hierarchy=0 noautogroup” on the grub command line, as the improvement is the same. Some might be aware of “/proc/sys/kernel/sched_autogroup_enabled” as a way to toggle autogroup on and off, but it doesn’t always give the expected result. In the below graph:

nocpu = grub “cgroup_disable=cpu”
noauto = grub “noautogroup”
nosched = echo 0 | sudo tee /proc/sys/kernel/sched_autogroup_enabled
sched = echo 1 | sudo tee /proc/sys/kernel/sched_autogroup_enabled
nohierarchy = grub “systemd.unified_cgroup_hierarchy=0”

The reference for the relative percentages is an average of the 4 good nocpu runs; it had an average loop time of 15.778 uSec for the 40 pairs, a minimum of 14.330 (pair 1), and a maximum of 17.215 (pair 40) uSec per loop.

Hi @dsmythies !

Thanks for the suggestion to modify GRUB_CMDLINE_LINUX_DEFAULT which I tried:

After booting with the modification, I got:

$ lsb_release -a | egrep Description
Description:	Ubuntu 24.04.1 LTS

$ cat /etc/default/grub | egrep GRUB_CMDLINE_LINUX_DEFAULT
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash rootdelay=120 cgroup_disable=cpu noautogroup"

$ cat /proc/sys/kernel/sched_autogroup_enabled
1

Note: sched_autogroup_enabled is also set to 1 on the non-modified Ubuntu 24.


So I’m not really sure how to determine whether the GRUB_CMDLINE_LINUX_DEFAULT change really did anything?! :slight_smile:

I re-ran the Redis benchmark, but no improvement. And I re-ran your C program; it was fractionally worse this time: 245.5984 usecs/loop vs 238.6951 usecs/loop without the modification.

Any ideas what I might be doing wrong, or why the change works for you, but not for me?

Update: @dsmythies I also re-ran the C program after explicitly setting sched_autogroup_enabled but this also made no difference to the results :frowning:

$ echo 0 | sudo tee /proc/sys/kernel/sched_autogroup_enabled
0
$ cat /proc/sys/kernel/sched_autogroup_enabled
0

Update: @dsmythies I also tried booting with this change, but again no big difference in the performance of Redis or the C program:

$ cat /etc/default/grub | egrep GRUB_CMDLINE_LINUX_DEFAULT
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash rootdelay=120 systemd.unified_cgroup_hierarchy=0 noautogroup"

$ cat /proc/sys/kernel/sched_autogroup_enabled
1

And again, explicitly setting sched_autogroup_enabled made no big difference to the performance of Redis or the C program.

$ echo 0 | sudo tee /proc/sys/kernel/sched_autogroup_enabled
0

$ cat /proc/sys/kernel/sched_autogroup_enabled
0

Frustrating :frowning: