Benchmarking a distribution (and some -O3 results)

In our previous Foundations update, @mclemenceau shared an update on some of the experiments we conducted during the Plucky Puffin cycle, in particular our decision to revert the change to the -O3 optimization level by default on amd64. As promised, here are a few more details about what went into this project, the methods we used and the challenges we encountered.

Results of -O3 experiment tl;dr

Performance is slightly worse and packages are slightly bigger with -O3 by default, so we will revert the change early in the next development cycle.

Our experiments

When we first profiled an -O3 rebuild of noble, the results were mixed but hinted at a small improvement. We decided to enable -O3 by default for packages built in plucky and continued investigating.

Benchmarking a distribution

Benchmarking is hard at the best of times. Benchmarking a distribution like Ubuntu, which contains nearly 40 thousand packages at the time of writing, is clearly even harder!

Obviously it would be – in some sense – possible to write a benchmark for each piece of software, run it in a carefully controlled way and construct some kind of dashboard to alert us to interesting changes in performance. But equally obviously, this is not in any way feasible.

All we can realistically do is find or adapt some benchmarks, run them with and without the proposed changes and assume (hope!) that these are somewhat representative of the impact on the distribution as a whole.

Finding some benchmarks

In this space, the best-known option is the Phoronix Test Suite and the related set of tests maintained on openbenchmarking.org, which contains many different benchmarks of many different pieces of software.

However, not every benchmark is suited to testing the sort of change we are interested in here. For example, the pts/compress-zstd test downloads the zstd upstream source, builds it with upstream’s default optimization flags and then benchmarks it – which, while interesting if you want to compare something like kernel configuration choices, is not useful when analyzing a change to the default flags used during a package build.

Fortunately for us, there are some tests, such as system/cryptsetup-1.01, which are designed to test packages from the distribution. There are also quite a few tests that are fairly easy to convert / hack so that they install packages from the distribution and test those instead. You can see the tests we have been running in https://github.com/canonical/ubuntu-pts-selection (and, if you are extra curious, the changes we made to test distro packages).

We would welcome contributions to add tests of more packages here!

Rebuilding the distribution

Rather than comparing benchmark results from the moving target of plucky (with -O3 enabled) with the released version of oracular (which had -O2 as the default), we instead rebuilt oracular with -O3 as the default.

The first part of this is to change the dpkg source so that dpkg-buildflags will use the new flags by default, which is fairly routine packaging work, and upload it to a PPA.
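
To sanity-check what flags a build will actually get, you can simply ask dpkg-buildflags. Here is a minimal sketch, assuming a Debian/Ubuntu system with dpkg-dev installed; the DEB_CFLAGS_APPEND override shown at the end is a standard dpkg-buildflags feature, not part of our change:

    import os
    import subprocess

    def cflags(extra_env=None):
        # Ask dpkg-buildflags (part of dpkg-dev) what CFLAGS a package build would get.
        env = dict(os.environ, **(extra_env or {}))
        result = subprocess.run(
            ["dpkg-buildflags", "--get", "CFLAGS"],
            env=env, capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    # Default flags, as set by dpkg's vendor defaults (these include -O2 today;
    # the experiment changed that default to -O3 in the PPA's dpkg).
    print("default:    ", cflags())

    # A per-build override; this is how an individual build can opt in or out
    # without any change to dpkg itself.
    print("with append:", cflags({"DEB_CFLAGS_APPEND": "-O3"}))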

Then we need to:

  1. make a copy of the Ubuntu archive
  2. add a dependency on the PPA mentioned above
  3. rebuild all the packages (see the sketch after this list for a sense of scale)
  4. publish them somewhere
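
For a sense of the scale of step 3, below is a rough sketch that enumerates the source packages published in a series using launchpadlib. It is purely illustrative; the real rebuilds are driven by Launchpad's archive rebuild machinery rather than a hand-rolled script, and the consumer name is just a placeholder:

    from launchpadlib.launchpad import Launchpad

    # Anonymous, read-only access is enough for listing publications.
    lp = Launchpad.login_anonymously("rebuild-survey", "production", version="devel")
    ubuntu = lp.distributions["ubuntu"]
    series = ubuntu.getSeries(name_or_version="oracular")

    published = ubuntu.main_archive.getPublishedSources(
        distro_series=series, status="Published")

    # Print a handful; walking all ~40 thousand entries over the API takes a while.
    for spph in published[:20]:
        print(spph.source_package_name, spph.source_package_version)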

The first parts of this are things that the Ubuntu project has been doing for a long time for our “ftbfs” (failed to build from source) reports. These are done a few times a cycle as a general measure of archive health and before making potentially disruptive changes like changing the default version of GCC. Historically the binary packages resulting from these builds have been discarded, but relatively recently support was added to publish the built packages as an apt archive. We can then build images from this apt archive and use them for testing.

Running the tests

As mentioned in the article I linked way back at the start of this post, benchmarking is a delicate art. Ideally we would run the tests on a variety of hardware (even laptops, if we can take advantage of Phoronix’s parameters to allow the laptop to cool down between tests), but for this round of tests we “borrowed” some servers from the Foundation cloud team at Canonical. The processors in these were 4-core AMD EPYC Rome CPUs. The whole process was a bit more manual than we would have liked, something to work on next time around.

Even though we are running a fairly small selection of tests, each run of the benchmarks still takes around 48 hours to complete!

Results!!

For our latest investigation, we re-ran the benchmarks that showed the most change in our earlier runs, and finally tried to understand what it was that had caused the changes (which can be extremely time consuming).

A selection of observations from these benchmarks:

  • xz-utils: Upstream defaults to -O2. Overriding this to -O3 showed no difference in performance; the installed size went from 717 kB to 750 kB (a 4.6% increase).
  • tiff decompression: this test highlights the issue with the size increase. The majority of the time is spent loading shared libraries, and the test regresses by 11.9%. Other tests regressed for the same reason: inkscape, rawtherapee, libjpeg-turbo, gegl, gnupg, gzip.
  • gnuradio: -O3 + LTO resulted in the execute call into libfftw3f being inlined, which decreased overall performance by 12.7%.

We could also do some broad-brush analysis of the size impact of the change:

  • the size of all debs in the archive increases by 6% when O3 is the default
  • the size of the desktop ISO increases by a bit less than 4%.
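
The archive-wide figure is essentially a sum over the Packages indices of the two archives. A minimal sketch of that comparison, assuming you have already downloaded and uncompressed the two Packages files locally (the file names below are placeholders) and have python3-debian installed:

    from debian import deb822

    def total_deb_size(packages_index):
        # Sum the Size field (compressed .deb size, in bytes) over every stanza.
        with open(packages_index) as f:
            return sum(int(p["Size"]) for p in deb822.Packages.iter_paragraphs(f))

    baseline = total_deb_size("Packages.o2")  # archive built with the -O2 default
    rebuilt = total_deb_size("Packages.o3")   # archive rebuilt with -O3
    print(f"total deb size change: {100 * (rebuilt - baseline) / baseline:+.1f}%")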

Conclusion and next steps

Distro-wide O3 does not seem to provide better performance, and it regresses performance in interactive tasks. It affects load times by up to 11.9%, which is detrimental to container workloads that strive for fast application startup.

Overall, this regression is mostly caused by the increase in the executable size. In addition, in some situations -O3 and Link Time Optimization together can result in excessive function inlining which then in turn regresses performance (presumably through register pressure or instruction cache thrashing).

It is fair to say that all of this is in line with conventional wisdom: -O3 can help in some situations but is not a sensible default. But it is definitely better to verify this than to just rely on folk knowledge! In addition, we still want to investigate more ways to improve the performance of Ubuntu, and each time we exercise our benchmarking muscles we will get better at it.


Will you open bug reports against upstream projects, recommending that they annotate certain function calls with GCC-specific noinline?

Will you open bug reports against GCC, pointing at example projects which appear to regress in performance with O3 and LTO, so that toolchain maintainers can consider whether this is a corner case or whether the compiler’s choices could be improved?

Will the dpkg build flags default to O3 be reverted for amd64?

Will the dpkg build flags default to O3 be reverted for ppc64le? It has been on since the port’s inception. I wonder if this ongoing choice has hindered ppc64le market share growth.

Will packages that got built with O3 be rebuilt with O2?

Looking at Ubuntu.pm « Vendor « Dpkg « scripts in ubuntu/+source/dpkg

I don’t think we have resources to do this in any systematic way.

Ditto. I don’t think GCC upstream would be surprised about any of this.

When QQ opens, yes (this is in the post, by the way).

We enabled O3 because IBM asked us to (as you well know). Maybe we should check with them if they wish to reconsider.

Towards the end of the 25.10 cycle we should check how many packages have not been rebuilt, i.e. how many packages that are still current were built with a dpkg that defaulted to O3, and make some kind of decision about what to do. We are not going to worry about whether packages in 25.04 are built with O3 or not at this point in the release cycle.