In our previous Foundations update @mclemenceau shared an update on some of the experiments we conducted during the Plucky Puffin cycle, in particular about our decision to revert the change to the -O3 optimization level by default on amd64. As promised, here are a few more details about what went into this project, the methods used and the challenges encountered.
Results of the -O3 experiment: tl;dr
Performance is slightly worse and packages are slightly bigger with -O3 by default, and we will revert the change early in the next development cycle.
Our experiments
When we first profiled an -O3 rebuild of noble, the results were mixed but hinted at a small improvement. We decided to enable -O3 by default for packages built in plucky and continued investigating.
Benchmarking a distribution
Benchmarking is hard at the best of times. Benchmarking a distribution like Ubuntu, which contains nearly 40 thousand packages at the time of writing, is clearly even harder!
Obviously it would be – in some sense – possible to write a benchmark for each piece of software, run it in a carefully controlled way and construct some kind of dashboard to alert us to interesting changes in performance. But equally obviously, this is not in any way feasible.
All we can realistically do is find or adapt some benchmarks, run them with and without the proposed changes and assume (hope!) that these are somewhat representative of the impact on the distribution as a whole.
Finding some benchmarks
In this space, the best-known option is the Phoronix Test Suite and the related set of tests maintained on openbenchmarking.org, which contains many different benchmarks of many different pieces of software.
However, not every benchmark is suited to testing the sort of change we are interested in here. For example, the pts/compress-zstd test downloads the zstd upstream source, builds it with upstream's default optimization flags and then benchmarks it – which, while interesting if you want to compare something like kernel configuration choices, is not useful if we want to analyze a change to the default flags used during a package build.
Fortunately for us, there are some tests, such as system/cryptsetup-1.01, which are designed to test packages from the distribution. There are also quite a few tests that are fairly easy to convert or hack so that they install packages from the distribution and test those instead. You can see the tests we have been running in https://github.com/canonical/ubuntu-pts-selection (and, if you are extra curious, the changes we made to test distro packages).
We would welcome contributions to add tests of more packages here!
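To make this concrete, running one of these distribution-oriented tests locally looks something like the following (a minimal sketch; phoronix-test-suite is packaged in Ubuntu, and the exact test identifiers are the ones in the selection repository above):

```
# Install the Phoronix Test Suite from the Ubuntu archive
sudo apt install phoronix-test-suite

# Install and run a test that benchmarks the distro-provided
# binaries rather than a freshly built upstream tree
phoronix-test-suite benchmark system/cryptsetup
```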
Rebuilding the distribution
Rather than comparing benchmark results from the moving target of plucky (with -O3 enabled) with the released version of oracular (which had -O2 as the default), we instead rebuilt oracular with -O3 as the default.
The first part of this is to change the dpkg source so that dpkg-buildflags will use the new flags by default, which is fairly routine packaging work, and to upload it to a PPA.
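If you want to see what this changes in practice, dpkg-buildflags can be queried directly, and an individual package can already opt in to -O3 without touching dpkg (a sketch; the exact default flags vary by release and architecture):

```
# Print the default CFLAGS that package builds receive;
# on an unmodified system the list includes -O2
dpkg-buildflags --get CFLAGS

# A single package can opt in to -O3 via its debian/rules, e.g.:
#   export DEB_CFLAGS_MAINT_APPEND = -O3
```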
Then we need to:
- make a copy of the Ubuntu archive
- add a dependency on the PPA mentioned above (see the sketch below)
- rebuild all the packages
- publish them somewhere
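For the PPA step, each build environment needs to prefer the modified dpkg over the archive version. On a plain system that would look roughly like this (the PPA name here is hypothetical):

```
# Enable the (hypothetical) PPA carrying the modified dpkg
sudo add-apt-repository ppa:example/dpkg-o3-default
sudo apt update

# Check that the PPA's dpkg is now the installation candidate
apt-cache policy dpkg
```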
The first parts of this are things that the Ubuntu project has been doing for a long time for our “ftbfs” (failed to build from source) reports. These are done a few times a cycle as a general measure of archive health and before making potentially disruptive changes like changing the default version of GCC. Historically the binary packages resulting from these builds have been discarded, but relatively recently support was added to publish the built packages as an apt archive. We can then build images from this apt archive and use them for testing.
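Consuming the published archive is then just another apt source. A sketch, with an illustrative URL:

```
# Point apt at the rebuilt archive (URL is illustrative) and
# upgrade so the test system runs the -O3 builds
echo "deb https://example.com/o3-rebuild oracular main universe" \
  | sudo tee /etc/apt/sources.list.d/o3-rebuild.list
sudo apt update && sudo apt full-upgrade
```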
Running the tests
As mentioned in the article I linked way back at the start of this post, benchmarking is a delicate art. Ideally we would run the tests on a variety of hardware (even laptops, if we can take advantage of Phoronix's parameters to let the laptop cool down between tests), but for this round of tests we “borrowed” some servers from the Foundations cloud team at Canonical. The processors in these were 4-core AMD EPYC Rome CPUs. The whole process was a bit more manual than we would have liked, something to work on next time around.
Even though we are running a fairly small selection of tests, each run of the benchmarks still takes around 48 hours to complete!
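To keep runs of that length unattended, the suite can be driven in batch mode, which records answers to the interactive prompts up front (a sketch; the test list is abbreviated):

```
# Configure batch-mode defaults once (result saving, uploads, etc.)
phoronix-test-suite batch-setup

# Run the selected tests without interactive prompts
phoronix-test-suite batch-benchmark system/cryptsetup
```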
Results!!
For our latest investigation, we re-ran the benchmarks that showed the most change in our earlier runs, and finally tried to understand what had caused the changes (which can be extremely time consuming).
A selection of observations from these benchmarks:
- xz-utils: Upstream defaults to -O2. Overriding this to -O3 showed no difference in performance, while the installed size went from 717 kB to 750 kB (a 4.6% increase; see the sketch after this list).
- tiff decompression: this test highlights the issue with the size increase. The majority of the time is spent loading shared libraries, and the test regresses by 11.9%. Other tests regressed for the same reason: inkscape, rawtherapee, libjpeg-turbo, gegl, gnupg, gzip.
- gnuradio: -O3 plus LTO resulted in the execute call into libfftw3f being inlined, which decreased overall performance by 12.7%.
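The size numbers above are easy to check on any installed system, since dpkg records the installed size of each package (a sketch; Installed-Size is reported in kibibytes):

```
# Compare on an -O2 system and an -O3 system
dpkg-query -W -f='${Package} ${Installed-Size}\n' xz-utils liblzma5
```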
We could also do some broad-brush analysis of the size impact of the change (see the sketch after the list):
- the size of all debs in the archive increases by 6% when -O3 is the default
- the size of the desktop ISO increases by a bit less than 4%.
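Both numbers fall out of simple accounting over the apt indices, which record the compressed size of every deb. For example, one component of one architecture can be summed like this (mirror URL and suite are illustrative):

```
# Sum the Size field (bytes) of every deb in a Packages index;
# repeat per component/architecture and compare the two archives
curl -s http://archive.ubuntu.com/ubuntu/dists/oracular/main/binary-amd64/Packages.gz \
  | zcat \
  | awk '/^Size:/ { total += $2 } END { printf "%.2f GiB\n", total / 1024^3 }'
```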
Conclusion and next steps
Distro-wide -O3 does not seem to provide better performance and regresses interactive tasks. It affects load times by up to 11.9%, which is detrimental to container workloads that strive for fast application startup.
Overall, this regression is mostly caused by the increase in executable size. In addition, in some situations -O3 and Link Time Optimization together can result in excessive function inlining, which in turn regresses performance (presumably through register pressure or instruction cache thrashing).
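The size effect is easy to reproduce in miniature: build the same program at both levels and compare the code sections (a sketch; demo.c stands in for any real source file, and the numbers will vary with the code and compiler version):

```
# Build the same source at -O2 and at -O3 with LTO,
# then compare the text (code) section sizes
gcc -O2 -o demo-O2 demo.c
gcc -O3 -flto -o demo-O3 demo.c
size demo-O2 demo-O3
```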
It is fair to say that all of this is in line with conventional wisdom: -O3 can help in some situations but is not a sensible default. But it is definitely better to verify this than to just rely on folk knowledge! In addition, we still want to investigate more ways to improve the performance of Ubuntu, and each time we exercise our benchmarking muscles we will get better at it.