Trying out Ubuntu 23.04 on x86-64-v3 rebuild for yourself

mwhudson · December 12, 2023, 6:01pm

NOTE: THESE IMAGES ARE NOT SUPPORTED. THEY WILL RECEIVE NO SECURITY UPDATES. DO NOT USE THEM IN PRODUCTION.

As mentioned in our recent blog post, we have rebuilt Ubuntu 23.04 targeting the x86-64-v3 ISA and made installer images from that archive. If you have capable hardware and want to try this out for yourself, you can download the installer from here.

As stated above, machines installed using this image will not receive security (or any other) updates and are absolutely not suitable for any kind of production use. That said, if you have any CPU-intensive workloads that you are able to try out, we would be extremely interested to hear about the results you get.

Let us know if you notice any improvements (or regressions), how you are testing, as much about the machine you are testing on as possible, and if you encounter any problems in your server testing.

mwhudson · December 12, 2023, 10:16pm

It is worth mentioning that the compilers in the rebuild archive do not themselves have changed default flags – the options that are used to build Ubuntu are not necessarily the same options as used to compile a user’s own code. If this post makes you wonder about the benefit of compiling your code with different flags, you can of course just add “-march=x86-64-v3 -mtune=icelake-server” to the build flags, or you can install a compiler that does have changed defaults by running:

sudo add-apt-repository ppa:mwhudson/toolchain-test-rebuild-20231018-lunar-v3-mtune-icelake-server
sudo apt update
sudo apt install gcc g++

QwertyChouskie · December 13, 2023, 12:40am

Although the idea of raising the baseline is certainly interesting, some Intel CPUs released as recently as 2021 don’t support AVX/AVX2: https://en.wikipedia.org/wiki/Tremont_(microarchitecture)

Using a dynamic binary loading mechanism based on what instructions are available would likely be a much better option.

jay-tuckey · December 13, 2023, 3:49am

I’ve just checked my home server with an AMD Athlon™ II X4 630 and it doesn’t report AVX as supported in lscpu output - I’m guessing this means it wouldn’t work?
Would be nice to be able to get it to at least 24.04 before support gets dropped, although it has a few years on 22.04 still.
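
For anyone else checking their own hardware, here is a rough sketch of how the lscpu / /proc/cpuinfo flags could be compared against the x86-64 feature levels. The flag lists below are an assumption based on the usual Linux flag names (LZCNT showing up as "abm", SSE3 as "pni"), and the function names are made up for illustration. This is not an official detection tool.

```python
# Flag names as they usually appear in the "flags" line of /proc/cpuinfo.
# Assumption: LZCNT is reported as "abm" and SSE3 as "pni" on Linux.
V2_FLAGS = {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"}
V3_FLAGS = {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "abm", "movbe", "xsave"}


def highest_level(flags_line):
    """Return the highest x86-64 level (up to v3) covered by the given flags.

    Assumes the CPU is at least a plain x86-64 (v1) machine.
    """
    flags = set(flags_line.split())
    if not V2_FLAGS <= flags:
        return "x86-64-v1"
    if not V3_FLAGS <= flags:
        return "x86-64-v2"
    return "x86-64-v3"


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the flags of the first CPU listed in /proc/cpuinfo (Linux only)."""
    with open(path) as cpuinfo:
        for line in cpuinfo:
            if line.startswith("flags"):
                return line.split(":", 1)[1]
    return ""
```

On a Linux machine, highest_level(read_cpu_flags()) should give a quick answer. A CPU that reports no avx flag at all would come out as v2 at best.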

arraybolt3 · December 13, 2023, 5:52am

I am pretty excited to see this kind of thing starting to happen, but at the same time I’m concerned. I have quite a bit of hardware here that most likely does not support v3 (but does support v2 at least). Is it going to potentially be an option to have both the v1 and v3 rebuilds existing in parallel? Arch Linux once proposed doing something similar, though they never got around to actually doing the v3 build.

(Having an older x86_64-v1 variant of Ubuntu would be particularly important for Lubuntu I would think, as while we don’t officially target older machines anymore, we definitely are still suitable for older machines, we as developers use it on older machines, we have many users who use it on older machines, and I know I at least recommend it to people with older machines.)

mwhudson · December 13, 2023, 5:44pm

I should have expected that, given we didn’t say what our plans are for the future, people would start to fill in the gaps a bit. But the reason we didn’t say anything about that is mostly that we haven’t decided where we’re going with this. We are literally just running some investigations!

What I can say with certainty is:

- Nothing is going to change wrt the amd64 baseline in 24.04.
- We are very aware of the issue of users who have hardware that can’t run v3 or even v2! Hopefully we can do something that gets the best of both worlds but like I said, no concrete plans yet.

arraybolt3 · December 13, 2023, 8:20pm

That makes perfect sense. And I’m actively doing some tests and benchmarks over here to help with those investigations.

arraybolt3 · December 14, 2023, 1:27am

Alright, I have done benchmarks and testing.

The Hardware

The primary test machine in this experiment is a Dell Optiplex 9020, featuring an Intel Core i5-4570 CPU @ 3.20 GHz, a 1 TB ADATA SU750 SATA solid-state drive, and 16 GB of 1600MHz PC3-12800 DDR3 RAM (A-Tech brand). Testing was done using both the experimental Lunar build against x86_64-v3 and stock Ubuntu 23.04. This machine is juuuuust barely new enough to run x86_64-v3 code, and I was able to boot the ISO, install it, and do tests on it without many issues (although the not-quite-complete universe build in the test archive threw a wrench in my works a couple of times).

Additionally, I threw in my laptop (a Kubuntu Focus XE Gen 1) for comparison. It features an Intel Core i5 1135g7 CPU and 32 GB of RAM (I think 3200MHz). It also has a 1 TB Samsung NVMe solid state drive. It’s running Kubuntu 22.04 (not the x86_64-v3 rebuild of 23.04). While I attempted to interfere with the benchmarks as little as possible on the primary test machine (start the processing and then don’t touch it unless you absolutely have to until it’s done), I did the benchmarks on my laptop with a full KDE session and several apps running in the background. So my laptop’s benchmarks aren’t as reliable. However, they should give you an idea of how the machine’s speedups stack up to more modern hardware.

The Tests

Three workloads were run - 7zip benchmarks, cryptsetup benchmarks, and compiling the OpenJDK 8 source package using debuild. The former two provided detailed benchmarking info, while the latter (the OpenJDK build) I timed using sudo time debuild. (I had to use sudo as the package wouldn’t build without it for some reason.) These tests were run in a relatively vanilla Ubuntu Server installation, a relatively vanilla installation of the test Lunar rebuild, and also on my laptop which had very little special prep done before running the benchmarks.

Stats

Optiplex, normal Ubuntu Server 23.04, 7zip benchmark results:

7-Zip (z) 22.01 (x64) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15
 64-bit locale=en_US.UTF-8 Threads:4

Compiler: 12.2.0 GCC 12.2.0
Linux : 6.2.0-39-generic : #40-Ubuntu SMP PREEMPT_DYNAMIC Tue Nov 14 14:18:00 UTC 2023 : x86_64
PageSize:4KB THP:madvise hwcap:2 hwcap2:2
Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz (306C3)

1T CPU Freq (MHz):  2706  3299  3576  3585  3586  3585  3586
2T CPU Freq (MHz): 199% 3567   200% 3569

RAM size:   15893 MB,  # CPU hardware threads:   4
RAM usage:    889 MB,  # Benchmark threads:      4

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:      19794   345   5582  19256  |     174783   400   3732  14911
23:      20222   366   5627  20604  |     172073   400   3725  14889
24:      18557   365   5467  19953  |     169467   400   3718  14872
25:      17914   370   5525  20454  |     165951   399   3706  14770
----------------------------------  | ------------------------------
Avr:     19122   362   5550  20067  |     170569   399   3720  14861
Tot:             381   4635  17464

Optiplex, x86_64-v3 Lunar rebuild, 7zip benchmark results:

7-Zip (z) 22.01 (x64) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15
 64-bit locale=en_US.UTF-8 Threads:4

Compiler: 12.2.0 GCC 12.2.0
Linux : 6.2.0-20-generic : #20-Ubuntu SMP PREEMPT_DYNAMIC Thu Oct 19 21:30:29 UTC 2023 : x86_64
PageSize:4KB THP:madvise hwcap:2 hwcap2:2
Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz (306C3)

1T CPU Freq (MHz):  2647  3211  3581  3590  3588  3588  3587
2T CPU Freq (MHz): 199% 3566   200% 3570

RAM size:   15885 MB,  # CPU hardware threads:   4
RAM usage:    889 MB,  # Benchmark threads:      4

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:      20486   354   5634  19929  |     173682   399   3709  14817
23:      20126   368   5571  20506  |     171373   400   3709  14829
24:      19993   373   5761  21497  |     168567   400   3701  14793
25:      19314   377   5844  22053  |     165046   400   3676  14689
----------------------------------  | ------------------------------
Avr:     19980   368   5702  20996  |     169667   400   3699  14782
Tot:             384   4700  17889

KFocus XE, 7zip benchmark results:

7-Zip (z) 21.07 (x64) : Copyright (c) 1999-2021 Igor Pavlov : 2021-12-26
 64-bit locale=en_US.UTF-8 Threads:8

Compiler: 11.2.0 GCC 11.2.0
Linux : 6.2.0-35-generic : #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Oct  6 10:23:26 UTC 2 : x86_64
PageSize:4KB THP:madvise hwcap:6 hwcap2:2
11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz (806C1)

1T CPU Freq (MHz):  4085  4129  4171  4156  4181  4159  4181
4T CPU Freq (MHz): 393% 3694   399% 3787

RAM size:   31867 MB,  # CPU hardware threads:   8
RAM usage:   1779 MB,  # Benchmark threads:      8

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:      26902   721   3632  26171  |     206327   791   2224  17594
23:      24302   698   3546  24761  |     202496   790   2216  17516
24:      24940   741   3617  26816  |     199341   793   2204  17490
25:      23687   721   3752  27046  |     192679   782   2191  17144
----------------------------------  | ------------------------------
Avr:     24958   720   3637  26199  |     200211   789   2209  17436
Tot:             755   2923  21817

Optiplex, normal Ubuntu Server 23.04, cryptsetup benchmark results:

# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1      1263344 iterations per second for 256-bit key
PBKDF2-sha256    1685813 iterations per second for 256-bit key
PBKDF2-sha512    1199743 iterations per second for 256-bit key
PBKDF2-ripemd160  731224 iterations per second for 256-bit key
PBKDF2-whirlpool  513001 iterations per second for 256-bit key
argon2i       7 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
argon2id      7 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
#     Algorithm |       Key |      Encryption |      Decryption
        aes-cbc        128b       666.6 MiB/s      2698.9 MiB/s
    serpent-cbc        128b        91.3 MiB/s       581.6 MiB/s
    twofish-cbc        128b       192.5 MiB/s       372.7 MiB/s
        aes-cbc        256b       505.1 MiB/s      2113.6 MiB/s
    serpent-cbc        256b        92.9 MiB/s       580.8 MiB/s
    twofish-cbc        256b       195.3 MiB/s       372.7 MiB/s
        aes-xts        256b      2398.2 MiB/s      2404.5 MiB/s
    serpent-xts        256b       532.0 MiB/s       522.2 MiB/s
    twofish-xts        256b       344.6 MiB/s       347.7 MiB/s
        aes-xts        512b      1899.7 MiB/s      1897.5 MiB/s
    serpent-xts        512b       537.5 MiB/s       521.9 MiB/s
    twofish-xts        512b       348.0 MiB/s       347.2 MiB/s

Optiplex, x86_64-v3 Lunar rebuild, cryptsetup benchmark results:

# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1      1248304 iterations per second for 256-bit key
PBKDF2-sha256    1669707 iterations per second for 256-bit key
PBKDF2-sha512    1223542 iterations per second for 256-bit key
PBKDF2-ripemd160  758738 iterations per second for 256-bit key
PBKDF2-whirlpool  511001 iterations per second for 256-bit key
argon2i       7 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
argon2id      7 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
#     Algorithm |       Key |      Encryption |      Decryption
        aes-cbc        128b       666.9 MiB/s      2695.0 MiB/s
    serpent-cbc        128b        91.8 MiB/s       580.0 MiB/s
    twofish-cbc        128b       191.5 MiB/s       372.1 MiB/s
        aes-cbc        256b       504.6 MiB/s      2110.6 MiB/s
    serpent-cbc        256b        93.1 MiB/s       580.9 MiB/s
    twofish-cbc        256b       195.2 MiB/s       371.5 MiB/s
        aes-xts        256b      2395.7 MiB/s      2402.4 MiB/s
    serpent-xts        256b       530.7 MiB/s       521.5 MiB/s
    twofish-xts        256b       345.6 MiB/s       347.9 MiB/s
        aes-xts        512b      1896.3 MiB/s      1895.3 MiB/s
    serpent-xts        512b       536.9 MiB/s       521.8 MiB/s
    twofish-xts        512b       348.3 MiB/s       347.1 MiB/s

KFocus XE cryptsetup benchmark results:

# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1      2335358 iterations per second for 256-bit key
PBKDF2-sha256    4088015 iterations per second for 256-bit key
PBKDF2-sha512    1646116 iterations per second for 256-bit key
PBKDF2-ripemd160  967321 iterations per second for 256-bit key
PBKDF2-whirlpool  672164 iterations per second for 256-bit key
argon2i       7 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
argon2id      7 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
#     Algorithm |       Key |      Encryption |      Decryption
        aes-cbc        128b      1620.4 MiB/s      5936.8 MiB/s
    serpent-cbc        128b        98.1 MiB/s       371.2 MiB/s
    twofish-cbc        128b       246.1 MiB/s       453.3 MiB/s
        aes-cbc        256b      1243.1 MiB/s      4794.6 MiB/s
    serpent-cbc        256b       105.1 MiB/s       370.6 MiB/s
    twofish-cbc        256b       251.9 MiB/s       452.6 MiB/s
        aes-xts        256b      4799.7 MiB/s      4808.6 MiB/s
    serpent-xts        256b       346.0 MiB/s       347.9 MiB/s
    twofish-xts        256b       416.9 MiB/s       419.4 MiB/s
        aes-xts        512b      4247.2 MiB/s      4250.8 MiB/s
    serpent-xts        512b       370.9 MiB/s       350.6 MiB/s
    twofish-xts        512b       417.2 MiB/s       416.4 MiB/s

Optiplex, normal Ubuntu Server 23.04, OpenJDK 8 build times (debuild of openjdk-8 source package from Lunar):

16035.05user 1197.54system 1:34:15elapsed 304%CPU (0avgtext+0avgdata 7030872maxresident)k
3521864inputs+67711640outputs (2900major+258587136minor)pagefaults 0swaps

Optiplex, x86_64-v3 Lunar rebuild, OpenJDK 8 build times (debuild of openjdk-8 source package from Lunar):

16011.88user 1189.26system 1:33:29elapsed 306%CPU (0avgtext+0avgdata 6372128maxresident)k
2098064inputs+67909208outputs (9429major+256627718minor)pagefaults 0swaps

No OpenJDK 8 build test was done on the KFocus XE as my build environment on my laptop is not suitable for benchmarking.

Comparison

A spreadsheet with the benchmark data is available here (ODS format): https://drive.google.com/file/d/1mmHZhI_sRq0ont0G9p-CCiMT0jYzLOBl/view?usp=sharing

7zip Benchmark:

Percentages are ratios of KiB/s, higher is better.

Optiplex, normal Ubuntu Server 23.04 (baseline):
- Compression: 100%, Decompression: 100%
Optiplex, x86_64-v3 Lunar rebuild:
- Compression: 104.49%, Decompression: 99.47%
KFocus XE:
- Compression: 130.52%, Decompression: 117.38%
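
For reference, the percentages above can be reproduced from the "Avr:" rows of the three 7zip runs. The snippet below is just a sketch of that arithmetic; the dictionary layout and function name are mine, while the numbers are copied from the benchmark output.

```python
# Average compression / decompression speeds in KiB/s, taken from the
# "Avr:" row of each 7zip benchmark run.
AVERAGES = {
    "Optiplex, normal Ubuntu Server 23.04": (19122, 170569),
    "Optiplex, x86_64-v3 Lunar rebuild": (19980, 169667),
    "KFocus XE": (24958, 200211),
}
BASELINE = "Optiplex, normal Ubuntu Server 23.04"


def percent_of_baseline(value, baseline):
    """Express value as a percentage of baseline, rounded to two decimals."""
    return round(100.0 * value / baseline, 2)


base_comp, base_decomp = AVERAGES[BASELINE]
for name, (comp, decomp) in AVERAGES.items():
    print(f"{name}: Compression {percent_of_baseline(comp, base_comp):.2f}%, "
          f"Decompression {percent_of_baseline(decomp, base_decomp):.2f}%")
```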

Cryptsetup Benchmark:

argon2i and argon2id iteration counts were identical across all three benchmarks and are therefore omitted.

Optiplex, normal Ubuntu Server 23.04 (baseline):
- All algorithms: 100%
Optiplex, x86_64-v3 Lunar rebuild:
- Hashing (ratios of iterations per second, higher is better)
  - PBKDF2-sha1: 98.81%
  - PBKDF2-sha256: 99.04%
  - PBKDF2-sha512: 101.98%
  - PBKDF2-ripemd160: 103.76%
  - PBKDF2-whirlpool: 99.61%
- Cryptography (ratios of MiB/s, higher is better)
  - aes-cbc 128b: Encryption 100.05%, Decryption 99.86%
  - serpent-cbc 128b: Encryption 100.55%, Decryption 99.72%
  - twofish-cbc 128b: Encryption 99.48%, Decryption 99.84%
  - aes-cbc 256b: Encryption 99.90%, Decryption 99.86%
  - serpent-cbc 256b: Encryption 100.22%, Decryption 100.02%
  - twofish-cbc 256b: Encryption 99.95%, Decryption 99.68%
  - aes-xts 256b: Encryption 99.90%, Decryption 99.91%
  - serpent-xts 256b: Encryption 99.76%, Decryption 99.87%
  - twofish-xts 256b: Encryption 100.29%, Decryption 100.06%
  - aes-xts 512b: Encryption 99.82%, Decryption 99.88%
  - serpent-xts 512b: Encryption 99.89%, Decryption 99.98%
  - twofish-xts 512b: Encryption 100.09%, Decryption 99.97%
KFocus XE:
- Hashing (ratios of iterations per second, higher is better)
  - PBKDF2-sha1: 184.86%
  - PBKDF2-sha256: 242.50%
  - PBKDF2-sha512: 137.21%
  - PBKDF2-ripemd160: 132.29%
  - PBKDF2-whirlpool: 131.03%
- Cryptography (ratios of MiB/s, higher is better)
  - aes-cbc 128b: Encryption 243.08%, Decryption 219.97%
  - serpent-cbc 128b: Encryption 107.45%, Decryption 63.82%
  - twofish-cbc 128b: Encryption 127.84%, Decryption 121.63%
  - aes-cbc 256b: Encryption 246.11%, Decryption 226.85%
  - serpent-cbc 256b: Encryption 113.13%, Decryption 63.81%
  - twofish-cbc 256b: Encryption 128.98%, Decryption 121.44%
  - aes-xts 256b: Encryption 200.14%, Decryption 199.98%
  - serpent-xts 256b: Encryption 65.04%, Decryption 66.62%
  - twofish-xts 256b: Encryption 120.98%, Decryption 120.62%
  - aes-xts 512b: Encryption 223.57%, Decryption 224.02%
  - serpent-xts 512b: Encryption 69.00%, Decryption 67.18%
  - twofish-xts 512b: Encryption 119.89%, Decryption 119.93%
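
If anyone wants to redo this comparison with their own numbers, the cipher rows of the cryptsetup benchmark output can be parsed with a small script. This is only a sketch: the regular expression assumes the row layout shown in the outputs above, and the function names are hypothetical. The sample rows are copied from the baseline and KFocus XE results.

```python
import re

# Matches rows such as "aes-cbc  128b  666.6 MiB/s  2698.9 MiB/s".
# Assumption: rows always follow the layout seen in the output above.
ROW = re.compile(r"^\s*(\S+)\s+(\d+)b\s+([\d.]+) MiB/s\s+([\d.]+) MiB/s\s*$")


def parse_ciphers(text):
    """Map (algorithm, key bits) to (encryption MiB/s, decryption MiB/s)."""
    results = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            name, bits, enc, dec = match.groups()
            results[(name, int(bits))] = (float(enc), float(dec))
    return results


def compare(baseline_text, other_text):
    """Return each cipher's speeds as percentages of the baseline speeds."""
    baseline = parse_ciphers(baseline_text)
    other = parse_ciphers(other_text)
    return {
        key: (round(100 * other[key][0] / enc, 2),
              round(100 * other[key][1] / dec, 2))
        for key, (enc, dec) in baseline.items()
        if key in other
    }


BASELINE_ROWS = """
#     Algorithm |       Key |      Encryption |      Decryption
    serpent-cbc        128b        91.3 MiB/s       581.6 MiB/s
        aes-xts        256b      2398.2 MiB/s      2404.5 MiB/s
"""
KFOCUS_ROWS = """
#     Algorithm |       Key |      Encryption |      Decryption
    serpent-cbc        128b        98.1 MiB/s       371.2 MiB/s
        aes-xts        256b      4799.7 MiB/s      4808.6 MiB/s
"""

for (name, bits), (enc, dec) in compare(BASELINE_ROWS, KFOCUS_ROWS).items():
    print(f"{name} {bits}b: Encryption {enc:.2f}%, Decryption {dec:.2f}%")
```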

OpenJDK 8 Build Time

Optiplex, normal Ubuntu Server 23.04 (baseline):
- All stats: 100%
Optiplex, x86_64-v3 Lunar rebuild:
- Time (ratios of number of seconds, lower is better)
  - user: 99.86%
  - system: 99.31%
  - elapsed: 99.19%
- Misc
  - CPU usage (ratio, higher is probably better): 100.66%
  - maxresident k (lower is probably better): 90.63%
  - inputs (don’t know what this means): 59.57%
  - outputs (don’t know what this means): 100.29%
  - major pagefaults (lower is probably better): 325.14%
  - minor pagefaults (lower is probably better): 99.24%
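
The elapsed ratio needs the wall-clock times converted to seconds first. A minimal sketch of that step, assuming the h:mm:ss format that time printed in the outputs above (the helper name is made up):

```python
def elapsed_seconds(text):
    """Convert an elapsed time such as "1:34:15" (or "34:15") to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


baseline = elapsed_seconds("1:34:15")  # normal Ubuntu Server 23.04
rebuild = elapsed_seconds("1:33:29")   # x86_64-v3 Lunar rebuild

# Lower is better: the rebuild took this percentage of the baseline time.
print(f"elapsed: {100 * rebuild / baseline:.2f}%")
print(f"user: {100 * 16011.88 / 16035.05:.2f}%")
print(f"system: {100 * 1189.26 / 1197.54:.2f}%")
```

In absolute terms that works out to a 46 second difference on a build of roughly an hour and a half.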

Conclusion

In its current state, the performance of the x86_64-v3 rebuild of Lunar (in at least the tested workloads) is underwhelming. Some slight performance increases were seen in some areas, with some slight and likely negligible decreases in others. More research and testing may be needed for this endeavour to achieve its intended goal of a faster Ubuntu.

Comments

Yes, I actually built OpenJDK 8 of all things for my benchmarking. Why? I wanted to build some large codebase to see how fast it went. The build deps for KWin weren’t installable on the x86_64-v3 test rebuild, neither were the deps for the Linux kernel. OpenJDK 8 however had all the deps it needed, so that’s what I built.

I tried to boot the x86_64-v3 rebuild on a couple of machines I knew were too old (an HP Chromebook G4 something-or-other, and an HP Elitebook 8570p) just to see what would happen. Both of them just stuck at a black screen and never unstuck. No error messages, no segfaults, no kernel panic, no fan spinup, no beeps, and no blinking LEDs. Just a black screen. (I do note however that pressing Caps Lock or Num Lock on the Elitebook while the boot was stuck did not result in the lights turning on or off. Additionally, on the Chromebook, the light on the flash drive went off and stayed off, on the Elitebook the light stayed on and never went off. And lastly, on the Chromebook, the screen was solid black, on the Elitebook a white cursor stuck in the upper-left corner, not blinking.) It might be useful to somehow integrate tests for x86_64-v3 support into the finished ISOs if we do end up making x86_64-v3 a supported separate architecture, so that an error can be displayed to the user if their hardware is too old.

I have no clue why my desktop so thoroughly stomped my much newer laptop in most of the serpent encryption benchmarks. Even the normal Ubuntu 23.04 Server left my laptop in the dust in that area (without the x86_64-v3 changes).

thetick · December 14, 2023, 10:51am

Argh sorry double post

thetick · December 14, 2023, 11:34am

I think your results are expected.

On very early v3 designs such as 4th Gen Intel processors, AVX and AVX2 instructions run at half the clock speed, causing the entire core to run at half speed. For early v3 processors, v3 builds are not good in general. If you run HPC benchmarks you would likely see a slightly smaller performance hit with v3. Hmm, maybe the Xeon chips for these generations run at full clock speed?

On recent chips like 11th Gen Intel and Ryzen, I seem to recall that the AVX and AVX2 engines run at full speed, so you should see some nice benchmark improvements for some tasks.

Honestly Linus has been right for decades: these vector instructions are for winning benchmark awards. If you have massive vector calculations in HPC then use a GPU.

There are some nice newer instructions like AES-NI for much improved crypto / (de)compression. Running these early v3 processors at full speed with SSE2/3/4 is better for any small number of CPU vector instructions than running the CPU at half speed.

I’m not even going to bother running my workloads as they are mostly IPC (little encryption and no need for vector instructions) and will have some serious v3 performance hits on the older generation chips. If it’s bad I can always fake out /proc/cpuinfo and remove AVX detection so machines are falsely reported as v2. Some HPC setups do this for high IPC as vector calculations are offloaded to GPUs.

Here is a reference about Haswell (4th Generation). It looks like this was fixed in 5th Gen Intel Broadwell:
https://en.wikichip.org/wiki/intel/frequency_behavior#Base.2C_Non-AVX_Turbo.2C_and_AVX_Turbo

Here is an explanation of why the same issue is back for v4 with Skylake (8th Gen). This is why v4 is off topic for this discussion and why 12th+ Gen Intel processors don’t have AVX512.
https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

arraybolt3 · December 15, 2023, 3:22am

oof, well that would make sense. Perhaps I should do some extra tests on my primary laptop and see what happens. If it really is just the fault of using barely-new-enough hardware, then that would explain a lot.

haroldw · December 16, 2023, 8:52pm

This makes me think that I need to test the v3 build on my old Latitude E6440 (Intel i7-4600M). Thankfully it hasn’t been my main PC since 2020 but I still run Ubuntu on it, and I still use it as a backup for my data (having put a recent SSD in it) and for its DVD drive.

sarnold · December 27, 2023, 4:20am

That script is shockingly gross. Whatever it is that it’s supposed to be doing, there’s got to be vastly better ways to do it.

arraybolt3 · December 27, 2023, 4:25am

And it looks dangerous on top of it all. I can’t read most of it well enough to know exactly what it does, but the sheer number of rm -rf commands in what look like sensitive directories (/usr/lib/firmware and /usr/include in particular) looks like it could cause serious damage, and it apparently converts images to grayscale for some reason. It is also Arch-specific, as evidenced by the pacman calls. This has nothing to do with Ubuntu and looks likely to cause serious breakage.

ian-weisser · December 27, 2023, 4:49am

MOD NOTE: The post referred to (“That script” and “it looks dangerous”) in the two most recent posts has been removed.

Thanks to the folks who flagged it.

johnandmegh · December 28, 2023, 4:28am

In case it’s helpful, the article below seems pretty well related to this topic and might prompt some further discussion?

https://www.phoronix.com/review/ubuntu-x86-64-v3-benchmark