Intro
This is a report of getting an RTX 6000 Pro to work in Ubuntu using an atypical setup: via a Thunderbolt dock into a Framework Desktop. This is not a tutorial, both because what is presented here is relatively risky and could render the machine inoperable, and because the instructions at times only hint at what needs to be done rather than hand-holding. The report starts with some personal context and the Framework Desktop, and then dives into solving the GPU setup.
The Framework Desktop
Last year I took a departure from my longstanding laptop-focused personal device rule and got myself a Framework Desktop. I mainly wanted to experiment with the AMD Strix Halo, with its 128GB of unified LPDDR5x-8000 memory and relatively fast integrated GPU (fast considering the context), and Framework did a great job on the package, with it being so small (it's actually a Mini-ITX mainboard) and almost completely silent with the Noctua fan option. So it's a surprisingly powerful device for what it is, and still keeps the mobility I care for (it even has a handle!).
Fast forward some months, and the experiment went well. The Strix Halo can actually be used for most things I cared to experiment with, such as medium-sized LLM models and image/audio/video generation. Gemma 27B at BF16, which is around 50GB, works fine, for instance. We won't get pages of text at a time, but it's fast enough to have an interactive chat without much frustration. LTX-2 can also produce up to 20 second videos with audio and lip-sync in a surprisingly short amount of time, and kudos to them on this, as no other "open" weight model is nearly as fast, nor does lip-sync as well for now.
TIPS:
¹ With the current kernel in 25.10, we need amdgpu.cwsr_enable=0 in the cmdline, as otherwise we'll get kernel panics every other minute while trying to use the GPU for anything meaningful. The team is aware and the fix is going to be out any time now.
² Use the ttm module (Translation Table Manager) to set up actual unified memory on Linux, instead of splitting the GPU memory. It works very well. Specifically, we want the pages_limit and page_pool_size options.
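As a rough sketch of what that tip means in practice (the 128GiB target and the modprobe.d filename below are my assumptions; both options are expressed in 4KiB pages):

```shell
#!/bin/bash
# Sketch: let ttm back (nearly) all of the 128GiB unified memory as GTT.
# pages_limit and page_pool_size are counted in 4KiB pages.
PAGES=$(( 128 * 1024 * 1024 * 1024 / 4096 ))
echo "options ttm pages_limit=$PAGES page_pool_size=$PAGES"
# → options ttm pages_limit=33554432 page_pool_size=33554432
```

The emitted line would go into something like /etc/modprobe.d/ttm.conf, followed by an initramfs rebuild if ttm is loaded early in boot.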
Going beyond
Once the initial experimentation was successful, it was time to go one step forward and start training some models, and that's where the difference between the specifications of the compact Strix Halo and the more powerful generation of current GPUs becomes clearly noticeable. I won't go into details of the task at hand right now as it's not the focus here, but while batch-processing a single document out of a few thousand, I need to break it down into chunks formatted for processing, and a single sample I picked for experimentation has over 400 chunks. With a very rough estimate, if it took 10 seconds on average per chunk, it's several months of processing non-stop, for a single try. Doesn't work.
So I went shopping for options, at the worst possible moment in terms of prices for GPUs and memory in general. It's not just very expensive but also quite hard to find. In the end I went for the RTX 6000 Pro, as it was the one I could find (a couple of countries away from me), and even being as expensive as it is (probably because of it), it hasn't actually seen its price rise significantly compared to last year. Also, at 96GB of GDDR7, it's got the largest amount of memory available for a "desktop" GPU. Beyond that we are in server land, tens of thousands, and noise.
So I put in the order. I was anxious until the small package arrived in good condition, I must admit.
Going external
This will be a controversial choice for many, I'm sure, as the forums are filled with frustration and strong recommendations against going this route, and this report is evidence that they are right. Besides the hassle, the bandwidth between the computer and the GPU is of course subpar compared with an internal PCIe slot.
But, this is really the setup I wanted. To begin with, I don't even have a proper chassis to fit a board this size into, nor do I want one. Second, the Framework Desktop fulfills most needs with 128GB of fast RAM and fast storage, and third, in an ideal scenario I want the board to be able to roam across machines instead of being locked inside a single large desktop. Finally, the bandwidth isn't an issue for me, as my use cases typically mean sending large models to the device, and then they stay there.
So that is the motivation, and with it I went shopping again and found two compelling options just released in the last couple of months: the DEG1 from Minisforum, and the EG2 from Aoostar. They are both Thunderbolt 5 devices with an option for an Oculink connection. They retail for about the same price, but the DEG1 has additional connectivity (more USB ports, a 2.5G ethernet, and an M2 slot), so I went with this as a start.
One relevant detail: I don't actually have a Thunderbolt 5 computer, only two Thunderbolt 4 ones. Perhaps this will make a difference in some of these details, but I cannot tell right now.
So the dock arrived, the board arrived, the very silent 1500W PSU arrived, and it was time to put it all together.
How to not make it work
I was expecting it not to work early on, but wasn't expecting it to fail so harshly and so stubbornly. The whole machine immediately crashes as soon as the external device is turned on and we try to touch the driver in any way. If we go looking for information online, most discussions will eventually point to some combination of these kernel arguments:
pci=realloc
pci=hpbussize=0x33
pci=hpmmioprefsize=128G
pci=hpmemsize=128G
pcie_ports=native
These options are indeed on the right track, but I've spent more hours than I'm willing to admit testing these out, and they don't actually solve the problem. At least not with this board and this amount of VRAM on this system.
So what follows is my understanding of what is actually happening, but first let me put a note here:
DISCLAIMER
I'm far from being a developer in this area, so take all of this with a grain of salt. If you reproduce the reported steps you will be fiddling with an expensive device in ways that may render it temporarily or permanently broken. Assume I have no idea what I'm doing, and you'll be doing the same.
In particular, if you play with BARs (directly or indirectly) in kernels prior to 6.19, there's a high chance that you'll end up with the PCI configuration in a corrupted state, and rebooting won't help. If that happens to you, try resetting the motherboard (CMOS). Kernel 6.19, which ships in the upcoming 26.04 and is already available in the in-development archives, significantly improves stability in this exact area and prevents many cases of corruption. My tests were done with a 25.10 installation, which got repeatedly corrupted, and later with the kernel and nvidia modules from 26.04-dev, which was much easier to work with.
Kernel cmdline and module options
Before diving in, here are my final working kernel cmdline options:
pci=realloc,hpbussize=0x33 pcie_aspm.policy=performance iommu=pt
At a high level:
- pci=hpbussize - Reserve more bus numbers so we can hotplug more devices (why 0x33? everyone uses it, but the exact value is probably not relevant)
- pci=realloc - Let the kernel try its best at giving devices what they want (spoiler: it doesn't always manage)
- pcie_aspm.policy=performance - Ask the PCI power management subsystem to never slow down (spoiler: it does)
- iommu=pt - Get out of the way of the IOMMU hardware when possible
Kernel module options:
options nvidia NVreg_EnableResizableBar=1
blacklist nvidia
blacklist nvidia_drm
blacklist nvidia_uvm
blacklist nvidia_modeset
blacklist snd_hda_intel
The resizable BAR option is discussed in depth below, and the blacklisting allows us to work on the PCI subsystem before the drivers kick in. We'll load them manually later.
Hotplugging PCI Express
With that out of the way, here is what seems to be happening: when you plug the PCI board in via Thunderbolt, you're making use of a PCI bridge in the host system that is reserved for hotplugging. That bridge has a certain amount of address space allocated to it, so that it can map the memory of unknown incoming devices. The problem is that this mechanism was not originally designed to operate with devices carrying hundreds of GBs of memory, so all the common wisdom from the last many years, and the software that was designed around it, is failing.
The information from /proc/iomem is one of the key ways you can investigate this system area. The Framework Desktop has a section that looks similar to this:
4000000000-a0200fffff : PCI Bus 0000:00
4000000000-5fffffffff : PCI Bus 0000:60
6000000000-7fffffffff : PCI Bus 0000:01
8000000000-801fffffff : PCI Bus 0000:c3
8000000000-801fffffff : 0000:c3:00.0
8020000000-80200fffff : PCI Bus 0000:c4
8020000000-802007ffff : 0000:c4:00.1
The information here says that we have some empty bridges with 128GB each (40… => 5f…). We know these are the relevant bridges because we'll see activity when the GPU is connected, but we can also look further into them. With lspci -vt we can see the hierarchy of the bridges in the system, for instance:
-[0000:00]-+-00.0 Advanced Micro Devices, Inc. [AMD] Strix/Strix Halo Root Complex
+-00.2 Advanced Micro Devices, Inc. [AMD] Strix/Strix Halo IOMMU
+-01.0 Advanced Micro Devices, Inc. [AMD] Strix/Strix Halo Dummy Host Bridge
+-01.1-[01-5f]--
+-01.2-[60-be]--
+-02.0 Advanced Micro Devices, Inc. [AMD] Strix/Strix Halo Dummy Host Bridge
...
Here we see that 01 and 60 are directly under the root, and their responsible devices are respectively 00:01.1 and 00:01.2. Let's investigate them with lspci -vs <id>:
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Strix/Strix Halo PCIe USB4 Bridge (rev 02) (prog-if 00 [Normal decode])
Subsystem: Advanced Micro Devices, Inc. [AMD] Strix/Strix Halo PCIe USB4 Bridge
Flags: bus master, fast devsel, latency 0, IRQ 33, IOMMU group 2
Bus: primary=00, secondary=01, subordinate=5f, sec-latency=0
I/O behind bridge: 7000-afff [size=16K] [16-bit]
Memory behind bridge: 98000000-afffffff [size=384M] [32-bit]
Prefetchable memory behind bridge: 6000000000-7fffffffff [size=128G] [32-bit]
...
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Strix/Strix Halo PCIe USB4 Bridge (rev 02) (prog-if 00 [Normal decode])
Subsystem: Advanced Micro Devices, Inc. [AMD] Strix/Strix Halo PCIe USB4 Bridge
Flags: bus master, fast devsel, latency 0, IRQ 34, IOMMU group 3
Bus: primary=00, secondary=60, subordinate=be, sec-latency=0
I/O behind bridge: 3000-6fff [size=16K] [16-bit]
Memory behind bridge: 80000000-97ffffff [size=384M] [32-bit]
Prefetchable memory behind bridge: 4000000000-5fffffffff [size=128G] [32-bit]
...
Some key details there:
Strix Halo PCIe USB4 Bridge
Prefetchable memory behind bridge: 6000000000-7fffffffff [size=128G]
In other words, these are the USB4 PCIe bridges and each of them has 128GB of prefetchable address space for whatever is plugged in. Sounds great as we only need 96GB.
There's also another small detail here that is easy to miss, but points to an important characteristic of that whole system: the first device (00:01.1) is using address space (60… => 7f…), while the second one is using (40… => 5f…). In other words: the address space allocations do not necessarily follow the device ordering, and the hierarchy we see under /proc/iomem does not necessarily match the actual bridge hierarchy. If we forget this, it's very easy to make wrong assumptions.
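As a quick sanity check on those numbers (remembering that the ranges in /proc/iomem and lspci are inclusive), the window sizes can be computed directly:

```shell
#!/bin/bash
# The ranges are inclusive, so size = end - start + 1.
size_gib() { echo "$(( (0x$2 - 0x$1 + 1) / 1024 / 1024 / 1024 ))GiB"; }
size_gib 6000000000 7fffffffff   # window behind 00:01.1 → 128GiB
size_gib 4000000000 5fffffffff   # window behind 00:01.2 → 128GiB
```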
Anyway, now that we have an idea of what the standard address space looks like, let's plug the GPU in and see what happens. Reminder: we blacklisted the nvidia* and sound modules earlier so that the PCI device is left alone while we play with it.
4000000000-a0200fffff : PCI Bus 0000:00
4000000000-5fffffffff : PCI Bus 0000:60
6000000000-7fffffffff : PCI Bus 0000:01
6000000000-7fffffffff : PCI Bus 0000:02
6000000000-6017ffffff : PCI Bus 0000:03
6000000000-600fffffff : 0000:03:00.0
6010000000-6011ffffff : 0000:03:00.0
6018000000-65654fffff : PCI Bus 0000:04
6565500000-6ab29fffff : PCI Bus 0000:37
8000000000-801fffffff : PCI Bus 0000:c3
8000000000-801fffffff : 0000:c3:00.0
8020000000-80200fffff : PCI Bus 0000:c4
8020000000-802007ffff : 0000:c4:00.1
...
Wait, what… why do we have several new bridges now?
Indeed, this is an important part of the problem. The Thunderbolt PCIe dock is actually reporting multiple PCI bridges, so our scarce address space is now even scarcer by being shared among several bridges. This also makes it clear why the current way in which Linux offers manual fine tuning is unsuitable for these devices: hpmmioprefsize is a massive dial that says "reserve that much address space for me on hotplug bridges", but what does that even mean in this situation? We don't want 128GB on every one of these bridges, we want 128GB for one particular device.
It would make sense to think that using pci=realloc would fix those issues, as that's what it says on the tin:
realloc= Enable/disable reallocating PCI bridge resources if allocations done by BIOS are too small to accommodate resources required by all child devices.
Unfortunately, it looks like pci=realloc does not like to reallocate across bridges. At least nothing I've done could convince it to do that, and we end up fighting these messages in the logs, even with bridges that are apparently empty:
[ 20.844855] nvidia 0000:03:00.0: BAR 1 [mem 0x6000000000-0x600fffffff 64bit pref]: releasing
[ 20.845112] nvidia 0000:03:00.0: BAR 3 [mem 0x6010000000-0x6011ffffff 64bit pref]: releasing
[ 20.845243] pcieport 0000:02:00.0: bridge window [mem 0x6000000000-0x7fffffffff 64bit pref]: releasing
[ 20.845244] pcieport 0000:01:00.0: bridge window [mem 0x6000000000-0x7fffffffff 64bit pref]:
was not released (still contains assigned resources)
[ 20.845246] pcieport 0000:00:01.1: bridge window [mem 0x6000000000-0x7fffffffff 64bit pref]:
was not released (still contains assigned resources)
[ 20.845249] pcieport 0000:01:00.0: Assigned bridge window [mem 0x6000000000-0x7fffffffff 64bit pref] to [bus 02-5f]
cannot fit 0x3000000000 required for 0000:02:00.0 bridging to [bus 03]
There's a chance that asking the kernel to ignore the BIOS bridge windows altogether with pci=nocrs could fix this for someone, but at least on the Framework Desktop the system breaks down in multiple ways and stops booting altogether, so it's not an option here.
There's another important complicating factor that deserves its own section:
Resizable BARs
There are important details left out of the notes above for simplicity, but they are already hinted at in the data itself: the board is not in fact asking for 128GB initially, but instead for only 256MB, and it's getting it (6000… => 600f…). We can confirm that by listing the device itself with lspci -vs 03:00.0, and we get something similar to this:
03:00.0 VGA compatible controller: NVIDIA Corporation GB202GL [RTX PRO 6000 Blackwell Workstation Edition] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 204b
Flags: bus master, fast devsel, latency 0, IRQ 192, IOMMU group 38
Memory at 98000000 (32-bit, non-prefetchable) [size=64M]
Memory at 6000000000 (64-bit, prefetchable) [size=256M]
Memory at 6010000000 (64-bit, prefetchable) [size=32M]
I/O ports at 7000 [size=128]
...
We can think of resizable BARs as precisely the mechanism that allows such address-hungry devices to still function in a world that wasn't ready for them. You can talk to the device, and you can use it for certain purposes just fine, but you cannot operate its memory at speed unless you plug it into the IOMMU properly. So the fact it's not asking for 128GB upfront is a good thing³. The problem is that the device then asks the kernel for more, and cannot get it.
³ There's actually a way to force it to request 128GB upfront, with no BAR resizing. You'll need the Display Mode Selector tool from Nvidia, and this operation changes the board persistently. I was not brave enough to try this, as clearly my hardware does not like giving away 128GB chunks of address space, and I was afraid that I would get myself into a much worse position.
The other complicating factor is that BAR1, the one that wants to resize to 128GB, is not the only prefetchable address range the GPU is asking for. We have BAR3 after it, asking for 32MB as well. The size is completely irrelevant in this context, but it creates a larger issue with BAR resizing, because the resize happens in powers of two. So we have 96GB, which becomes 128GB by itself, and the silly 32MB pushes it beyond what any of the slots we've seen above are offering. Indeed, if we look again at the realloc failure logs above, we'll see the kernel is attempting to reallocate the slot to 0x3000000000, which is 192GB, not 128GB. I haven't looked at the code, but that's the only explanation I have so far.
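The arithmetic behind that guess can be sketched out (this is my reading of the numbers, not something taken from the kernel source):

```shell
#!/bin/bash
# BAR sizes come in powers of two, so 96GiB rounds up to 128GiB.
GIB=$(( 1024 * 1024 * 1024 ))
POW2=1
while [ "$POW2" -lt "$(( 96 * GIB ))" ]; do POW2=$(( POW2 * 2 )); done
echo "BAR1 resized: $(( POW2 / GIB ))GiB"             # 128GiB
# BAR1 plus the 32MiB BAR3 no longer fits in a 128GiB window,
# and 0x3000000000 is what the kernel then asked for:
echo "requested window: $(( 0x3000000000 / GIB ))GiB" # 192GiB
```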
So at this stage I started to wonder if it was possible to make it work without significant code changes, but then I found out that I could actually manipulate the bridge registers manually, and things got interesting again.
PCI Configuration Space
DISCLAIMER, AGAIN: This is where your machine may stop working if you mess things up, and the information here is not even enough for copy & pasting and having something working, so please use these details as a resource rather than as a tutorial.
The PCI Configuration Address Space, or Configuration Space for short, is the way the PCI spec refers to the standardized registers that allow software to set up the PCI busses and their devices. Using these registers, we can actually go behind the kernel's back⁴, and change how the bridge address allocation looks, and make it work.
⁴ The kernel often does not like us going behind its back.
You will probably need to disable secure boot, for instance, as the kernel blocks these changes otherwise.
So let's look at the allocation situation again, and draft a plan:
4000000000-a0200fffff : PCI Bus 0000:00
4000000000-5fffffffff : PCI Bus 0000:60
6000000000-7fffffffff : PCI Bus 0000:01
6000000000-7fffffffff : PCI Bus 0000:02
6000000000-6017ffffff : PCI Bus 0000:03
6000000000-600fffffff : 0000:03:00.0
6010000000-6011ffffff : 0000:03:00.0
6018000000-65654fffff : PCI Bus 0000:04
6565500000-6ab29fffff : PCI Bus 0000:37
With that, here is one possible way to make it work (out of many):
To move the 64-bit ranges of the busses themselves we need to use registers 0x24, 0x28, and 0x2c, because these are type 1 devices (PCI-to-PCI bridges). I'm going to paste here the small script I use for this, but if you ever decide to do something like this, please familiarize yourself with the setpci tool and find some reasonable documentation for the PCI registers before running anything like it.
#!/bin/bash
set -ex
BRIDGE_BUS="0000:$1"
DEVICE="$(basename $(readlink /sys/class/pci_bus/$BRIDGE_BUS/device))"
test -d /sys/bus/pci/devices/$DEVICE
A="$2"
B="$3"
HIGH32A="$( printf "%08x" $(( 0x$A >> 32 )) )"
HIGH32B="$( printf "%08x" $(( 0x$B >> 32 )) )"
MID12A="$( printf "%04x" $(( (0x$A & 0xfff00000) >> 16 | 0x0001 )) )"
MID12B="$( printf "%04x" $(( (0x$B & 0xfff00000) >> 16 | 0x0001 )) )"
# Disable IO (bit 0) and memory (bit 1) decoding, preserve the rest.
OLDFLAGS="$( setpci -s $DEVICE 04.w )"
setpci -s $DEVICE 04.w=00:03
# Expand bridge window from A to B.
setpci -s $DEVICE 24.l="$MID12B$MID12A"
setpci -s $DEVICE 28.l="$HIGH32A"
setpci -s $DEVICE 2c.l="$HIGH32B"
# Reenable memory and IO, preserve the rest.
setpci -s $DEVICE 04.w=$OLDFLAGS
# Remove device.
echo 1 > /sys/bus/pci/devices/$DEVICE/remove
I'm using the command register 0x04 to disable/enable memory and IO access while the ranges are shifting, though in theory nothing should be using these devices right now (per above, the nvidia* modules are blacklisted). Also, it's worth noting how at the end the device is being "removed". That's the way I found to make the kernel forget what it knows, and go back to configuration space to look for the data again later when we rescan. If we don't remove it, it goes back to the prior state on rescan.
This script then allows us to do something like this:
# Drop address range from bridge 60 (base > limit)
rewindow.sh 60 ffffffffff 0000000000
# Expand bridge 01 to cover both ranges
rewindow.sh 01 4000000000 7fffffffff
# Rescan the new configuration.
echo 1 > /sys/bus/pci/rescan
With this alone 01 and 60 end up like this:
4000000000-a0200fffff : PCI Bus 0000:00
4000000000-7fffffffff : PCI Bus 0000:01
...
8000000000-9fffffffff : PCI Bus 0000:60
So the kernel respected the requested range for 01, and found a different range to allocate 60, which is fine by me.
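The register packing in the script can also be sanity-checked by hand for the 4000000000 => 7fffffffff expansion (per the type 1 header layout: the low word of 0x24 is the prefetchable memory base, the high word the limit, with 0x28/0x2c holding the upper 32 bits of each):

```shell
#!/bin/bash
A=4000000000; B=7fffffffff
printf 'base upper 32: %08x\n' $(( 0x$A >> 32 ))                       # 00000040
printf 'base word:     %04x\n' $(( (0x$A & 0xfff00000) >> 16 | 0x1 ))  # 0001 (low nibble 0x1 = 64-bit)
printf 'limit word:    %04x\n' $(( (0x$B & 0xfff00000) >> 16 | 0x1 ))  # fff1
```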
Using the same mechanism, I've moved bus 02 and 03 over, except that 03 needs an extended space that covers not only the 128GB needed by BAR1, but also the extra 32MB of BAR3. The end result was this:
4000000000-a0200fffff : PCI Bus 0000:00
4000000000-7fffffffff : PCI Bus 0000:01
4000000000-7fffffffff : PCI Bus 0000:02
4000000000-6017ffffff : PCI Bus 0000:03
4000000000-400fffffff : 0000:03:00.0
4010000000-4011ffffff : 0000:03:00.0
...
The GPU got correctly moved to the start of the new window, but BAR1 has not been resized yet, and needs fixing.
At this point there are two ways of fixing the GPU address range: we can do something similar to rewindow.sh above, except using the registers for type 0 devices (it worked when I tried). Or, we can do something slightly nicer by asking the kernel through the actual BAR resizing interface:
echo 17 > /sys/bus/pci/devices/0000:03:00.0/resource1_resize
The number 17 comes from 2^20 * 2^17, so 128GB. The interface takes the size encoded as a power of two in 1MiB units (as in the resizable BAR capability itself), which is why the value is 17 rather than 37.
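Under that encoding (my inference from the observed behavior, where value 0 would mean 1MiB), the value to write for a given size is simply log2 of the size in MiB:

```shell
#!/bin/bash
# Derive the resourceN_resize value: log2(size in MiB).
TARGET_MIB=$(( 128 * 1024 ))   # 128GiB expressed in MiB
EXP=0; SIZE=$TARGET_MIB
while [ "$SIZE" -gt 1 ]; do SIZE=$(( SIZE / 2 )); EXP=$(( EXP + 1 )); done
echo "$EXP"   # 17
```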
The final result should be similar to this:
4000000000-a0200fffff : PCI Bus 0000:00
4000000000-7fffffffff : PCI Bus 0000:01
4000000000-7fffffffff : PCI Bus 0000:02
4000000000-6017ffffff : PCI Bus 0000:03
4000000000-5fffffffff : 0000:03:00.0
6000000000-6001ffffff : 0000:03:00.0
...
Those memory ranges should get us a properly mapped device. Phew. We are ready to load the nvidia modules, and get the device working! Or, almost… just one more issue.
Power management
If we load the kernel module now, we indeed get a correct allocation of BAR1 and the device is happy about it. But there's still an issue related to, as best as I can tell, the power management of the PCI bus.
The easiest way to observe the symptom of the issue is via the performance status for the PCI bus. We can do that both via nvidia-smi and via lspci. Using nvidia-smi, which is easier to read, you'll find something similar to this as soon as you load the module:
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Device Current : 4
Device Max : 5
Host Max : 1
Link Width
Max : 16x
Current : 4x
...
GPU 00000000:03:00.0
Performance State : P0
This is perfect. Gen 4 is as good as it gets given that we're strangling it via Thunderbolt 4. The Link Width is the same story: good considering the context. The Host Max of 1 looks like a bug (how is the max 1 when the current is 4?), but it's irrelevant for us now. A Performance State of P0 means the GPU is at its maximum setting. Great.
So if those settings are good, then what's the issue? If you wait a few seconds, all these settings will fall down to their lowest levels, and not come back. Gen goes to 1, performance goes to P8, and they stay there. If we try to use the GPU meaningfully, we get an instant panic and reboot.
I haven't looked further, but my guess given the symptoms is that power management is down-clocking and down-powering the pipeline significantly, and when the GPU has a sudden spike of activity, either the driver or the actual hardware (GPU? PCI bus?) cannot keep up with demand and crashes harshly.
Given that guess, Iāve tried three different things, and the first works poorly, the second works okay with a caveat, and the third works properly.
The first one that works poorly is using an unsupported module option for nvidia:
options nvidia NVreg_RegistryDwords="RMForcePstate=0"
This works: it stops the crashes, Gen is held at 4, performance is held at P0 or P1, and you can use the GPU. The problem is that the power is capped at 70W for some reason, so it is vaguely as fast as the Strix Halo GPU, which is an interesting data point but not the goal. Again, this is an internal and unsupported setting, so we cannot speak of bugs here. We don't even know why it actually exists.
Then, the second that does work and gets you the full GPU, is to lock the clocks of the GPU high as soon as you load the module:
nvidia-smi -pm 1 && nvidia-smi -lgc 1000,3000
This locks the graphics clocks between 1 and 3GHz, and 1GHz seems to be enough to force the PCI bus to stay in Gen 4, and the performance of the GPU to not come down from P0 or P1 either. The caveat is obvious, though: if we don't power down, we don't power down. The GPU hovers around 40W, which is not ideal.
The third option then gets us properly functioning power management and still no crashes. After fiddling further with related options in the PCI configuration space, I found out that in the Link Control 2 register we have bit 5, with the meaning "hardware autonomous speed disable". Sounds familiar.
Setting it is straightforward:
sudo setpci -s 03:00.0 CAP_EXP+30.w=0020:0020
# Load the module, set persistent state.
nvidia-smi -pm 1
Now the GPU will go all the way down to P8 with 4W, and still be able to ramp up to full performance at 600W without crashing.
It works…
After this complicated dance, we have the full 600W of performance from a functioning RTX 6000 Pro over USB4. All tests so far worked well, and the fact it's external goes pretty much unnoticed, at least as far as my own use cases show for now. Upcoming work includes a more resilient automated setup, tests on other hosts, trying to get it to come up and down from a lower power state (fixed), and discussions with Canonical's kernel team to see what we might improve in the short term to make this process less painful.