Framework Desktop: How to Expand your Unified Memory For LLM Use
If you own a Framework Desktop, or are thinking about getting one (check our review!), this post may be useful to help you get the most out of your gear. If you are considering the Framework Desktop to run Large Language Models (LLMs) at home (which is an excellent choice, especially if you use Mixture of Experts models), you want to get the 128 GB RAM model. And after that, you certainly want to maximize the memory available to host such models.

By default, out of the box, on your typical Fedora install, the available memory is split in two parts: 64 GB for the iGPU and the rest for the system RAM. This is already good enough to run a bunch of models, but if you really want to make the most of it and, for example, use GPT-OSS 120b with a lot of context, you will be very short at 64 GB of VRAM, as the model already takes, by default, about 60 GB of memory to start with.
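If you want to see how the memory is currently split, the amdgpu driver exposes the relevant sizes (in bytes) through sysfs. This is just a quick sanity check, not a required step, and the card index may differ on your machine:
# Dedicated VRAM carve-out and GTT (system RAM usable by the GPU), reported in bytes
cat /sys/class/drm/card*/device/mem_info_vram_total
cat /sys/class/drm/card*/device/mem_info_gtt_total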
First stop: the BIOS
Hit F2 when starting your machine to enter the BIOS, go into the advanced settings, and look for the memory allocation for the iGPU. There are 4 different options available. The most drastic one consists in giving the strict minimum to the iGPU (0.5 GB of RAM), but that is not going to cut it for me, as I also want to use the Framework Desktop for gaming on the side. So I go for the 32 GB option for the iGPU, which still gives me more than 90 GB available for unified memory.
Note that this setting is necessary, but not sufficient.
Next stop: OS Settings
You just need to run the following commands to set 90 GB of Unified VRAM, and then reboot:
sudo grubby --update-kernel=ALL --args='ttm.pages_limit=23040000'
sudo grubby --update-kernel=ALL --args='amdttm.page_pool_size=23040000'
sudo reboot
In case you want to adjust to a different amount of VRAM, you can follow this formula (the kernel parameters are expressed as a number of 4 KiB pages): total VRAM in MB * 1024 * 1024 / 4096.
In our case: 90000 * 1024 * 1024 / 4096 equals 23040000.
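If you prefer not to do the math by hand, the shell can compute the page count for you (the 64 GB line is just an illustration of a different target):
echo $((90000 * 1024 * 1024 / 4096))   # 23040000 pages for 90 GB
echo $((64000 * 1024 * 1024 / 4096))   # 16384000 pages for 64 GB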
After reboot, you can inspect the setup and confirm it worked as expected:
sudo dmesg | grep "amdgpu.*memory"
And you should see the following result:

As you can see, the second line clearly shows the 90 GB of unified memory allocation.
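You can also confirm that the boot arguments themselves were applied by looking at the running kernel command line; both parameters should show up there:
cat /proc/cmdline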
And we can prove that it works by loading the full GPT-OSS 120b model, which takes more than 70 GB of VRAM with the 120k context window.

And we can see that the full GPU offload is working, as we get close to 50 tokens/s (the base speed when little context is used; it will of course decrease as the conversation context grows).

Very impressive!
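If you want to try the same kind of run yourself, here is a minimal sketch assuming llama.cpp (llama-server) and a GGUF build of GPT-OSS 120b; the file name is hypothetical, and any runner that supports full GPU offload will work just as well:
# -c sets the context window (~120k tokens), -ngl 99 offloads every layer to the iGPU
llama-server -m ./gpt-oss-120b.gguf -c 120000 -ngl 99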
At 50 tokens per second, you can do a LOT of things with this kind of hardware. We’ll talk use cases pretty soon.