Hi all,
At the Ubuntu Summit, we announced inference snaps, a framework for packaging and deploying silicon-optimized LLMs on Linux. At the time, we released two snaps for beta testing: one for Qwen 2.5 VL, and another for the DeepSeek R1 model.
We received constructive feedback from our silicon partners and the community, and have incorporated many of those suggestions into our tooling. These improvements will continue over the coming months, making inference snaps more user-friendly and performant.
Today, we are announcing a new snap that brings optimized Gemma 3 models to embedded and desktop environments. The gemma3 inference snap is now available on the stable channel:
sudo snap install gemma3
On installation, the snap detects your hardware and automatically deploys the most suitable model weights and runtime.
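You can inspect the result of that detection with standard snapd tooling (what exactly appears in the service logs varies by system, and the logs are not guaranteed to name the selected engine):

snap info gemma3
snap services gemma3
sudo snap logs gemma3

snap services lists the services the snap runs, and snap logs shows their recent output, which is a quick way to confirm the runtime started correctly.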
arm64 optimizations
The gemma3 snap provides new optimizations for arm64 platforms.
One of the models it can deploy is particularly well suited to smaller devices such as the Raspberry Pi: a 270 million parameter model trained and optimized by Google. On the Pi 4 and Pi 5 you can expect an average of 12 and 36 tokens per second respectively, which is plenty for many edge use cases. By default you'll get a 4 billion parameter model if your Pi has 4-16GB RAM, but you can switch to the smaller one manually:

sudo gemma3 use-engine cpu-xsmall
cpu-xsmall is the engine that bundles a CPU-optimized runtime and the extra-small 270M model. The CPU optimizations cover a wide range of Armv8 and Armv9 processors.
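As a quick sanity check after switching engines, you can send a request to the snap's local endpoint. This is a minimal sketch that assumes the snap serves an OpenAI-compatible chat completions API on localhost; the port (8080) and model name are placeholders, so check the snap's documentation for the actual values on your system:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "gemma3", "messages": [{"role": "user", "content": "Name three edge use cases for a Raspberry Pi."}]}'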
If your arm64 device has an NVIDIA GPU, e.g. an NVIDIA DGX Spark, the snap deploys the relevant CUDA optimizations.
amd64 optimizations
Similar to the existing inference snaps, the gemma3 snap provides optimizations for Intel CPUs, Intel GPUs, and NVIDIA GPUs. More silicon targets are being validated.
Please share your feedback here or via GitHub Discussions. If you run into any issues, report them to support the maintainers and the community.