GPU Setup Guide#
Zerfoo supports GPU acceleration through three backends: NVIDIA CUDA, AMD ROCm, and OpenCL (via CLBlast). GPU support is entirely optional – go build ./... produces a working CPU-only binary with zero configuration.
Supported GPUs#
| Backend | Vendor | Library | Loaded Via |
|---|---|---|---|
| CUDA | NVIDIA | libcudart.so, libcublas.so, libcudnn.so | purego/dlopen |
| ROCm | AMD | libamdhip64.so, librocblas.so, libMIOpen.so | purego/dlopen |
| OpenCL | Any | libOpenCL.so (+ CLBlast) | purego/dlopen |
Tested hardware includes NVIDIA DGX Spark, RTX 4090, A100, and H100 GPUs. AMD MI250X and MI300X are supported via ROCm. Any GPU with OpenCL support can use the CLBlast backend.
How GPU Detection Works (purego/dlopen)#
Zerfoo uses purego/dlopen to load GPU libraries at runtime. There are no build tags required for GPU support – the same binary works on machines with or without a GPU.
At startup, Zerfoo calls dlopen("libcudart.so") (or the equivalent for ROCm/OpenCL). If the shared library is found, all CUDA runtime function pointers are resolved via dlsym. If the library is not found, Zerfoo falls back to CPU execution silently.
This means:
- No CGo – GPU calls go through resolved function pointers, not
cgowrappers. Zeroruntime.cgocalloverhead. - No build tags – You do not need
-tags cudato build a GPU-capable binary. The defaultgo build ./...produces a binary that detects GPUs at runtime. - Graceful fallback – If CUDA is not installed,
cuda.Available()returnsfalseand the engine falls back to CPU. No crashes, no missing symbol errors.
The detection flow:
1. dlopen("libcudart.so.12") or dlopen("libcudart.so")
2. dlsym each required symbol (cudaMalloc, cudaFree, cudaMemcpy, ...)
3. Optionally resolve CUDA graph symbols (cudaStreamBeginCapture, ...)
4. cuda.Available() returns true if all required symbols resolvedCUDA Setup (NVIDIA)#
Prerequisites#
- NVIDIA GPU with compute capability 5.0+
- NVIDIA driver 525+ (CUDA 12.x compatible)
- CUDA Toolkit 12.0 or later (
libcudart.so.12must be on the library path)
Verify GPU and driver#
nvidia-smiExpected output shows your GPU model, driver version, and CUDA version:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
|=========================================+========================+======================|
| 0 GR Blackwell On | 00000009:01:00.0 Off | 0 |
| N/A 40C P0 18W / 100W | 0MiB / 131072MiB | 0% Default |
+-----------------------------------------+------------------------+----------------------+Verify CUDA toolkit#
nvcc --versionVerify CUDA is visible to Zerfoo#
Run a quick Go program to check:
package main
import (
"fmt"
"github.com/zerfoo/zerfoo/internal/cuda"
)
func main() {
fmt.Println("CUDA available:", cuda.Available())
if cuda.Available() {
count, _ := cuda.GetDeviceCount()
fmt.Println("Device count:", count)
}
}Or from the CLI:
zerfoo run gemma-3-1b-q4 # defaults to CPUTo force GPU usage via the library API:
mdl, err := inference.LoadFile("/path/to/model.gguf",
inference.WithDevice("cuda"), // use GPU 0
inference.WithDevice("cuda:1"), // use GPU 1
)cuBLAS#
cuBLAS is included with the CUDA Toolkit. Zerfoo uses it for GEMM/GEMV operations. No additional installation needed.
cuDNN#
cuDNN provides accelerated convolution, batch norm, activation, pooling, and softmax operations. Install from the NVIDIA cuDNN archive:
# Ubuntu/Debian (CUDA 12)
sudo apt install libcudnn8 libcudnn8-dev
# Or download from NVIDIA: https://developer.nvidia.com/cudnncuDNN operations in Zerfoo are non-interface methods on *GPUEngine (not part of Engine[T]). Layers that want cuDNN acceleration type-assert to *GPUEngine and call these methods directly:
Conv2dForward– cuDNN convolution with grouped conv, bias, IMPLICIT_GEMMBatchNormForwardInference– spatial batch normCudnnActivationForward– ReLU, Sigmoid, TanhCudnnPoolingForward– Max, AvgIncPad, AvgExcPadCudnnSoftmaxForward– channel-mode softmax
Building custom CUDA kernels#
Zerfoo ships 25+ hand-written CUDA kernels for fused operations (RMSNorm, SwiGLU, RoPE, flash attention, quantized GEMM/GEMV). To build the custom kernel shared library:
cd internal/cuda/kernels && make shared && cd ../../../This produces libkernels.so, which is loaded at runtime via dlopen. The kernels are optional – without them, Zerfoo falls back to cuBLAS and element-wise GPU operations.
NCCL (multi-GPU gradient exchange)#
For distributed training with GPU-native gradient exchange, install NCCL:
# Ubuntu/Debian
sudo apt install libnccl2 libnccl-devNCCL enables direct GPU-to-GPU collective operations (AllReduce, Broadcast, Barrier) without CPU involvement. The NcclStrategy[T] in the distributed/ package uses NCCL for intra-node GPU-GPU communication and falls back to gRPC for inter-node transport.
TensorRT (optional)#
TensorRT provides inference optimization via graph-to-TRT conversion, engine caching, and half-precision builds:
mdl, err := inference.LoadFile("/path/to/model.gguf",
inference.WithBackend("tensorrt"),
inference.WithPrecision("fp16"),
)- Engine cache stored at
~/.cache/zerfoo/tensorrt/(keyed by modelID, precision, gpuArch) - Returns
UnsupportedOpErrorfor ops not yet mapped to TRT layers
ROCm Setup (AMD)#
Prerequisites#
- AMD GPU (MI250X, MI300X, or other ROCm-supported cards)
- ROCm 6.0+ (
libamdhip64.so,librocblas.so,libMIOpen.so)
Verify GPU#
rocm-smiExpected output:
========================= ROCm System Management Interface =========================
================================ Concise Info =======================================
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 42.0c 30.0W 800Mhz 1600Mhz 0% auto 300.0W 0% 0%
=====================================================================================Verify ROCm version#
cat /opt/rocm/.info/version
# 6.0.0Use ROCm in Zerfoo#
mdl, err := inference.LoadFile("/path/to/model.gguf",
inference.WithDevice("rocm"),
)Ensure libamdhip64.so is on the library path:
ldconfig -p | grep amdhip
export LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATHOpenCL Setup (CLBlast)#
OpenCL works with any GPU vendor that provides an OpenCL driver.
Install OpenCL and CLBlast#
# Verify OpenCL is available
clinfo | head -20
# Ubuntu/Debian
sudo apt install libclblast-dev ocl-icd-opencl-dev
# Fedora
sudo dnf install clblast-devel ocl-icd-develUse OpenCL in Zerfoo#
mdl, err := inference.LoadFile("/path/to/model.gguf",
inference.WithDevice("opencl"),
)Multi-GPU#
Device selection#
Specify a device ID to target a specific GPU:
inference.WithDevice("cuda:0") // first GPU
inference.WithDevice("cuda:1") // second GPUDevice affinity#
The CUDA stack threads a device ID through all GPU objects:
- GPUEngine –
deviceIDfield; callscuda.SetDevice()before dispatching kernels - GPUStorage –
deviceIDfield;DeviceID()method exposes which GPU a tensor resides on - MemPool – per-device-per-size cache prevents cross-device pointer reuse
- CudaAllocator – binds allocations to the correct GPU via
SetDevice()
Parallelism strategies#
| Strategy | Complexity | Use Case |
|---|---|---|
| Data parallelism (same model on each GPU, split batches) | Medium | Training, batch inference |
| Pipeline parallelism (different layers on different GPUs) | High | Large models exceeding single GPU memory |
For distributed training across multiple GPUs or nodes, see the distributed/ package which provides gRPC + NCCL gradient exchange. NCCL handles intra-node GPU-GPU communication; gRPC handles inter-node transport.
Flash attention#
Flash attention fuses Q*K^T -> scale -> softmax -> V into a single tiled CUDA kernel. Memory drops from O(n^2) to O(n) and four kernel launches collapse to one.
- Online softmax via log-sum-exp trick (no full n x n scores matrix)
- Shared memory staging for K/V tiles (BLOCK_SIZE=64)
- Causal masking via tile skipping + per-element mask
- MAX_HEAD_DIM=128 (covers Gemma, Llama, Mistral, Qwen, Phi)
- Automatically used by GQA and MLA when mask is nil and data is on GPU
- Falls back to naive attention for: CPU data, head_dim > 128, arbitrary masks
Performance Tuning#
Quantization choice#
Lower-bit quantization reduces memory and increases throughput at the cost of quality:
| Quantization | Bits/Weight | Memory | Speed | Quality |
|---|---|---|---|---|
| FP32 | 32 | Baseline | Slow | Best |
| FP16 | 16 | 2x smaller | Fast | Near-best |
| Q8_0 | 8 | 4x smaller | Faster | Good |
| Q4_K_M | 4 | 8x smaller | Fastest | Acceptable |
For most use cases, Q4_K_M provides the best speed/quality tradeoff. Zerfoo achieves 241 tok/s on Gemma 3 1B Q4_K_M on a DGX Spark (19% faster than Ollama on the same hardware).
Compute precision#
Use FP16 compute precision for faster GPU inference with minimal quality loss:
mdl, err := inference.LoadFile("/path/to/model.gguf",
inference.WithDevice("cuda"),
inference.WithDType("fp16"), // FP16 activations
inference.WithKVDtype("fp16"), // FP16 KV cache (halves bandwidth)
)Context length#
Longer context lengths increase KV cache memory. Reduce MaxSeqLen if you don’t need full context:
mdl, err := inference.LoadFile("/path/to/model.gguf",
inference.WithDevice("cuda"),
inference.WithMaxSeqLen(2048), // default varies by model (up to 128K)
)CUDA graph capture#
Zerfoo automatically captures the inference computation graph as a CUDA graph when CUDA 10.0+ graph APIs are available. This eliminates per-token kernel launch overhead and covers 99.5% of instructions on the GGUF inference path. No configuration needed – it is enabled by default.
VRAM requirements#
| Model | Quantization | VRAM |
|---|---|---|
| Gemma 3 1B | Q4_K_M | ~1 GB |
| Llama 3 8B | Q4_K_M | ~5 GB |
| Mistral 7B | Q4_K_M | ~5 GB |
| Qwen 2.5 7B | Q4_K_M | ~5 GB |
Troubleshooting#
GPU not detected (CUDA)#
Symptom: cuda.Available() returns false even though nvidia-smi shows a GPU.
Fix: Ensure libcudart.so is on the library search path:
# Check if libcudart is findable
ldconfig -p | grep cudart
# If not found, add the CUDA lib directory
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATHCUDA version mismatch#
Symptom: dlopen libcudart failed or dlsym cudaMalloc failed.
Fix: Zerfoo looks for libcudart.so.12 first, then libcudart.so. Ensure your CUDA toolkit is version 12.x:
ls /usr/local/cuda/lib64/libcudart.so*
# Should show libcudart.so.12.x.xIf you have CUDA 11.x, create a symlink (not recommended – upgrade to CUDA 12):
sudo ln -s /usr/local/cuda/lib64/libcudart.so.11.8 /usr/local/cuda/lib64/libcudart.so.12Out of memory (OOM)#
Symptom: CUDA error: out of memory during model loading or inference.
Fix: Use a more aggressively quantized model or reduce context length:
# Check GPU memory usage
nvidia-smi
# Use Q4 quantization (smallest memory footprint)
zerfoo pull gemma-3-1b-q4
zerfoo run gemma-3-1b-q4From the API, reduce max_tokens to lower peak memory:
mdl, err := inference.LoadFile("/path/to/model.gguf",
inference.WithDevice("cuda"),
inference.WithMaxSeqLen(2048), // reduce from default 8192
)ROCm not detected#
Symptom: ROCm device requested but binary built without rocm build tag
Fix: Check that libamdhip64.so is on the library path:
ldconfig -p | grep amdhip
export LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATHOpenCL not detected#
Fix: Ensure libOpenCL.so and CLBlast are installed:
ldconfig -p | grep -i opencl
# If missing:
sudo apt install ocl-icd-opencl-dev libclblast-dev