Model Loading and Architecture Support#
This tutorial covers the GGUF model format, the architectures Zerfoo supports, how to load models programmatically with various options, and what quantization levels mean for memory and quality.
The GGUF Format#
GGUF (GPT-Generated Unified Format) is a single-file model format designed for efficient inference. A GGUF file contains everything needed to run a model:
- Metadata: architecture name, vocabulary size, hidden dimensions, RoPE parameters, chat template, and more.
- Tokenizer: the full BPE vocabulary and merge rules embedded in the file’s metadata section.
- Tensors: all model weights, stored in their quantized or full-precision representation with shape information.
Zerfoo uses GGUF as its sole model format. When you call inference.LoadFile, the framework parses the GGUF header, extracts the tokenizer, reads the architecture metadata, and builds a typed computation graph – all without any external config files.
model, err := inference.LoadFile("path/to/model.gguf")GGUF files are memory-mapped by default. Zerfoo maps the file into virtual address space and lets the OS page tensor data from disk on demand — no weights are copied into heap memory at startup. This gives near-instant load times regardless of model size and allows loading models larger than physical RAM.
// mmap is the default — no options needed
model, err := inference.LoadFile("model.gguf")
// Opt out for heap loading (required for CUDA graph capture)
model, err := inference.LoadFile("model.gguf",
    inference.WithMmap(false),
)
Split GGUF files (the -NNNNN-of-NNNNN.gguf naming convention used for 70B+ models from HuggingFace) are detected and loaded automatically. Pass the path to the first shard — Zerfoo finds the rest.
// Load a 138 GB model (3 shards) on a 128 GB machine
model, err := inference.LoadFile("MiniMax-M2-Q4_K_M-00001-of-00003.gguf")Supported Architectures#
Zerfoo includes architecture-specific graph builders for each model family. The architecture is detected automatically from GGUF metadata – you do not need to specify it.
| Architecture | Key Features | Example Models |
|---|---|---|
| Llama 3 | RoPE theta=500K, GQA | Llama 3.2 1B/3B, Llama 3.1 8B/70B |
| Llama 4 | Extended Llama architecture | Llama 4 Scout |
| Gemma 3 | Tied embeddings, embedding scaling, QK norms, logit softcap | Gemma 3 1B/4B/12B/27B |
| Gemma 3n | Efficient on-device Gemma 3 variant | Gemma 3n |
| Mistral | Sliding window attention | Mistral 7B v0.3 |
| Mixtral | Mixture of experts (MoE) with sliding window | Mixtral 8x7B |
| Qwen 2 | Attention bias, RoPE theta=1M | Qwen 2.5 7B/14B/72B |
| Phi 3/4 | Partial rotary factor | Phi-3 Mini, Phi-4 |
| DeepSeek V3 | Multi-head Latent Attention (MLA), batched MoE | DeepSeek V3 |
| Falcon | Multi-query attention | Falcon 7B/40B |
| Command-R | Optimized for retrieval-augmented generation (RAG) | Command-R |
| Jamba | Hybrid Mamba-Transformer architecture | Jamba |
| Mamba/Mamba3 | State-space model (SSM), no attention | Mamba |
| LLaVA | Vision-language multimodal | LLaVA |
Each architecture has a dedicated builder in the inference/ package (e.g., arch_llama.go, arch_gemma.go, arch_deepseek.go). The builder reads architecture-specific metadata fields and constructs the computation graph with the correct layer structure, attention mechanism, and normalization.
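Because the architecture is resolved from metadata, application code rarely needs to know which family it got; when it does (for example, to tailor prompting or logging), you can branch on the value reported by Info(), shown under "Inspecting Model Metadata" below. The sketch assumes an import path and a set of architecture strings purely for illustration.

```go
// NOTE: the import path is an assumption; use the module path your Zerfoo
// checkout actually exposes for the inference package.
package main

import (
    "fmt"
    "log"

    "github.com/zerfoo/zerfoo/inference"
)

func main() {
    model, err := inference.LoadFile("model.gguf")
    if err != nil {
        log.Fatal(err)
    }
    defer model.Close()

    // The architecture strings below are illustrative; check Info() output
    // for the exact names your GGUF files report.
    switch arch := model.Info().Architecture; arch {
    case "llama", "gemma3", "qwen2":
        fmt.Println("transformer family:", arch)
    case "mamba":
        fmt.Println("state-space model, no attention:", arch)
    default:
        fmt.Println("architecture:", arch)
    }
}
```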
Loading Models Programmatically#
The inference.LoadFile function accepts functional options that control device placement, precision, and sequence length.
Device Selection#
// CPU inference (default).
model, err := inference.LoadFile("model.gguf")
// CUDA GPU inference.
model, err := inference.LoadFile("model.gguf",
inference.WithDevice("cuda"),
)Compute Precision#
// FP16 compute -- activations are converted F32->FP16 before GPU kernels.
model, err := inference.LoadFile("model.gguf",
inference.WithDevice("cuda"),
inference.WithDType("fp16"),
)
// FP8 quantization -- weights are quantized to FP8 E4M3 at load time.
model, err := inference.LoadFile("model.gguf",
inference.WithDevice("cuda"),
inference.WithDType("fp8"),
)Sequence Length#
Override the model’s default maximum context length:
model, err := inference.LoadFile("model.gguf",
    inference.WithMaxSeqLen(4096),
)
TensorRT Backend#
For maximum throughput on NVIDIA GPUs, enable the TensorRT backend:
model, err := inference.LoadFile("model.gguf",
inference.WithDevice("cuda"),
inference.WithBackend("tensorrt"),
inference.WithPrecision("fp16"),
)Model Aliases#
Zerfoo maintains a table of short aliases for popular HuggingFace repositories. You can resolve an alias to its full repo ID or register your own:
// Resolves "gemma-3-1b-q4" -> "google/gemma-3-1b-it-qat-q4_0-gguf"
repoID := inference.ResolveAlias("gemma-3-1b-q4")
// Register a custom alias.
inference.RegisterAlias("my-model", "myorg/my-model-GGUF")Understanding Quantization#
Quantization reduces model weights from 16- or 32-bit floats to lower-precision integers, trading a small amount of quality for significant memory savings and faster inference.
Common GGUF quantization types:
| Type | Bits/Weight | Memory (7B model) | Quality | Use Case |
|---|---|---|---|---|
| F16 | 16 | ~14 GB | Baseline | Full quality, GPU with ample VRAM |
| Q8_0 | 8 | ~7 GB | Near-lossless | When quality matters more than memory |
| Q4_K_M | ~4.5 | ~4 GB | Good | Recommended default for most users |
| Q4_0 | 4 | ~3.5 GB | Acceptable | Minimum viable quality |
The quantization type is baked into the GGUF file at conversion time. Zerfoo reads the quantization metadata from each tensor and applies the correct dequantization during inference. You do not need to specify the quantization type at load time.
For a 1B parameter model like Gemma 3 1B with Q4_K_M quantization, expect roughly 800 MB of memory usage – small enough to run on a laptop CPU.
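The memory column in the table above follows directly from parameter count times bits per weight. The sketch below works through that arithmetic for a 7B model (the estimateGB helper is purely illustrative); real files run somewhat larger because of quantization block scales and tensors that stay at higher precision, such as embeddings.

```go
package main

import "fmt"

// estimateGB is a rough size estimate: parameters * bits-per-weight / 8 bytes.
// It ignores per-block scale overhead and unquantized tensors, so actual
// GGUF files are somewhat larger.
func estimateGB(params, bitsPerWeight float64) float64 {
    return params * bitsPerWeight / 8 / 1e9
}

func main() {
    // A 7B-parameter model at the bit widths from the table above.
    quants := []struct {
        name string
        bits float64
    }{
        {"F16", 16}, {"Q8_0", 8}, {"Q4_K_M", 4.5}, {"Q4_0", 4},
    }
    for _, q := range quants {
        fmt.Printf("%-7s ~%.1f GB\n", q.name, estimateGB(7e9, q.bits))
    }
}
```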
Inspecting Model Metadata#
After loading a model, you can access its metadata:
model, err := inference.LoadFile("model.gguf")
if err != nil {
    log.Fatal(err)
}
defer model.Close()
info := model.Info()
fmt.Printf("Architecture: %s\n", info.Architecture)
fmt.Printf("Parameters: %d\n", info.Parameters)Next Steps#
- Text Generation Deep Dive – sampling strategies, streaming, and performance tuning.
- Running the OpenAI-Compatible API Server – serve models over HTTP.