Model Loading and Architecture Support#

This tutorial covers the GGUF model format, the architectures Zerfoo supports, how to load models programmatically with various options, and what quantization levels mean for memory and quality.

The GGUF Format#

GGUF (GPT-Generated Unified Format) is a single-file model format designed for efficient inference. A GGUF file contains everything needed to run a model:

  • Metadata: architecture name, vocabulary size, hidden dimensions, RoPE parameters, chat template, and more.
  • Tokenizer: the full BPE vocabulary and merge rules embedded in the file’s metadata section.
  • Tensors: all model weights, stored in their quantized or full-precision representation with shape information.

Zerfoo uses GGUF as its sole model format. When you call inference.LoadFile, the framework parses the GGUF header, extracts the tokenizer, reads the architecture metadata, and builds a typed computation graph – all without any external config files.

model, err := inference.LoadFile("path/to/model.gguf")

GGUF files are memory-mapped by default. Zerfoo maps the file into virtual address space and lets the OS page tensor data from disk on demand — no weights are copied into heap memory at startup. This gives near-instant load times regardless of model size and allows loading models larger than physical RAM.

// mmap is the default — no options needed
model, err := inference.LoadFile("model.gguf")

// Opt out for heap loading (required for CUDA graph capture)
model, err := inference.LoadFile("model.gguf",
	inference.WithMmap(false),
)

Split GGUF files (the -NNNNN-of-NNNNN.gguf naming convention used for 70B+ models from HuggingFace) are detected and loaded automatically. Pass the path to the first shard — Zerfoo finds the rest.

// Load a 138 GB model (3 shards) on a 128 GB machine
model, err := inference.LoadFile("MiniMax-M2-Q4_K_M-00001-of-00003.gguf")

Supported Architectures#

Zerfoo includes architecture-specific graph builders for each model family. The architecture is detected automatically from GGUF metadata – you do not need to specify it.

| Architecture | Key Features | Example Models |
| --- | --- | --- |
| Llama 3 | RoPE theta=500K, GQA | Llama 3.2 1B/3B, Llama 3.1 8B/70B |
| Llama 4 | Extended Llama architecture | Llama 4 Scout |
| Gemma 3 | Tied embeddings, embedding scaling, QK norms, logit softcap | Gemma 3 1B/4B/12B/27B |
| Gemma 3n | Gemma 3 nano variant | Gemma 3n |
| Mistral | Sliding window attention | Mistral 7B v0.3 |
| Mixtral | Mixture of experts (MoE) with sliding window | Mixtral 8x7B |
| Qwen 2 | Attention bias, RoPE theta=1M | Qwen 2.5 7B/14B/72B |
| Phi 3/4 | Partial rotary factor | Phi-3 Mini, Phi-4 |
| DeepSeek V3 | Multi-head Latent Attention (MLA), batched MoE | DeepSeek V3 |
| Falcon | Multi-query attention | Falcon 7B/40B |
| Command-R | Retrieval-augmented generation architecture | Command-R |
| Jamba | Hybrid Mamba-Transformer architecture | Jamba |
| Mamba/Mamba3 | State-space model (SSM), no attention | Mamba |
| LLaVA | Vision-language multimodal | LLaVA |

Each architecture has a dedicated builder in the inference/ package (e.g., arch_llama.go, arch_gemma.go, arch_deepseek.go). The builder reads architecture-specific metadata fields and constructs the computation graph with the correct layer structure, attention mechanism, and normalization.

Loading Models Programmatically#

The inference.LoadFile function accepts functional options that control device placement, precision, and sequence length.

Device Selection#

// CPU inference (default).
model, err := inference.LoadFile("model.gguf")

// CUDA GPU inference.
model, err := inference.LoadFile("model.gguf",
	inference.WithDevice("cuda"),
)

Compute Precision#

// FP16 compute -- activations are converted F32->FP16 before GPU kernels.
model, err := inference.LoadFile("model.gguf",
	inference.WithDevice("cuda"),
	inference.WithDType("fp16"),
)

// FP8 quantization -- weights are quantized to FP8 E4M3 at load time.
model, err := inference.LoadFile("model.gguf",
	inference.WithDevice("cuda"),
	inference.WithDType("fp8"),
)

Sequence Length#

Override the model’s default maximum context length:

model, err := inference.LoadFile("model.gguf",
	inference.WithMaxSeqLen(4096),
)

TensorRT Backend#

For maximum throughput on NVIDIA GPUs, enable the TensorRT backend:

model, err := inference.LoadFile("model.gguf",
	inference.WithDevice("cuda"),
	inference.WithBackend("tensorrt"),
	inference.WithPrecision("fp16"),
)

Model Aliases#

Zerfoo maintains a table of short aliases for popular HuggingFace repositories. You can resolve an alias to its full repo ID or register your own:

// Resolves "gemma-3-1b-q4" -> "google/gemma-3-1b-it-qat-q4_0-gguf"
repoID := inference.ResolveAlias("gemma-3-1b-q4")

// Register a custom alias.
inference.RegisterAlias("my-model", "myorg/my-model-GGUF")

Understanding Quantization#

Quantization reduces model weights from 16- or 32-bit floats to lower-precision integers, trading a small amount of quality for significant memory savings and faster inference.

Common GGUF quantization types:

| Type | Bits/Weight | Memory (7B model) | Quality | Use Case |
| --- | --- | --- | --- | --- |
| F16 | 16 | ~14 GB | Baseline | Full quality, GPU with ample VRAM |
| Q8_0 | 8 | ~7 GB | Near-lossless | Best quality-to-size ratio |
| Q4_K_M | ~4.5 | ~4 GB | Good | Recommended default for most users |
| Q4_0 | 4 | ~3.5 GB | Acceptable | Minimum viable quality |

The quantization type is baked into the GGUF file at conversion time. Zerfoo reads the quantization metadata from each tensor and applies the correct dequantization during inference. You do not need to specify the quantization type at load time.

For a 1B parameter model like Gemma 3 1B with Q4_K_M quantization, expect roughly 800 MB of memory usage – small enough to run on a laptop CPU.

Inspecting Model Metadata#

After loading a model, you can access its metadata:

model, err := inference.LoadFile("model.gguf")
if err != nil {
	log.Fatal(err)
}
defer model.Close()

info := model.Info()
fmt.Printf("Architecture: %s\n", info.Architecture)
fmt.Printf("Parameters: %d\n", info.Parameters)

Next Steps#