ONNX to GGUF Conversion
This guide walks through converting an ONNX model to GGUF format using zonnx. The resulting GGUF file can be loaded by zerfoo or llama.cpp.
Prerequisites
- zonnx installed (go install github.com/zerfoo/zonnx/cmd/zonnx@latest)
- An ONNX model file, either local or on HuggingFace
Step 1: Download a Model from HuggingFace
Use the download command to fetch an ONNX model and its tokenizer files:
zonnx download --model google/gemma-2-2b-it --output ./models

For gated models that require authentication:
# Via flag
zonnx download --model meta-llama/Llama-3-8B --output ./models --api-key YOUR_HF_TOKEN
# Via environment variable
export HF_API_KEY=YOUR_HF_TOKEN
zonnx download --model meta-llama/Llama-3-8B --output ./models

The --api-key flag takes precedence over the HF_API_KEY environment variable.
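If you wrap zonnx in your own script, the same precedence can be mirrored with a small helper. This is a minimal sketch, not zonnx's implementation: `resolve_hf_token` is a hypothetical function, and it assumes an empty flag value means "flag not given".

```shell
# Hypothetical wrapper helper (not part of zonnx).
# resolve_hf_token FLAG_VALUE
#   Prints FLAG_VALUE when it is non-empty, otherwise the value of HF_API_KEY.
resolve_hf_token() {
  flag_value="$1"
  if [ -n "$flag_value" ]; then
    printf '%s\n' "$flag_value"
  else
    printf '%s\n' "${HF_API_KEY:-}"
  fi
}

HF_API_KEY=token_from_env
resolve_hf_token "token_from_flag"   # the flag value wins
resolve_hf_token ""                  # falls back to HF_API_KEY
```

Whichever way the token is supplied, avoid committing it to version control or leaving it in shared shell history.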
After downloading, you should have at minimum:
models/
  model.onnx
  config.json      # optional but recommended for metadata
  tokenizer.json   # downloaded automatically if available

Step 2: Convert to GGUF
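Before converting, it can help to confirm that the download produced the files listed in Step 1. The helper below is a sketch under the assumption that the files sit directly in the output directory; `check_model_dir` is a hypothetical name, not a zonnx command.

```shell
# Hypothetical pre-flight check (not part of zonnx).
# check_model_dir DIR
#   Fails if DIR/model.onnx is missing; only prints a note for optional files.
check_model_dir() {
  dir="$1"
  if [ ! -f "$dir/model.onnx" ]; then
    echo "missing required file: $dir/model.onnx"
    return 1
  fi
  for optional in config.json tokenizer.json; do
    [ -f "$dir/$optional" ] || echo "note: $dir/$optional not found"
  done
  echo "ok: $dir/model.onnx found"
}
```

Call it as `check_model_dir ./models`. A missing config.json is reported as a note rather than an error because the guide treats it as optional.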
Run the convert command with the appropriate --arch flag:
zonnx convert --arch gemma --output ./models/gemma-2b.gguf ./models/model.onnx

The --arch Flag
The --arch flag selects the tensor name mapping and metadata mapping for the target architecture. If a config.json file exists alongside the ONNX file, zonnx reads it automatically and maps HuggingFace config fields to GGUF metadata keys.
If --arch is omitted, it defaults to llama.
Convert Command Flags
| Flag | Default | Description |
|---|---|---|
| --output | <input-dir>/<input-base>.gguf | Output GGUF file path |
| --arch | llama | Model architecture for metadata and tensor mapping |
| --format | onnx | Input format: onnx or safetensors |
| --quantize | (none) | Quantize weights: q4_0 or q8_0 |
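To see where the output lands when --output is omitted, the default from the table can be reproduced in shell. This sketch assumes that `<input-base>` means the input file name with its final extension removed; `default_output` is a hypothetical helper, not a zonnx command.

```shell
# Hypothetical helper (not part of zonnx).
# default_output INPUT
#   Prints <input-dir>/<input-base>.gguf for the given input path.
#   Assumption: <input-base> is the file name without its last extension.
default_output() {
  input="$1"
  dir=$(dirname "$input")
  base=$(basename "$input")
  printf '%s/%s.gguf\n' "$dir" "${base%.*}"
}

default_output ./models/model.onnx   # prints ./models/model.gguf
```

Passing --output explicitly, as the examples in this guide do, avoids relying on this derivation.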
Step 3: Quantize During Conversion (Optional)
To reduce model size, quantize weights during conversion:
# 4-bit quantization (smallest, some quality loss)
zonnx convert --arch gemma --quantize q4_0 --output ./models/gemma-2b-q4.gguf ./models/model.onnx
# 8-bit quantization (good balance of size and quality)
zonnx convert --arch gemma --quantize q8_0 --output ./models/gemma-2b-q8.gguf ./models/model.onnx

| Quantization | Bits per Weight | Use Case |
|---|---|---|
| (none) | 32 | Full precision, largest file |
| q8_0 | 8 | Good quality, ~4x smaller than F32 |
| q4_0 | 4 | Smallest, ~8x smaller than F32 |
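The size ratios in the table follow from bits per weight: weight bytes are roughly parameter count times bits divided by 8. The sketch below applies that arithmetic to a hypothetical model with 2000 million parameters. It uses the nominal bit widths from the table and ignores file metadata, per-block quantization overhead, and any tensors left unquantized, so real files will differ somewhat.

```shell
# Rough size arithmetic only; not a zonnx command.
# estimate_mb PARAMS_MILLIONS BITS
#   Prints the approximate weight size in megabytes (10^6 bytes):
#   (PARAMS_MILLIONS * 10^6 weights) * (BITS / 8 bytes) = PARAMS_MILLIONS * BITS / 8 MB.
estimate_mb() {
  params_millions="$1"
  bits="$2"
  echo $(( params_millions * bits / 8 ))
}

echo "f32:  $(estimate_mb 2000 32) MB"   # 8000 MB
echo "q8_0: $(estimate_mb 2000 8) MB"    # 2000 MB
echo "q4_0: $(estimate_mb 2000 4) MB"    # 1000 MB
```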
Step 4: Verify the Output
Inspect the generated GGUF file to confirm metadata and tensors:
zonnx inspect --pretty ./models/gemma-2b.gguf

Supported Architectures
| Architecture | --arch value | Tensor Mapping | Notes |
|---|---|---|---|
| Llama | llama (default) | Decoder layers (model.layers.N.*) | Llama 3, Code Llama |
| Gemma | gemma | Decoder layers (model.layers.N.*) | Gemma, Gemma 2, Gemma 3 |
| BERT | bert | Encoder layers (bert.encoder.layer.N.*) | Classification, embeddings |
| RoBERTa | roberta | Encoder layers (roberta.encoder.layer.N.*) | Same structure as BERT |
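When scripting conversions for several models, the --arch value can be chosen from the `model_type` field that HuggingFace config.json files typically carry. This is a sketch of one possible mapping onto the table above, not something zonnx is documented to do; `arch_for_model_type` is a hypothetical helper, and treating every `gemma*` type as `gemma` is an assumption.

```shell
# Hypothetical helper (not part of zonnx).
# arch_for_model_type MODEL_TYPE
#   Prints the --arch value from the table above, or fails for unknown types.
arch_for_model_type() {
  case "$1" in
    llama)   echo llama ;;
    gemma*)  echo gemma ;;     # assumption: gemma, gemma2, gemma3 all use "gemma"
    bert)    echo bert ;;
    roberta) echo roberta ;;
    *)       echo "unsupported model_type: $1" >&2; return 1 ;;
  esac
}

arch_for_model_type gemma2    # prints gemma
arch_for_model_type roberta   # prints roberta
```

Failing on unknown types, rather than falling back to the llama default, makes an accidental mismatch visible before conversion starts.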
Metadata Mapping
When a config.json file is present alongside the ONNX model, zonnx maps these HuggingFace fields to GGUF metadata:
| config.json field | GGUF key |
|---|---|
| hidden_size | {arch}.embedding_length |
| num_hidden_layers | {arch}.block_count |
| num_attention_heads | {arch}.attention.head_count |
| num_key_value_heads | {arch}.attention.head_count_kv |
| intermediate_size | {arch}.feed_forward_length |
| vocab_size | {arch}.vocab_size |
| max_position_embeddings | {arch}.context_length |
| rms_norm_eps | {arch}.attention.layer_norm_rms_epsilon |
| rope_theta | {arch}.rope.freq_base |
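In the table above, `{arch}` presumably stands for the --arch value, so a model converted with --arch gemma would carry keys such as gemma.embedding_length. The sketch below transcribes the table into a lookup to show the expected key for a given field, which can be handy when checking the output of zonnx inspect. `gguf_key` is a hypothetical helper, not part of zonnx.

```shell
# Hypothetical lookup transcribed from the mapping table (not part of zonnx).
# gguf_key ARCH FIELD
#   Prints the GGUF metadata key for a config.json field, or fails if unmapped.
gguf_key() {
  arch="$1"
  case "$2" in
    hidden_size)             suffix=embedding_length ;;
    num_hidden_layers)       suffix=block_count ;;
    num_attention_heads)     suffix=attention.head_count ;;
    num_key_value_heads)     suffix=attention.head_count_kv ;;
    intermediate_size)       suffix=feed_forward_length ;;
    vocab_size)              suffix=vocab_size ;;
    max_position_embeddings) suffix=context_length ;;
    rms_norm_eps)            suffix=attention.layer_norm_rms_epsilon ;;
    rope_theta)              suffix=rope.freq_base ;;
    *)                       return 1 ;;
  esac
  printf '%s.%s\n' "$arch" "$suffix"
}

gguf_key gemma hidden_size   # prints gemma.embedding_length
gguf_key llama rope_theta    # prints llama.rope.freq_base
```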
Using the GGUF File with Zerfoo
Once converted, load the model with zerfoo:
zerfoo run ./models/gemma-2b.gguf --prompt "Hello, world!"

Or serve it as an OpenAI-compatible API:
zerfoo serve ./models/gemma-2b.gguf