Benchmarks#
Performance measurements, comparisons with other runtimes, and methodology for reproducing all benchmark results.
Hardware#
All official benchmarks run on a single machine:
| Component | Spec |
|---|---|
| System | NVIDIA DGX Spark |
| SoC | NVIDIA Grace Blackwell GB10 (sm_121) |
| Memory | 128 GB unified LPDDR5x (~200 GB/s bandwidth) |
| CUDA | 13.0 |
| OS | Linux (aarch64) |
| Go | 1.25.0 |
Model#
| Property | Value |
|---|---|
| Model | Gemma 3 1B Instruct |
| Format | GGUF |
| Quantization | Q4_K_M |
| File size | ~0.8 GB |
| Source | google/gemma-3-1b-it-GGUF |
Results#
Multi-Model Comparison: Zerfoo vs Ollama (2026-03-27)#
Head-to-head decode throughput on DGX Spark GB10. 128 tokens, 3 runs (median), greedy sampling (temp=0). All models produce coherent output after the GQA repeat fix (ztensor v0.6.3) and flash attention decode fix (zerfoo v1.25.5).
| Model | Size | Zerfoo (tok/s) | Ollama (tok/s) | Ratio | Winner |
|---|---|---|---|---|---|
| Gemma 3 1B Q4_K_M | 1B | 241 | 188 | 1.28x | Zerfoo |
| DeepSeek R1 1.5B Q4_K_M | 1.5B | 186 | 167 | 1.11x | Zerfoo |
| Llama 3.2 3B Q4_K_M | 3B | 92 | 93 | 0.99x | ~Even |
| Mistral 7B Q5_K_M | 7B | 44 | 44 | 1.00x | ~Even |
Summary: 25% faster on small models, parity at 7B. All four models now produce coherent output with CUDA graph capture enabled.
Additional architectures (Qwen, Phi, Mixtral, Command-R, Falcon, Mamba, RWKV) will be added as GGUF files are acquired and parser compatibility is resolved.
Gemma 3 1B Baseline (2026-03-20)#
| Model | Format | Tok/s | CUDA Graph | Tokens | Notes |
|---|---|---|---|---|---|
| Gemma 3 1B | Q4_K_M | 244.45 | Yes | 256 | Regression fixed (E103) |
| Gemma 3 1B | Q4_K_M | 244.18 | Yes | 256 | Run 2 |
| Gemma 3 1B | Q4_K_M | 244.62 | Yes | 256 | Run 3 |
Roofline analysis: GB10 LPDDR5x ~200 GB/s, max ~257 tok/s for 778 MB model. Current 244 tok/s = 95% bandwidth utilization. 500 tok/s target requires hardware with higher memory bandwidth (A100/H100).
Previous Baseline (2026-03-17)#
| Model | Format | Tok/s | CUDA Graph | Tokens |
|---|---|---|---|---|
| Gemma 3 1B | Q4_K_M | 219.17 | Yes | 50 |
| Gemma 3 1B | Q4_K_M | 245.15 | Yes | 256 |
| Gemma 3 1B | Q4_K_M | 248.47 | Yes | 512 |
| Gemma 3 1B | Q4_K_M | 174.44 | No | 256 |
| Ollama gemma3:1b | - | 203.60 | - | 989 |
Note: dp4a INT8 GEMV and arena free-list reuse merged into ztensor. At batch=1 decode, throughput is memory-bandwidth-bound so dp4a shows parity as expected. dp4a benefits will appear at larger batch sizes where compute becomes the bottleneck.
Previous Baseline (2026-03-16)#
| Model | Format | Tok/s | CUDA Graph % | Output Quality | Tokens |
|---|---|---|---|---|---|
| Gemma 3 1B | GGUF Q4_0 | 103.22 (CUDA) / 9.20 (CPU) | 99.5% | Coherent | 50 |
| Gemma 3 1B FP16 | GGUF Q4_0 + FP16 | 9.20 (CPU) | - | Valid | 30 |
| Gemma 3 1B FP8 | GGUF Q4_0 + FP8 | ~0.5 (CPU) | - | Valid | 30 |
| TinyLlama 1.1B | GGUF Q4_K_M | 7.18 (CPU) | - | Low (small model) | 50 |
| Qwen 2.5 0.5B | GGUF Q4_K_M | ~13 (CPU) | - | Garbled (tokenizer bug) | 50 |
| Mistral 7B | GGUF Q4_K_M | ~0.55 (CPU) | - | Low (loads as llama) | 50 |
| Phi-3.5 mini | GGUF Q4_K_M | - | - | FAIL (merged QKV) | - |
Previous Baseline (2026-03-15)#
| Model | Format | Tok/s | CUDA Graph % | Output Quality | Tokens |
|---|---|---|---|---|---|
| Gemma 3 1B | GGUF Q4_K | 241 | 99.5% | Baseline | 256 |
| Llama 3 1B | GGUF | 12.93 | 2.0% | Semi-coherent | 20 |
| Qwen 2.5 0.5B | GGUF | 15.79 | 1.8% | Working (rep. penalty helps) | 20 |
| Mistral 7B | GGUF | 3.94 | 1.2% | Working (spaces fixed) | 20 |
| Phi-3 mini | GGUF | 4.14 | 0.5% | Semi-coherent | 20 |
Comparison: Zerfoo vs Ollama vs llama.cpp#
Gemma 3 1B (Primary Benchmark)#
| Framework | Version | Tokens | Tok/s (decode) | CUDA Graphs | Notes |
|---|---|---|---|---|---|
| Zerfoo | latest | 128 | 241 | Yes | Multi-model benchmark (2026-03-27) |
| Zerfoo | v0.x | 256 | 244.45 | Yes | Single-model baseline (2026-03-20) |
| Zerfoo | v0.x | 256 | 174.44 | No | Without CUDA graph capture |
| Ollama | 0.17.7 | 128 | 188 | N/A | Multi-model benchmark (2026-03-27) |
| llama.cpp | b5220+ | 256 | ~210-230 | No | Estimated from community reports on GB10-class hardware |
Summary:
- Zerfoo with CUDA graphs: 241 tok/s (+28% vs Ollama)
- Zerfoo without CUDA graphs: 174 tok/s (CUDA graph capture adds +35%)
- Ollama: 188 tok/s (uses llama.cpp under the hood with its own overhead)
Note on llama.cpp numbers: Direct llama.cpp measurements on this exact DGX Spark unit are pending. The estimate above is based on published community benchmarks for GB10 / Blackwell-class hardware with Gemma 3 1B Q4_K_M. We will update this table when we complete our own llama.cpp runs.
Why Zerfoo Is Faster#
CUDA graph capture (99.5% coverage): The entire decode step (26 transformer layers, attention, FFN, norms) is captured as a single CUDA graph. This eliminates per-kernel launch overhead (~5-10 us per launch x hundreds of kernels per token) and lets the GPU execute the full pipeline without returning control to the host.
Fused kernels: Operations that are separate kernel launches in other frameworks are fused in Zerfoo:
FusedAddRMSNorm(residual addition + RMS normalization in one pass)FusedQKNormRoPE(QK normalization + rotary position embeddings)FusedSiluGate(SiLU activation + gating in the FFN)- Merged QKV and Gate+Up projections (single GEMV instead of 2-3 separate)
Zero CGo overhead: GPU bindings use purego/dlopen instead of CGo. This avoids the ~200 ns per CGo call overhead that accumulates across thousands of CUDA API calls per token.
Optimized Q4_0 GEMV: The quantized matrix-vector multiply kernel is hand-tuned for the decode path with coalesced memory access patterns and efficient warp-level reductions.
Expected Results by GPU Class#
| GPU | Zerfoo (est.) | Notes |
|---|---|---|
| DGX Spark GB10 | 241 tok/s | Measured (Gemma 3 1B, 2026-03-27) |
| RTX 4090 | TBD | Community contributions welcome |
| RTX 3090 | TBD | Community contributions welcome |
| A100 80GB | TBD | Community contributions welcome |
| Apple M-series (CPU) | ~8-15 tok/s | Metal backend not yet implemented |
Vision Models#
Vision model benchmarks use synthetic weights with small dimensions for CI, and full GGUF models for hardware throughput validation.
| Model | Test | Status | Env Var |
|---|---|---|---|
| LLaVA | BenchmarkLLaVA_Throughput | Synthetic (CI) | - |
| LLaVA | TestLLaVA_VisionPipeline | Full model | LLAVA_GGUF_PATH |
| Qwen-VL | BenchmarkQwenVL_Throughput | Synthetic (CI) | - |
| Qwen-VL | TestQwenVL_VisionPipeline | Full model | QWENVL_GGUF_PATH |
# Run synthetic benchmarks
go test -bench BenchmarkLLaVA -count=1 ./tests/parity/
go test -bench BenchmarkQwenVL -count=1 ./tests/parity/
# Run full-model vision pipeline tests (requires GGUF files)
LLAVA_GGUF_PATH=/path/to/llava.gguf go test -run TestLLaVA_VisionPipeline -count=1 -v ./tests/parity/
QWENVL_GGUF_PATH=/path/to/qwenvl.gguf go test -run TestQwenVL_VisionPipeline -count=1 -v ./tests/parity/Performance Milestones#
| Date | Milestone | Tok/s | Notes |
|---|---|---|---|
| 2026-03-31 | Multi-model benchmark (3-run median) | 241 | +28% vs Ollama (188 tok/s) |
| 2026-03-17 | dp4a + arena reuse | 245.15 | Parity at batch=1 (memory-bound); dp4a benefits at larger batches |
| 2026-03-17 | Q4_0 re-quant restored | 244.99 | +32% vs regression |
| 2026-03-14 | CUDA graph capture | 234.30 | +26% vs non-graph baseline |
| 2026-03-13 | GPU-first pipeline | 6.84 | +33.6% from D2H elimination |
| 2026-03-13 | Graph compilation | 6.86 | +5% from worker pool |
| 2026-03-12 | NEON SIMD | 8.15 | +18.8% CPU acceleration |
| 2026-03-12 | CPU baseline | 6.5 | parallelFor + xblas |
| 2026-03-11 | Initial GPU | 5.12 | 43% cgocall overhead |
| 2026-03-10 | Initial CPU | 3.60 | Gemma 3 2B Q4 |
Methodology#
Every performance claim in this project is backed by a reproducible benchmark run. This section describes the measurement procedure so that anyone can independently verify the numbers.
Software Versions#
Always record exact versions alongside every benchmark result:
| Component | Version |
|---|---|
| Go | 1.25.0 |
| CUDA Toolkit | 13.0 |
| Zerfoo | latest main (record commit hash with each run) |
| ztensor | v0.1.0 (see go.mod) |
| ztoken | v0.1.0 (see go.mod) |
| float16 | v0.2.0 |
| float8 | v0.2.0 |
Measurement Procedure#
Warm-up phase – Run a short generation (16-32 tokens) to warm up the GPU, populate caches, and trigger JIT compilation / CUDA graph capture. Discard these results.
Measurement window – Generate at least 256 tokens in decode mode. Measure wall-clock time from the first decode token to the last.
Decode-only measurement – Report only decode throughput (tokens per second). Prefill / prompt-processing time is excluded from the tok/s number.
CUDA graph coverage – Record the percentage of operations captured in CUDA graphs. The current baseline achieves 99.5% coverage.
Repeat – Run the benchmark at least 3 times and report the median result.
Benchmark Commands#
Zerfoo:
go run ./cmd/bench_tps \
-model /path/to/gemma-3-1b-q4_k_m.gguf \
-tokens 256 \
-prompt "The quick brown fox"| Flag | Default | Description |
|---|---|---|
-model | (required) | Path to GGUF model file |
-tokens | 64 | Number of tokens to generate |
-prompt | "" | Input prompt text |
For official benchmarks, always use -tokens 256 or higher.
Ollama:
ollama run gemma3:1b --verbose "The quick brown fox" 2>&1 | grep "eval rate"llama.cpp:
./build/bin/llama-bench \
-m /path/to/gemma-3-1b-it-Q4_K_M.gguf \
-p 0 -n 256 -ngl 99The -p 0 flag skips prompt processing to measure pure decode throughput.
-ngl 99 offloads all layers to GPU.
How to Reproduce from Scratch#
# 1. Clone the repo
git clone https://github.com/zerfoo/zerfoo.git
cd zerfoo
# 2. Ensure Go 1.26+ is installed
go version
# 3. Download dependencies
go mod tidy
# 4. Download the model (Gemma 3 1B Q4_K_M GGUF from HuggingFace)
# Place it at a known path, e.g. ~/models/gemma-3-1b-q4_k_m.gguf
# 5. Run the benchmark (3 times, take the median)
go run ./cmd/bench_tps -model ~/models/gemma-3-1b-q4_k_m.gguf -tokens 256
go run ./cmd/bench_tps -model ~/models/gemma-3-1b-q4_k_m.gguf -tokens 256
go run ./cmd/bench_tps -model ~/models/gemma-3-1b-q4_k_m.gguf -tokens 256
# 6. Record: median tok/s, CUDA graph %, commit hash
git rev-parse HEADComparison Methodology#
When comparing against other runtimes (llama.cpp, Ollama, vLLM, etc.):
- Same hardware – Run both on the same machine.
- Same model – Use the identical GGUF file (or equivalent quantization).
- Same token count – Generate the same number of tokens (256+).
- Same prompt – Use the same input prompt.
- Decode-only – Compare decode tok/s, not end-to-end latency.
- Warm up both – Give the competing runtime the same warm-up opportunity.
- Report versions – Record exact version/commit of the competing runtime.
Running Your Own Benchmarks#
To get a fair comparison on your hardware:
- Use the same model file. All three frameworks read GGUF, so use the
exact same
.gguffile for each run. - Match token counts. Set all frameworks to generate the same number of tokens (e.g., 256).
- Warm up. Run at least 3 warm-up iterations before measuring.
- Isolate the GPU. Close other GPU workloads. On Linux, check with
nvidia-smithat no other processes are using the GPU. - Report decode throughput. All numbers in this guide are decode throughput (tokens per second during autoregressive generation), not prompt processing (prefill) speed.
- Record your environment. Report: GPU model, CUDA version, driver version, CPU, RAM, OS, and framework version/commit hash.
Contributing Benchmarks#
We welcome benchmark contributions from the community. To submit results:
- Run all three frameworks on the same hardware using the methodology above.
- Open an issue or PR with your results, including full hardware and software version details.
- Include the raw JSON output from
cmd/bench --output results.jsonfor Zerfoo runs.