Benchmarks#

Performance measurements, comparisons with other runtimes, and methodology for reproducing all benchmark results.

Hardware#

All official benchmarks run on a single machine:

| Component | Spec |
|---|---|
| System | NVIDIA DGX Spark |
| SoC | NVIDIA Grace Blackwell GB10 (sm_121) |
| Memory | 128 GB unified LPDDR5x (~200 GB/s bandwidth) |
| CUDA | 13.0 |
| OS | Linux (aarch64) |
| Go | 1.25.0 |

Model#

| Property | Value |
|---|---|
| Model | Gemma 3 1B Instruct |
| Format | GGUF |
| Quantization | Q4_K_M |
| File size | ~0.8 GB |
| Source | google/gemma-3-1b-it-GGUF |

Results#

Multi-Model Comparison: Zerfoo vs Ollama (2026-03-27)#

Head-to-head decode throughput on DGX Spark GB10. 128 tokens, 3 runs (median), greedy sampling (temp=0). All models produce coherent output after the GQA repeat fix (ztensor v0.6.3) and flash attention decode fix (zerfoo v1.25.5).

| Model | Size | Zerfoo (tok/s) | Ollama (tok/s) | Ratio | Winner |
|---|---|---|---|---|---|
| Gemma 3 1B Q4_K_M | 1B | 241 | 188 | 1.28x | Zerfoo |
| DeepSeek R1 1.5B Q4_K_M | 1.5B | 186 | 167 | 1.11x | Zerfoo |
| Llama 3.2 3B Q4_K_M | 3B | 92 | 93 | 0.99x | ~Even |
| Mistral 7B Q5_K_M | 7B | 44 | 44 | 1.00x | ~Even |

Summary: up to 28% faster on small models, parity at 7B. All four models now produce coherent output with CUDA graph capture enabled.

Additional architectures (Qwen, Phi, Mixtral, Command-R, Falcon, Mamba, RWKV) will be added as GGUF files are acquired and parser compatibility is resolved.

Gemma 3 1B Baseline (2026-03-20)#

| Model | Format | Tok/s | CUDA Graph | Tokens | Notes |
|---|---|---|---|---|---|
| Gemma 3 1B | Q4_K_M | 244.45 | Yes | 256 | Regression fixed (E103) |
| Gemma 3 1B | Q4_K_M | 244.18 | Yes | 256 | Run 2 |
| Gemma 3 1B | Q4_K_M | 244.62 | Yes | 256 | Run 3 |

Roofline analysis: at the GB10's ~200 GB/s LPDDR5x bandwidth, streaming the 778 MB model once per generated token caps decode at ~257 tok/s. The current 244 tok/s is ~95% of that ceiling, so reaching the 500 tok/s target requires hardware with higher memory bandwidth (e.g., A100/H100).
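
The roofline bound is simple arithmetic: batch=1 decode must read every weight once per token, so memory bandwidth divided by model size gives the ceiling. A minimal sketch using the GB10 figures from this page:

```go
package main

import "fmt"

// rooflineTokS returns the bandwidth-bound decode ceiling: each
// generated token streams the full weight file from memory once,
// so tok/s <= bandwidth / model size.
func rooflineTokS(bandwidthGBs, modelGB float64) float64 {
	return bandwidthGBs / modelGB
}

func main() {
	bound := rooflineTokS(200, 0.778) // GB10: ~200 GB/s, 778 MB model
	fmt.Printf("ceiling: %.0f tok/s\n", bound)         // ceiling: 257 tok/s
	fmt.Printf("utilization: %.0f%%\n", 244/bound*100) // utilization: 95%
}
```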

Previous Baseline (2026-03-17)#

| Model | Format | Tok/s | CUDA Graph | Tokens |
|---|---|---|---|---|
| Gemma 3 1B | Q4_K_M | 219.17 | Yes | 50 |
| Gemma 3 1B | Q4_K_M | 245.15 | Yes | 256 |
| Gemma 3 1B | Q4_K_M | 248.47 | Yes | 512 |
| Gemma 3 1B | Q4_K_M | 174.44 | No | 256 |
| Ollama gemma3:1b | - | 203.60 | - | 989 |

Note: dp4a INT8 GEMV and arena free-list reuse were merged into ztensor. At batch=1 decode, throughput is memory-bandwidth-bound, so dp4a shows parity as expected; its benefits will appear at larger batch sizes where compute becomes the bottleneck.

Previous Baseline (2026-03-16)#

| Model | Format | Tok/s | CUDA Graph % | Output Quality | Tokens |
|---|---|---|---|---|---|
| Gemma 3 1B | GGUF Q4_0 | 103.22 (CUDA) / 9.20 (CPU) | 99.5% | Coherent | 50 |
| Gemma 3 1B FP16 | GGUF Q4_0 + FP16 | 9.20 (CPU) | - | Valid | 30 |
| Gemma 3 1B FP8 | GGUF Q4_0 + FP8 | ~0.5 (CPU) | - | Valid | 30 |
| TinyLlama 1.1B | GGUF Q4_K_M | 7.18 (CPU) | - | Low (small model) | 50 |
| Qwen 2.5 0.5B | GGUF Q4_K_M | ~13 (CPU) | - | Garbled (tokenizer bug) | 50 |
| Mistral 7B | GGUF Q4_K_M | ~0.55 (CPU) | - | Low (loads as llama) | 50 |
| Phi-3.5 mini | GGUF Q4_K_M | - | - | FAIL (merged QKV) | - |

Previous Baseline (2026-03-15)#

| Model | Format | Tok/s | CUDA Graph % | Output Quality | Tokens |
|---|---|---|---|---|---|
| Gemma 3 1B | GGUF Q4_K | 241 | 99.5% | Baseline | 256 |
| Llama 3 1B | GGUF | 12.93 | 2.0% | Semi-coherent | 20 |
| Qwen 2.5 0.5B | GGUF | 15.79 | 1.8% | Working (rep. penalty helps) | 20 |
| Mistral 7B | GGUF | 3.94 | 1.2% | Working (spaces fixed) | 20 |
| Phi-3 mini | GGUF | 4.14 | 0.5% | Semi-coherent | 20 |

Comparison: Zerfoo vs Ollama vs llama.cpp#

Gemma 3 1B (Primary Benchmark)#

| Framework | Version | Tokens | Tok/s (decode) | CUDA Graphs | Notes |
|---|---|---|---|---|---|
| Zerfoo | latest | 128 | 241 | Yes | Multi-model benchmark (2026-03-27) |
| Zerfoo | v0.x | 256 | 244.45 | Yes | Single-model baseline (2026-03-20) |
| Zerfoo | v0.x | 256 | 174.44 | No | Without CUDA graph capture |
| Ollama | 0.17.7 | 128 | 188 | N/A | Multi-model benchmark (2026-03-27) |
| llama.cpp | b5220+ | 256 | ~210-230 | No | Estimated from community reports on GB10-class hardware |

Summary:

  • Zerfoo with CUDA graphs: 241 tok/s (+28% vs Ollama)
  • Zerfoo without CUDA graphs: 174 tok/s (CUDA graph capture adds ~38%)
  • Ollama: 188 tok/s (uses llama.cpp under the hood with its own overhead)

Note on llama.cpp numbers: Direct llama.cpp measurements on this exact DGX Spark unit are pending. The estimate above is based on published community benchmarks for GB10 / Blackwell-class hardware with Gemma 3 1B Q4_K_M. We will update this table when we complete our own llama.cpp runs.

Why Zerfoo Is Faster#

  1. CUDA graph capture (99.5% coverage): The entire decode step (26 transformer layers, attention, FFN, norms) is captured as a single CUDA graph. This eliminates per-kernel launch overhead (~5-10 us per launch x hundreds of kernels per token) and lets the GPU execute the full pipeline without returning control to the host.

  2. Fused kernels: Operations that are separate kernel launches in other frameworks are fused in Zerfoo:

    • FusedAddRMSNorm (residual addition + RMS normalization in one pass)
    • FusedQKNormRoPE (QK normalization + rotary position embeddings)
    • FusedSiluGate (SiLU activation + gating in the FFN)
    • Merged QKV and Gate+Up projections (single GEMV instead of 2-3 separate)

  3. Zero cgo overhead: GPU bindings use purego/dlopen instead of cgo. This avoids the ~200 ns per cgo call overhead that accumulates across thousands of CUDA API calls per token.

  4. Optimized Q4_0 GEMV: The quantized matrix-vector multiply kernel is hand-tuned for the decode path with coalesced memory access patterns and efficient warp-level reductions.
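
As a reference for what the GEMV in item 4 does per weight, here is a CPU sketch of Q4_0 dequantization fused into a dot product. It follows the public GGUF Q4_0 layout (32 weights per block, one shared scale, 4-bit quants offset by 8), but it is only an illustrative sketch: the real format stores the scale as fp16, and Zerfoo's actual kernel is hand-tuned CUDA, not this Go loop.

```go
package main

import "fmt"

// Q4_0 packs 32 weights per block: a scale plus 16 bytes of 4-bit
// quants. The real GGUF format stores the scale as fp16; float32 is
// used here to keep the sketch dependency-free.
type blockQ4_0 struct {
	d  float32  // scale
	qs [16]byte // 32 x 4-bit quants (low nibbles first, then high)
}

// dequant expands one block into 32 float32 weights: w = (q - 8) * d.
func dequant(b blockQ4_0) [32]float32 {
	var out [32]float32
	for j, q := range b.qs {
		out[j] = float32(int(q&0x0F)-8) * b.d
		out[j+16] = float32(int(q>>4)-8) * b.d
	}
	return out
}

// gemvRow computes one output element: dot(quantized row, x). A fused
// GPU kernel interleaves the dequantization with the dot product the
// same way, so weights are read once and never materialized as floats.
func gemvRow(row []blockQ4_0, x []float32) float32 {
	var sum float32
	for i, b := range row {
		w := dequant(b)
		for j, wj := range w {
			sum += wj * x[i*32+j]
		}
	}
	return sum
}

func main() {
	// One block whose quants are all 9, so each weight is (9-8)*d = 0.5.
	b := blockQ4_0{d: 0.5}
	for i := range b.qs {
		b.qs[i] = 0x99 // low nibble 9, high nibble 9
	}
	x := make([]float32, 32)
	for i := range x {
		x[i] = 1
	}
	fmt.Println(gemvRow([]blockQ4_0{b}, x)) // 32 weights * 0.5 * 1 = 16
}
```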

Expected Results by GPU Class#

| GPU | Zerfoo (est.) | Notes |
|---|---|---|
| DGX Spark GB10 | 241 tok/s | Measured (Gemma 3 1B, 2026-03-27) |
| RTX 4090 | TBD | Community contributions welcome |
| RTX 3090 | TBD | Community contributions welcome |
| A100 80GB | TBD | Community contributions welcome |
| Apple M-series (CPU) | ~8-15 tok/s | Metal backend not yet implemented |

Vision Models#

Vision model benchmarks use synthetic weights with small dimensions for CI, and full GGUF models for hardware throughput validation.

| Model | Test | Status | Env Var |
|---|---|---|---|
| LLaVA | BenchmarkLLaVA_Throughput | Synthetic (CI) | - |
| LLaVA | TestLLaVA_VisionPipeline | Full model | LLAVA_GGUF_PATH |
| Qwen-VL | BenchmarkQwenVL_Throughput | Synthetic (CI) | - |
| Qwen-VL | TestQwenVL_VisionPipeline | Full model | QWENVL_GGUF_PATH |

# Run synthetic benchmarks
go test -bench BenchmarkLLaVA -count=1 ./tests/parity/
go test -bench BenchmarkQwenVL -count=1 ./tests/parity/

# Run full-model vision pipeline tests (requires GGUF files)
LLAVA_GGUF_PATH=/path/to/llava.gguf go test -run TestLLaVA_VisionPipeline -count=1 -v ./tests/parity/
QWENVL_GGUF_PATH=/path/to/qwenvl.gguf go test -run TestQwenVL_VisionPipeline -count=1 -v ./tests/parity/

Performance Milestones#

| Date | Milestone | Tok/s | Notes |
|---|---|---|---|
| 2026-03-27 | Multi-model benchmark (3-run median) | 241 | +28% vs Ollama (188 tok/s) |
| 2026-03-17 | dp4a + arena reuse | 245.15 | Parity at batch=1 (memory-bound); dp4a benefits at larger batches |
| 2026-03-17 | Q4_0 re-quant restored | 244.99 | +32% vs regression |
| 2026-03-14 | CUDA graph capture | 234.30 | +26% vs non-graph baseline |
| 2026-03-13 | GPU-first pipeline | 6.84 | +33.6% from D2H elimination |
| 2026-03-13 | Graph compilation | 6.86 | +5% from worker pool |
| 2026-03-12 | NEON SIMD | 8.15 | +18.8% CPU acceleration |
| 2026-03-12 | CPU baseline | 6.5 | parallelFor + xblas |
| 2026-03-11 | Initial GPU | 5.12 | 43% cgocall overhead |
| 2026-03-10 | Initial CPU | 3.60 | Gemma 3 2B Q4 |

Methodology#

Every performance claim in this project is backed by a reproducible benchmark run. This section describes the measurement procedure so that anyone can independently verify the numbers.

Software Versions#

Always record exact versions alongside every benchmark result:

| Component | Version |
|---|---|
| Go | 1.25.0 |
| CUDA Toolkit | 13.0 |
| Zerfoo | latest main (record commit hash with each run) |
| ztensor | v0.1.0 (see go.mod) |
| ztoken | v0.1.0 (see go.mod) |
| float16 | v0.2.0 |
| float8 | v0.2.0 |

Measurement Procedure#

  1. Warm-up phase – Run a short generation (16-32 tokens) to warm up the GPU, populate caches, and trigger JIT compilation / CUDA graph capture. Discard these results.

  2. Measurement window – Generate at least 256 tokens in decode mode. Measure wall-clock time from the first decode token to the last.

  3. Decode-only measurement – Report only decode throughput (tokens per second). Prefill / prompt-processing time is excluded from the tok/s number.

  4. CUDA graph coverage – Record the percentage of operations captured in CUDA graphs. The current baseline achieves 99.5% coverage.

  5. Repeat – Run the benchmark at least 3 times and report the median result.

Benchmark Commands#

Zerfoo:

go run ./cmd/bench_tps \
  -model /path/to/gemma-3-1b-q4_k_m.gguf \
  -tokens 256 \
  -prompt "The quick brown fox"

| Flag | Default | Description |
|---|---|---|
| -model | (required) | Path to GGUF model file |
| -tokens | 64 | Number of tokens to generate |
| -prompt | "" | Input prompt text |

For official benchmarks, always use -tokens 256 or higher.

Ollama:

ollama run gemma3:1b --verbose "The quick brown fox" 2>&1 | grep "eval rate"

llama.cpp:

./build/bin/llama-bench \
  -m /path/to/gemma-3-1b-it-Q4_K_M.gguf \
  -p 0 -n 256 -ngl 99

The -p 0 flag skips prompt processing to measure pure decode throughput. -ngl 99 offloads all layers to GPU.

How to Reproduce from Scratch#

# 1. Clone the repo
git clone https://github.com/zerfoo/zerfoo.git
cd zerfoo

# 2. Ensure Go 1.25+ is installed
go version

# 3. Download dependencies
go mod tidy

# 4. Download the model (Gemma 3 1B Q4_K_M GGUF from HuggingFace)
#    Place it at a known path, e.g. ~/models/gemma-3-1b-q4_k_m.gguf

# 5. Run the benchmark (3 times, take the median)
go run ./cmd/bench_tps -model ~/models/gemma-3-1b-q4_k_m.gguf -tokens 256
go run ./cmd/bench_tps -model ~/models/gemma-3-1b-q4_k_m.gguf -tokens 256
go run ./cmd/bench_tps -model ~/models/gemma-3-1b-q4_k_m.gguf -tokens 256

# 6. Record: median tok/s, CUDA graph %, commit hash
git rev-parse HEAD

Comparison Methodology#

When comparing against other runtimes (llama.cpp, Ollama, vLLM, etc.):

  1. Same hardware – Run both on the same machine.
  2. Same model – Use the identical GGUF file (or equivalent quantization).
  3. Same token count – Generate the same number of tokens (256+).
  4. Same prompt – Use the same input prompt.
  5. Decode-only – Compare decode tok/s, not end-to-end latency.
  6. Warm up both – Give the competing runtime the same warm-up opportunity.
  7. Report versions – Record exact version/commit of the competing runtime.

Running Your Own Benchmarks#

To get a fair comparison on your hardware:

  1. Use the same model file. All three frameworks read GGUF, so use the exact same .gguf file for each run.
  2. Match token counts. Set all frameworks to generate the same number of tokens (e.g., 256).
  3. Warm up. Run at least 3 warm-up iterations before measuring.
  4. Isolate the GPU. Close other GPU workloads. On Linux, check with nvidia-smi that no other processes are using the GPU.
  5. Report decode throughput. All numbers in this guide are decode throughput (tokens per second during autoregressive generation), not prompt processing (prefill) speed.
  6. Record your environment. Report: GPU model, CUDA version, driver version, CPU, RAM, OS, and framework version/commit hash.

Contributing Benchmarks#

We welcome benchmark contributions from the community. To submit results:

  1. Run all three frameworks on the same hardware using the methodology above.
  2. Open an issue or PR with your results, including full hardware and software version details.
  3. Include the raw JSON output from cmd/bench --output results.json for Zerfoo runs.