Train, run, and serve ML models in your Go application. One import, GPU-accelerated at runtime, no C compiler needed.
Load a model from HuggingFace, generate text, stream tokens — all with idiomatic Go.
```go
m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

response, _ := m.Chat("Explain Go interfaces in one sentence.")
fmt.Println(response)
```
```go
m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

ch, _ := m.ChatStream(ctx, "Tell me a joke.")
for tok := range ch {
	if !tok.Done {
		fmt.Print(tok.Text)
	}
}
```
```go
m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

vecs, _ := m.Embed([]string{
	"Go is statically typed.",
	"Rust has a borrow checker.",
})
score := vecs[0].CosineSimilarity(vecs[1])
fmt.Printf("similarity: %.4f\n", score)
```
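The score here is standard cosine similarity. For reference, a self-contained scalar version of the same computation (independent of Zerfoo's vector type):

```go
package main

import (
	"fmt"
	"math"
)

// CosineSimilarity returns the cosine of the angle between two
// equal-length vectors: dot(a,b) / (||a|| * ||b||).
func CosineSimilarity(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	a := []float32{0.1, 0.9, 0.2}
	b := []float32{0.2, 0.8, 0.1}
	fmt.Printf("similarity: %.4f\n", CosineSimilarity(a, b))
}
```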
```go
m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

schema := grammar.JSONSchema{
	Type: "object",
	Properties: map[string]*grammar.JSONSchema{
		"name": {Type: "string"},
		"age":  {Type: "number"},
	},
}

result, _ := m.Generate(ctx, "Generate a person named Alice.",
	zerfoo.WithSchema(schema))
// {"name": "Alice", "age": 30}
```
Everything you need to build, train, and deploy ML models in your Go services.
`go build ./...` compiles everywhere. GPU acceleration is loaded dynamically at runtime via dlopen. No C compiler, no build tags, no CUDA toolkit at build time.
25+ custom CUDA kernels, CUDA graph capture covering 99.5% of decode instructions, and fused operations (RoPE, SwiGLU, AddRMSNorm, QKNormRoPE). ROCm and OpenCL backends available.
Import `github.com/zerfoo/zerfoo` and call `model.Chat()`. No sidecar process, no HTTP boundary, no Python runtime. ML inference is a function call.
Built-in HTTP server with `/v1/chat/completions`, SSE streaming, embeddings, Prometheus metrics, and health checks. Drop-in replacement for OpenAI clients.
Grammar-guided decoding constrains model output to valid JSON matching your schema. Tool calling with OpenAI-compatible function detection built in.
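At its core, grammar-guided decoding is logit masking: at each step, tokens the grammar cannot accept are excluded before sampling. A toy sketch of that idea (not Zerfoo's actual decoder, which tracks grammar state incrementally):

```go
package main

import "fmt"

// maskedArgmax picks the highest-scoring token whose id the grammar
// currently allows, skipping every disallowed id.
func maskedArgmax(logits []float32, allowed map[int]bool) int {
	best, bestScore := -1, float32(0)
	for id, score := range logits {
		if !allowed[id] {
			continue // grammar forbids this token here
		}
		if best == -1 || score > bestScore {
			best, bestScore = id, score
		}
	}
	return best
}

func main() {
	logits := []float32{3.0, 9.0, 1.5, 7.2}
	// Suppose the JSON grammar only accepts tokens 0 and 3 next
	// (e.g. `{` or a whitespace token); token 1 is blocked despite
	// having the highest raw score.
	allowed := map[int]bool{0: true, 3: true}
	fmt.Println(maskedArgmax(logits, allowed)) // prints 3
}
```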
Go 1.26 generics throughout — tensor.Numeric constraint for compile-time type safety across float32, float16, bfloat16, float8, and quantized types.
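The shape of such a constraint, sketched with a simplified stand-in for `tensor.Numeric` (the real constraint also admits float16, bfloat16, float8, and quantized element types, which need their own Go representations):

```go
package main

import "fmt"

// Numeric is a simplified element-type constraint; Zerfoo's real
// tensor.Numeric also covers half-precision and quantized types.
type Numeric interface {
	~float32 | ~float64 | ~int32 | ~int64
}

// Dot works over any Numeric element type, checked at compile time:
// mixing element types is a build error, not a runtime panic.
func Dot[T Numeric](a, b []T) T {
	var sum T
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

func main() {
	fmt.Println(Dot([]float32{1, 2, 3}, []float32{4, 5, 6})) // 32
	fmt.Println(Dot([]int64{2, 3}, []int64{4, 5}))           // 23
}
```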
`zerfoo.Load("google/gemma-3-4b")` downloads and caches models automatically. Specify quantization variants with a `/Q8_0` suffix.
TLS/mTLS, graceful shutdown, structured logging, Prometheus metrics, health checks, and distributed training via gRPC/NCCL. Built for real workloads.
Hand-written ARM NEON and x86 AVX2 assembly for CPU-bound operations — GEMM, RMSNorm, RoPE, SiLU, softmax. Competitive CPU performance without a GPU.
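For reference, here is what the RMSNorm kernel computes, in scalar Go; the NEON/AVX2 paths vectorize exactly this loop:

```go
package main

import (
	"fmt"
	"math"
)

// RMSNorm scales x by the inverse root-mean-square of its elements,
// then applies a learned per-channel weight: y_i = w_i * x_i / rms(x).
func RMSNorm(x, weight []float32, eps float32) []float32 {
	var ss float64
	for _, v := range x {
		ss += float64(v) * float64(v)
	}
	inv := float32(1.0 / math.Sqrt(ss/float64(len(x))+float64(eps)))
	out := make([]float32, len(x))
	for i, v := range x {
		out[i] = weight[i] * v * inv
	}
	return out
}

func main() {
	x := []float32{1, 2, 3, 4}
	w := []float32{1, 1, 1, 1}
	fmt.Println(RMSNorm(x, w, 1e-6))
}
```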
Built-in EAGLE draft head with integrated training. Speculative decoding accelerates generation by drafting multiple tokens per step and verifying in parallel.
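The verification step at the heart of speculative decoding is simple: the target model scores all drafted tokens in one forward pass, and the longest prefix where draft and target agree is accepted. A greedy-acceptance toy sketch (real EAGLE verification samples from adjusted distributions rather than comparing argmaxes):

```go
package main

import "fmt"

// acceptDraft returns how many drafted tokens survive verification:
// the length of the longest prefix where the target model's greedy
// choice matches the draft head's proposal.
func acceptDraft(draft, targetGreedy []int) int {
	n := 0
	for i := range draft {
		if draft[i] != targetGreedy[i] {
			break // first disagreement: the target takes over here
		}
		n++
	}
	return n
}

func main() {
	draft := []int{14, 92, 7, 55}  // tokens proposed by the draft head
	target := []int{14, 92, 3, 55} // target model's choice at each position
	fmt.Println(acceptDraft(draft, target)) // prints 2: two tokens accepted this step
}
```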
Fused dequantize-and-multiply kernel for Q4_K quantized weights — 14x faster than the unfused path. Keeps decode throughput high on quantized models.
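The idea behind fusion, in scalar form: instead of materializing a dequantized weight buffer and then multiplying, each 4-bit code is decoded and consumed in the same loop. This sketch uses a deliberately simplified blockwise 4-bit scheme, not the real Q4_K layout:

```go
package main

import "fmt"

// fusedDot dequantizes packed 4-bit weights and accumulates the dot
// product with x in one pass, never allocating a float weight buffer.
// Each byte packs two 4-bit codes; code c maps back to (c-8)*scale
// (a simplified scheme, unlike real Q4_K's per-block scales and mins).
func fusedDot(packed []byte, scale float32, x []float32) float32 {
	var sum float32
	for i, b := range packed {
		lo := float32(int(b&0x0F)-8) * scale // dequantize low nibble
		hi := float32(int(b>>4)-8) * scale   // dequantize high nibble
		sum += lo*x[2*i] + hi*x[2*i+1]       // consume immediately
	}
	return sum
}

func main() {
	// Two bytes = four 4-bit codes: 9,8,12,4 -> weights 1,0,4,-4 at scale 1.
	packed := []byte{0x89, 0x4C}
	x := []float32{1, 1, 1, 1}
	fmt.Println(fusedDot(packed, 1.0, x)) // prints 1
}
```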
Multi-LoRA per-request serving, quantized KV cache (Q4/Q3), and hybrid CPU/GPU MoE routing. Production-grade features for multi-tenant deployments.
GGUF files are memory-mapped by default. Tensor data is paged from NVMe on demand — the OS handles it. A 229B-parameter MiniMax-M2 (138 GB across 3 shards) runs on a 128 GB machine with no flags and no configuration.
Benchmarked on NVIDIA DGX Spark (GB10), CUDA 13.0, Go 1.26. Gemma 3 1B Q4_K_M, 256 tokens.
| Runtime | Throughput | Notes |
|---|---|---|
| Zerfoo | 241 tok/s | Pure Go, zero CGo, CUDA graph capture, fused kernels |
| Ollama | — | llama.cpp C++ backend, same model, same hardware |
| Date | Milestone | Tok/s | Improvement |
|---|---|---|---|
| Mar 27 | Multi-model benchmark (3-run median) | 241 | +28% vs Ollama |
| Mar 17 | Q4_0 re-quant restored | 245 | +32% vs regression |
| Mar 14 | CUDA graph capture | 234 | +26% vs non-graph |
| Mar 13 | GPU-first pipeline | 103 | D2H elimination |
| Mar 12 | ARM NEON SIMD | 8.15 | +18.8% CPU accel |
| Mar 10 | Initial CPU baseline | 3.60 | Starting point |
GGUF files are memory-mapped by default. Tensor data pages from NVMe on demand — no flags needed. Split GGUF shards are detected and mapped automatically.
| Model | Params | File size | RAM available | Status |
|---|---|---|---|---|
| MiniMax-M2 Q4_K_M | 229B MoE | 138 GB (3 shards) | 128 GB | Loads and generates text |
41 architectures across 25 model families. Load any GGUF model from HuggingFace.
Uses GGUF as the sole model format. Compatible with llama.cpp, Ollama, LM Studio, and GPT4All model files.
Pull models, run inference, and serve an OpenAI-compatible API from the command line.
```shell
# Install
$ go install github.com/zerfoo/zerfoo/cmd/zerfoo@latest

# Pull a model from HuggingFace
$ zerfoo pull gemma-3-1b-q4

# Interactive chat
$ zerfoo run gemma-3-1b-q4
Model loaded. Type your message (Ctrl-D to quit).
> What is the capital of France?
The capital of France is Paris.

# OpenAI-compatible API server
$ zerfoo serve gemma-3-1b-q4 --port 8080

# QuaRot weight fusion for uniform 4-bit quantization
$ zerfoo run --quarot model.gguf

# Train an EAGLE speculative decoding head
$ zerfoo eagle-train --model model.gguf --corpus data.txt --output eagle.gguf

# Convert MHA model to Multi-head Latent Attention
$ zerfoo transmla --input model.gguf --output model-mla.gguf

# Multi-LoRA serving (per-request adapter selection)
$ curl http://localhost:8080/v1/chat/completions \
    -d '{"model":"gemma3-1b:my-lora","messages":[{"role":"user","content":"Hello!"}]}'
```
Six focused Go modules that compose together.
Core ML framework: inference, training, serving, and the OpenAI-compatible API.
GPU-accelerated tensor library, compute engine, computation graph, and CUDA kernels.
BPE tokenizer with HuggingFace compatibility. Zero external dependencies.
IEEE 754 half-precision (Float16) and BFloat16 arithmetic library.
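BFloat16 keeps float32's 8-bit exponent and truncates the mantissa to 7 bits, so conversion is bit surgery on the top half of a float32. A minimal round-to-nearest-even sketch of the underlying math (the library's actual API will differ):

```go
package main

import (
	"fmt"
	"math"
)

// F32ToBF16 narrows a float32 to bfloat16 with round-to-nearest-even:
// bfloat16 is simply the high 16 bits of the float32 bit pattern.
func F32ToBF16(f float32) uint16 {
	bits := math.Float32bits(f)
	// Add half of the dropped range, breaking ties toward even.
	rounding := uint32(0x7FFF) + ((bits >> 16) & 1)
	return uint16((bits + rounding) >> 16)
}

// BF16ToF32 widens bfloat16 back by zero-filling the low mantissa bits.
func BF16ToF32(b uint16) float32 {
	return math.Float32frombits(uint32(b) << 16)
}

func main() {
	x := float32(3.1415926)
	b := F32ToBF16(x)
	fmt.Printf("%.5f -> 0x%04X -> %.5f\n", x, b, BF16ToF32(b))
}
```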
FP8 E4M3FN arithmetic for quantized inference workloads.
ONNX-to-GGUF converter. Pure Go, CGo-free, standalone binary.
Deep dives into the engineering behind Zerfoo.
CUDA graph capture and fused kernels took Zerfoo from 186 tok/s to 241 tok/s. A deep dive into making the decode path GPU-only.
How Zerfoo runs GPU-accelerated inference without a single CGo call. dlopen, purego, and assembly trampolines.
From a custom protobuf format to GGUF — and deleting 7,500 lines of code in the process.
From zero to LLM inference without changing your build, deployment, or architecture.
One import. One function call. No Python, no CGo, no sidecar.