Machine learning for Go.
Pure Go. Zero CGo.

Train, run, and serve ML models in your Go application. One import, GPU-accelerated at runtime, no C compiler needed.

245 tok/s · Gemma 3 1B Q4_K_M
+20% · faster than Ollama
99.5% · CUDA graph coverage
0 · CGo calls
$ go get github.com/zerfoo/zerfoo

Up and running in seconds

Load a model from HuggingFace, generate text, stream tokens — all with idiomatic Go.

chat.go
m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

response, _ := m.Chat("Explain Go interfaces in one sentence.")
fmt.Println(response)
stream.go
m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

ch, _ := m.ChatStream(ctx, "Tell me a joke.")
for tok := range ch {
    if !tok.Done {
        fmt.Print(tok.Text)
    }
}
embed.go
m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

vecs, _ := m.Embed([]string{
    "Go is statically typed.",
    "Rust has a borrow checker.",
})
score := vecs[0].CosineSimilarity(vecs[1])
fmt.Printf("similarity: %.4f\n", score)
structured.go
m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

schema := grammar.JSONSchema{
    Type: "object",
    Properties: map[string]*grammar.JSONSchema{
        "name": {Type: "string"},
        "age":  {Type: "number"},
    },
}
result, _ := m.Generate(ctx, "Generate a person named Alice.",
    zerfoo.WithSchema(schema))
fmt.Println(result) // {"name": "Alice", "age": 30}

Built for production Go

Everything you need to build, train, and deploy ML models in your Go services.

Zero CGo

go build ./... compiles everywhere. GPU acceleration is loaded dynamically at runtime via dlopen. No C compiler, no build tags, no CUDA toolkit at build time.
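Loading the GPU library at runtime implies a graceful fallback when no driver is present. A minimal sketch of that probe-and-select pattern (illustrative only, not zerfoo's actual code; the backend types here are hypothetical):

```go
package main

import "fmt"

// Backend is a compute device the runtime can select at startup.
type Backend interface {
	Name() string
	Available() bool // e.g. did dlopen("libcudart.so") succeed?
}

type cudaBackend struct{}

func (cudaBackend) Name() string    { return "cuda" }
func (cudaBackend) Available() bool { return false } // stub: pretend no GPU driver was found

type cpuBackend struct{}

func (cpuBackend) Name() string    { return "cpu" }
func (cpuBackend) Available() bool { return true } // the pure-Go path always works

// pick probes backends in preference order and returns the first usable one.
func pick(backends ...Backend) Backend {
	for _, b := range backends {
		if b.Available() {
			return b
		}
	}
	return nil
}

func main() {
	b := pick(cudaBackend{}, cpuBackend{})
	fmt.Println("selected backend:", b.Name()) // "cpu" on a machine without CUDA
}
```

Because the probe happens at runtime, the same binary runs on GPU and CPU-only hosts.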

🚀

GPU Accelerated

25+ custom CUDA kernels, CUDA graph capture covering 99.5% of decode instructions, and fused operations (RoPE, SwiGLU, AddRMSNorm, QKNormRoPE). ROCm and OpenCL backends available.

📦

Embeddable Library

Import github.com/zerfoo/zerfoo and call model.Chat(). No sidecar process, no HTTP boundary, no Python runtime. ML inference is a function call.

🎯

OpenAI-Compatible API

Built-in HTTP server with /v1/chat/completions, SSE streaming, embeddings, Prometheus metrics, and health checks. Drop-in replacement for OpenAI clients.

🔎

Structured Output & Tools

Grammar-guided decoding constrains model output to valid JSON matching your schema. Tool calling with OpenAI-compatible function detection built in.
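Grammar-guided decoding generally works by masking logits: at each step, tokens the grammar cannot accept get -Inf probability before sampling. A toy illustration of that masking step (a conceptual sketch, not zerfoo's decoder):

```go
package main

import (
	"fmt"
	"math"
)

// maskLogits sets the logit of every disallowed token to -Inf so it can
// never be sampled, while leaving the model's relative preferences among
// the allowed tokens untouched.
func maskLogits(logits []float64, allowed map[int]bool) {
	for i := range logits {
		if !allowed[i] {
			logits[i] = math.Inf(-1)
		}
	}
}

// argmax returns the index of the largest logit (greedy sampling).
func argmax(logits []float64) int {
	best := 0
	for i, v := range logits {
		if v > logits[best] {
			best = i
		}
	}
	return best
}

func main() {
	// Suppose the model prefers token 2, but the grammar (say, a JSON
	// schema expecting `{` next) only permits tokens 0 and 3.
	logits := []float64{0.1, 0.5, 2.0, 0.9}
	maskLogits(logits, map[int]bool{0: true, 3: true})
	fmt.Println("chosen token:", argmax(logits)) // 3: the best allowed token
}
```

Repeating this at every step guarantees the final output parses against the schema.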

🧮

Type-Safe Generics

Go 1.25 generics throughout — tensor.Numeric constraint for compile-time type safety across float32, float16, bfloat16, float8, and quantized types.
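The pattern behind a tensor.Numeric-style constraint can be shown with a small generic dot product (a sketch of the approach only; zerfoo's real constraint also spans the half-precision and quantized types listed above):

```go
package main

import "fmt"

// Numeric is a simplified element-type constraint for illustration.
type Numeric interface {
	~float32 | ~float64 | ~int32 | ~int64
}

// Dot is type-checked at compile time: mixing a []float32 and a []float64
// operand is a build error, not a runtime surprise.
func Dot[T Numeric](a, b []T) T {
	var sum T
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

func main() {
	fmt.Println(Dot([]float32{1, 2, 3}, []float32{4, 5, 6})) // 32
}
```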

📊

HuggingFace Integration

zerfoo.Load("google/gemma-3-4b") downloads and caches models automatically. Specify quantization variants with a /Q8_0 suffix.

🔐

Production Ready

TLS/mTLS, graceful shutdown, structured logging, Prometheus metrics, health checks, and distributed training via gRPC/NCCL. Built for real workloads.
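Graceful shutdown in Go services follows the standard net/http pattern: stop accepting connections, then let in-flight requests drain. A generic sketch of that pattern (not zerfoo's server code):

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// runAndDrain starts a server on an ephemeral port, then shuts it down,
// giving in-flight requests up to 5s to finish. In a real service the
// shutdown would be triggered by SIGTERM rather than immediately.
func runAndDrain() error {
	srv := &http.Server{Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})}

	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	go srv.Serve(ln)

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return srv.Shutdown(ctx) // stops listening, waits for active requests
}

func main() {
	if err := runAndDrain(); err != nil {
		panic(err)
	}
	fmt.Println("server drained cleanly")
}
```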

⚙️

ARM NEON SIMD

Hand-written ARM NEON and x86 AVX2 assembly for CPU-bound operations — GEMM, RMSNorm, RoPE, SiLU, softmax. Competitive CPU performance without a GPU.

Faster than Ollama

Benchmarked on NVIDIA DGX Spark (GB10), CUDA 13.0, Go 1.25. Gemma 3 1B Q4_K_M, 256 tokens.

Runtime | Throughput | Notes
Zerfoo | 245 tok/s | Pure Go, zero CGo, CUDA graph capture, fused kernels
Ollama | 204 tok/s | llama.cpp C++ backend, same model, same hardware

Performance journey

Date | Milestone | Tok/s | Improvement
Mar 17 | dp4a INT8 + arena reuse | 245 | +20% vs Ollama
Mar 17 | Q4_0 re-quant restored | 245 | +32% vs regression
Mar 14 | CUDA graph capture | 234 | +26% vs non-graph
Mar 13 | GPU-first pipeline | 103 | D2H elimination
Mar 12 | ARM NEON SIMD | 8.15 | +18.8% CPU accel
Mar 10 | Initial CPU baseline | 3.60 | Starting point
Hardware: NVIDIA DGX Spark GB10, sm_121, 128GB LPDDR5x. Methodology: 3 runs, 32-token warmup, median reported. Full details in benchmarking-methodology.md.

Supported models

Production-ready transformer architectures. Load any GGUF model from HuggingFace.

Gemma 3 | Production
Llama 3 | Production
Qwen 2.5 | Production
Mistral | Production
Phi 3/4 | Production
DeepSeek V3 | Production
SigLIP | Vision encoder
Kimi-VL | Vision-language

Uses GGUF as the sole model format. Compatible with llama.cpp, Ollama, LM Studio, and GPT4All model files.
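Every GGUF file begins with the 4-byte magic GGUF followed by a little-endian uint32 version field, so format detection is cheap. A minimal header sniffer (illustrative; zerfoo's real loader parses the full metadata, and this helper is our own):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// sniffGGUF reports whether the header bytes look like a GGUF file and,
// if so, which format version they declare.
func sniffGGUF(header []byte) (version uint32, ok bool) {
	if len(header) < 8 || !bytes.Equal(header[:4], []byte("GGUF")) {
		return 0, false
	}
	// The version field follows the magic as a little-endian uint32.
	return binary.LittleEndian.Uint32(header[4:8]), true
}

func main() {
	// The first 8 bytes of a GGUF v3 file.
	header := []byte{'G', 'G', 'U', 'F', 3, 0, 0, 0}
	if v, ok := sniffGGUF(header); ok {
		fmt.Printf("GGUF v%d\n", v) // GGUF v3
	}
}
```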

CLI included

Pull models, run inference, and serve an OpenAI-compatible API from the command line.

terminal
# Install
$ go install github.com/zerfoo/zerfoo/cmd/zerfoo@latest

# Pull a model from HuggingFace
$ zerfoo pull gemma-3-1b-q4

# Interactive chat
$ zerfoo run gemma-3-1b-q4
Model loaded. Type your message (Ctrl-D to quit).
> What is the capital of France?
The capital of France is Paris.

# OpenAI-compatible API server
$ zerfoo serve gemma-3-1b-q4 --port 8080

# Query with any OpenAI client
$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"gemma-3-1b-q4","messages":[{"role":"user","content":"Hello!"}]}'

The Zerfoo ecosystem

Six focused Go modules that compose together.

From the blog

Deep dives into the engineering behind Zerfoo.

Ready to add ML to your Go app?

One import. One function call. No Python, no CGo, no sidecar.

$ go get github.com/zerfoo/zerfoo
Getting Started Guide
View Source