Train, run, and serve ML models in your Go application. One import, GPU-accelerated at runtime, no C compiler needed.
Load a model from HuggingFace, generate text, stream tokens — all with idiomatic Go.
```go
m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

response, _ := m.Chat("Explain Go interfaces in one sentence.")
fmt.Println(response)
```
```go
m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

ch, _ := m.ChatStream(ctx, "Tell me a joke.")
for tok := range ch {
	if !tok.Done {
		fmt.Print(tok.Text)
	}
}
```
```go
m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

vecs, _ := m.Embed([]string{
	"Go is statically typed.",
	"Rust has a borrow checker.",
})
score := vecs[0].CosineSimilarity(vecs[1])
fmt.Printf("similarity: %.4f\n", score)
```
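The score here is standard cosine similarity. For reference, a self-contained scalar version of the same computation (independent of Zerfoo's vector type):

```go
package main

import (
	"fmt"
	"math"
)

// CosineSimilarity returns the cosine of the angle between two
// equal-length vectors: dot(a,b) / (||a|| * ||b||).
func CosineSimilarity(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	a := []float32{0.1, 0.9, 0.2}
	b := []float32{0.2, 0.8, 0.1}
	fmt.Printf("similarity: %.4f\n", CosineSimilarity(a, b))
}
```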
```go
m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

schema := grammar.JSONSchema{
	Type: "object",
	Properties: map[string]*grammar.JSONSchema{
		"name": {Type: "string"},
		"age":  {Type: "number"},
	},
}

result, _ := m.Generate(ctx, "Generate a person named Alice.",
	zerfoo.WithSchema(schema))
// {"name": "Alice", "age": 30}
```
Everything you need to build, train, and deploy ML models in your Go services.
`go build ./...` compiles everywhere. GPU acceleration is loaded dynamically at runtime via dlopen. No C compiler, no build tags, no CUDA toolkit at build time.
25+ custom CUDA kernels, CUDA graph capture covering 99.5% of decode instructions, and fused operations (RoPE, SwiGLU, AddRMSNorm, QKNormRoPE). ROCm and OpenCL backends available.
Import `github.com/zerfoo/zerfoo` and call `model.Chat()`. No sidecar process, no HTTP boundary, no Python runtime. ML inference is a function call.
Built-in HTTP server with `/v1/chat/completions`, SSE streaming, embeddings, Prometheus metrics, and health checks. Drop-in replacement for OpenAI clients.
Grammar-guided decoding constrains model output to valid JSON matching your schema. Tool calling with OpenAI-compatible function detection built in.
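At its core, grammar-guided decoding is logit masking: at each step, tokens the grammar cannot accept are excluded before sampling. A toy sketch of that idea (not Zerfoo's actual decoder, which tracks grammar state incrementally):

```go
package main

import "fmt"

// maskedArgmax picks the highest-scoring token whose id the grammar
// currently allows, skipping every disallowed id.
func maskedArgmax(logits []float32, allowed map[int]bool) int {
	best, bestScore := -1, float32(0)
	for id, score := range logits {
		if !allowed[id] {
			continue // grammar forbids this token here
		}
		if best == -1 || score > bestScore {
			best, bestScore = id, score
		}
	}
	return best
}

func main() {
	logits := []float32{3.0, 9.0, 1.5, 7.2}
	// Suppose the JSON grammar only accepts tokens 0 and 3 next
	// (e.g. `{` or a whitespace token); token 1 is blocked despite
	// having the highest raw score.
	allowed := map[int]bool{0: true, 3: true}
	fmt.Println(maskedArgmax(logits, allowed)) // prints 3
}
```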
Go 1.26 generics throughout — tensor.Numeric constraint for compile-time type safety across float32, float16, bfloat16, float8, and quantized types.
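The shape of such a constraint, sketched with a simplified stand-in for `tensor.Numeric` (the real constraint also admits float16, bfloat16, float8, and quantized element types, which need their own Go representations):

```go
package main

import "fmt"

// Numeric is a simplified element-type constraint; Zerfoo's real
// tensor.Numeric also covers half-precision and quantized types.
type Numeric interface {
	~float32 | ~float64 | ~int32 | ~int64
}

// Dot works over any Numeric element type, checked at compile time:
// mixing element types is a build error, not a runtime panic.
func Dot[T Numeric](a, b []T) T {
	var sum T
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

func main() {
	fmt.Println(Dot([]float32{1, 2, 3}, []float32{4, 5, 6})) // 32
	fmt.Println(Dot([]int64{2, 3}, []int64{4, 5}))           // 23
}
```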
`zerfoo.Load("google/gemma-3-4b")` downloads and caches models automatically. Specify quantization variants with a `/Q8_0` suffix.
TLS/mTLS, graceful shutdown, structured logging, Prometheus metrics, health checks, and distributed training via gRPC/NCCL. Built for real workloads.
Hand-written ARM NEON and x86 AVX2 assembly for CPU-bound operations — GEMM, RMSNorm, RoPE, SiLU, softmax. Competitive CPU performance without a GPU.
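For reference, here is what the RMSNorm kernel computes, in scalar Go; the NEON/AVX2 paths vectorize exactly this loop:

```go
package main

import (
	"fmt"
	"math"
)

// RMSNorm scales x by the inverse root-mean-square of its elements,
// then applies a learned per-channel weight: y_i = w_i * x_i / rms(x).
func RMSNorm(x, weight []float32, eps float32) []float32 {
	var ss float64
	for _, v := range x {
		ss += float64(v) * float64(v)
	}
	inv := float32(1.0 / math.Sqrt(ss/float64(len(x))+float64(eps)))
	out := make([]float32, len(x))
	for i, v := range x {
		out[i] = weight[i] * v * inv
	}
	return out
}

func main() {
	x := []float32{1, 2, 3, 4}
	w := []float32{1, 1, 1, 1}
	fmt.Println(RMSNorm(x, w, 1e-6))
}
```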
Built-in EAGLE draft head with integrated training. Speculative decoding accelerates generation by drafting multiple tokens per step and verifying in parallel.
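The verification step at the heart of speculative decoding is simple: the target model scores all drafted tokens in one forward pass, and the longest prefix where draft and target agree is accepted. A greedy-acceptance toy sketch (real EAGLE verification samples from adjusted distributions rather than comparing argmaxes):

```go
package main

import "fmt"

// acceptDraft returns how many drafted tokens survive verification:
// the length of the longest prefix where the target model's greedy
// choice matches the draft head's proposal.
func acceptDraft(draft, targetGreedy []int) int {
	n := 0
	for i := range draft {
		if draft[i] != targetGreedy[i] {
			break // first disagreement: the target takes over here
		}
		n++
	}
	return n
}

func main() {
	draft := []int{14, 92, 7, 55}  // tokens proposed by the draft head
	target := []int{14, 92, 3, 55} // target model's choice at each position
	fmt.Println(acceptDraft(draft, target)) // prints 2: two tokens accepted this step
}
```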
Fused dequantize-and-multiply kernel for Q4_K quantized weights — 14x faster than the unfused path. Keeps decode throughput high on quantized models.
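The idea behind fusion, in scalar form: instead of materializing a dequantized weight buffer and then multiplying, each 4-bit code is decoded and consumed in the same loop. This sketch uses a deliberately simplified blockwise 4-bit scheme, not the real Q4_K layout:

```go
package main

import "fmt"

// fusedDot dequantizes packed 4-bit weights and accumulates the dot
// product with x in one pass, never allocating a float weight buffer.
// Each byte packs two 4-bit codes; code c maps back to (c-8)*scale
// (a simplified scheme, unlike real Q4_K's per-block scales and mins).
func fusedDot(packed []byte, scale float32, x []float32) float32 {
	var sum float32
	for i, b := range packed {
		lo := float32(int(b&0x0F)-8) * scale // dequantize low nibble
		hi := float32(int(b>>4)-8) * scale   // dequantize high nibble
		sum += lo*x[2*i] + hi*x[2*i+1]       // consume immediately
	}
	return sum
}

func main() {
	// Two bytes = four 4-bit codes: 9,8,12,4 -> weights 1,0,4,-4 at scale 1.
	packed := []byte{0x89, 0x4C}
	x := []float32{1, 1, 1, 1}
	fmt.Println(fusedDot(packed, 1.0, x)) // prints 1
}
```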
Multi-LoRA per-request serving, quantized KV cache (Q4/Q3), and hybrid CPU/GPU MoE routing. Production-grade features for multi-tenant deployments.
GGUF files are memory-mapped by default. Tensor data is paged from NVMe on demand — the OS handles it. A 229B-parameter MiniMax-M2 (138 GB across 3 shards) runs on a 128 GB machine with no flags and no configuration.
Benchmarked on NVIDIA DGX Spark (GB10), CUDA 13.0, Go 1.26. Gemma 3 1B Q4_K_M, 256 tokens.
| Runtime | Throughput | Notes |
|---|---|---|
| Zerfoo | 241 tok/s | Pure Go, zero CGo, CUDA graph capture, fused kernels |
| Ollama | — | llama.cpp C++ backend, same model, same hardware |
| Date | Milestone | Tok/s | Improvement |
|---|---|---|---|
| Mar 27 | Multi-model benchmark (3-run median) | 241 | +28% vs Ollama |
| Mar 17 | Q4_0 re-quant restored | 245 | +32% vs regression |
| Mar 14 | CUDA graph capture | 234 | +26% vs non-graph |
| Mar 13 | GPU-first pipeline | 103 | D2H elimination |
| Mar 12 | ARM NEON SIMD | 8.15 | +18.8% CPU accel |
| Mar 10 | Initial CPU baseline | 3.60 | Starting point |
GGUF files are memory-mapped by default. Tensor data pages from NVMe on demand — no flags needed. Split GGUF shards are detected and mapped automatically.
| Model | Params | File size | RAM available | Status |
|---|---|---|---|---|
| MiniMax-M2 Q4_K_M | 229B MoE | 138 GB (3 shards) | 128 GB | Loads and generates text |
41 architectures across 25 model families. Load any GGUF model from HuggingFace.
Uses GGUF as the sole model format. Compatible with llama.cpp, Ollama, LM Studio, and GPT4All model files.
Pull models, run inference, and serve an OpenAI-compatible API from the command line.
```shell
# Install
$ go install github.com/zerfoo/zerfoo/cmd/zerfoo@latest

# Pull a model from HuggingFace
$ zerfoo pull gemma-3-1b-q4

# Interactive chat
$ zerfoo run gemma-3-1b-q4
Model loaded. Type your message (Ctrl-D to quit).
> What is the capital of France?
The capital of France is Paris.

# OpenAI-compatible API server
$ zerfoo serve gemma-3-1b-q4 --port 8080

# QuaRot weight fusion for uniform 4-bit quantization
$ zerfoo run --quarot model.gguf

# Train an EAGLE speculative decoding head
$ zerfoo eagle-train --model model.gguf --corpus data.txt --output eagle.gguf

# Convert MHA model to Multi-head Latent Attention
$ zerfoo transmla --input model.gguf --output model-mla.gguf

# Multi-LoRA serving (per-request adapter selection)
$ curl http://localhost:8080/v1/chat/completions \
    -d '{"model":"gemma3-1b:my-lora","messages":[{"role":"user","content":"Hello!"}]}'
```
Six focused Go modules that compose together.
Core ML framework: inference, training, serving, and the OpenAI-compatible API.
GPU-accelerated tensor library, compute engine, computation graph, and CUDA kernels.
BPE tokenizer with HuggingFace compatibility. Zero external dependencies.
IEEE 754 half-precision (Float16) and BFloat16 arithmetic library.
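BFloat16 keeps float32's 8-bit exponent and truncates the mantissa to 7 bits, so conversion is bit surgery on the top half of a float32. A minimal round-to-nearest-even sketch of the underlying math (the library's actual API will differ):

```go
package main

import (
	"fmt"
	"math"
)

// F32ToBF16 narrows a float32 to bfloat16 with round-to-nearest-even:
// bfloat16 is simply the high 16 bits of the float32 bit pattern.
func F32ToBF16(f float32) uint16 {
	bits := math.Float32bits(f)
	// Add half of the dropped range, breaking ties toward even.
	rounding := uint32(0x7FFF) + ((bits >> 16) & 1)
	return uint16((bits + rounding) >> 16)
}

// BF16ToF32 widens bfloat16 back by zero-filling the low mantissa bits.
func BF16ToF32(b uint16) float32 {
	return math.Float32frombits(uint32(b) << 16)
}

func main() {
	x := float32(3.1415926)
	b := F32ToBF16(x)
	fmt.Printf("%.5f -> 0x%04X -> %.5f\n", x, b, BF16ToF32(b))
}
```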
FP8 E4M3FN arithmetic for quantized inference workloads.
ONNX-to-GGUF converter. Pure Go, CGo-free, standalone binary.
Deep dives into the engineering behind Zerfoo.
CUDA graph capture and fused kernels took Zerfoo from 186 tok/s to 241 tok/s. A deep dive into making the decode path GPU-only.
How Zerfoo runs GPU-accelerated inference without a single CGo call. dlopen, purego, and assembly trampolines.
From a custom protobuf format to GGUF — and deleting 7,500 lines of code in the process.
From zero to LLM inference without changing your build, deployment, or architecture.
One import. One function call. No Python, no CGo, no sidecar.