Migrating to Zerfoo v1.0#
This guide covers all breaking changes between Zerfoo v0.x and v1.0, with actionable migration steps for each.
Overview#
Zerfoo v1.0 is the first release with a backwards-compatibility guarantee (2 years through v1.x). The major changes are:
- ZMF model format removed – GGUF is the sole model format.
- Repository split – tensor/compute/graph packages moved to ztensor; tokenizer moved to ztoken.
- compute.Engine[T] interface frozen – new capabilities use extension interfaces.
- CGo build tags removed – GPU bindings use purego/dlopen exclusively.
- High-level API introduced – inference.Load / inference.LoadFile replace manual GGUF loading.
- KV cache interface unified – CacheProvider[T] replaces concrete *KVCache[T] in context helpers.
- Sub-package maturity labels – packages are labeled Stable, Beta, or Alpha with different compatibility guarantees.
Breaking Changes#
1. ZMF Model Format Removed (ADR-037)#
The ZMF/protobuf model loading path has been removed entirely. GGUF is the sole model format.
| Removed API | Replacement |
|---|---|
| model.LoadZMF(path) | inference.LoadFile(path) with a .gguf file |
| model.ExportZMF(path) | No direct replacement; use GGUF for checkpoints |
| model.Builder (generic graph-from-ZMF) | Architecture-specific builders via inference.RegisterArchitecture |
| graph.FuseRMSNorm() fusion pass | Not needed – GGUF builders emit fused ops directly |
| model/tensor_encoder.go, tensor_decoder.go | Removed (protobuf serialization) |
| model/adapters.go ZMF adapter code | Removed |
Migration steps:
- Convert any ZMF model files to GGUF using zonnx (which now outputs GGUF instead of ZMF).
- Replace model.LoadZMF(path) calls with inference.LoadFile("model.gguf").
- Remove any github.com/zerfoo/zmf imports – the zmf repository is archived.
- Remove protobuf dependencies that were only used for ZMF model loading.
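Before removing dependencies, it can help to confirm that no ZMF references remain. The following is a minimal sketch, not part of Zerfoo: it scans Go source text for the removed identifiers named in the table above. The names `zmfMarkers`, `finding`, and `findZMFUsage` are hypothetical, and plain substring matching will also flag matches inside comments and strings.

```go
package main

import (
	"fmt"
	"strings"
)

// zmfMarkers are substrings that indicate leftover ZMF usage. All three come
// from the removed-API table and the migration steps above.
var zmfMarkers = []string{
	"model.LoadZMF(",
	"model.ExportZMF(",
	"github.com/zerfoo/zmf",
}

// finding records one line that still references a removed ZMF API.
type finding struct {
	Line   int    // 1-based line number
	Marker string // the marker that matched
}

// findZMFUsage returns one finding per (line, marker) match in src.
func findZMFUsage(src string) []finding {
	var out []finding
	for i, line := range strings.Split(src, "\n") {
		for _, m := range zmfMarkers {
			if strings.Contains(line, m) {
				out = append(out, finding{Line: i + 1, Marker: m})
			}
		}
	}
	return out
}

func main() {
	src := `package app

import "github.com/zerfoo/zmf"

func load() { m, _ := model.LoadZMF("model.zmf"); _ = m }
`
	for _, f := range findZMFUsage(src) {
		fmt.Printf("line %d: %s\n", f.Line, f.Marker)
	}
}
```

In a real project you would read each .go file from disk and pass its contents to the same function; a plain text search for the same markers works equally well.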
2. Repository Split (ADR-036)#
Tensor, compute, graph, and tokenizer packages have been extracted into independent repositories.
| Old import path | New import path |
|---|---|
| github.com/zerfoo/zerfoo/tensor | github.com/zerfoo/ztensor/tensor |
| github.com/zerfoo/zerfoo/compute | github.com/zerfoo/ztensor/compute |
| github.com/zerfoo/zerfoo/graph | github.com/zerfoo/ztensor/graph |
| github.com/zerfoo/zerfoo/numeric | github.com/zerfoo/ztensor/numeric |
| github.com/zerfoo/zerfoo/device | github.com/zerfoo/ztensor/device |
| github.com/zerfoo/zerfoo/types | github.com/zerfoo/ztensor/types |
| github.com/zerfoo/zerfoo/log | github.com/zerfoo/ztensor/log |
| github.com/zerfoo/zerfoo/pkg/tokenizer | github.com/zerfoo/ztoken |
Migration steps:
- Update import paths as shown above.
- Run go mod tidy to add the new ztensor and ztoken dependencies.
- The types are identical – no code changes beyond import paths are needed.
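The path mapping above is mechanical, so it can be expressed as a small lookup. The sketch below is illustrative only: `rewriteImport` and `moved` are hypothetical names. It also assumes that sub-packages keep their relative path under the new root, which the table does not state.

```go
package main

import (
	"fmt"
	"strings"
)

const oldRoot = "github.com/zerfoo/zerfoo/"

// moved maps old package paths (relative to oldRoot) to their new import
// paths, copied from the table above.
var moved = map[string]string{
	"tensor":        "github.com/zerfoo/ztensor/tensor",
	"compute":       "github.com/zerfoo/ztensor/compute",
	"graph":         "github.com/zerfoo/ztensor/graph",
	"numeric":       "github.com/zerfoo/ztensor/numeric",
	"device":        "github.com/zerfoo/ztensor/device",
	"types":         "github.com/zerfoo/ztensor/types",
	"log":           "github.com/zerfoo/ztensor/log",
	"pkg/tokenizer": "github.com/zerfoo/ztoken",
}

// rewriteImport returns the v1.0 import path for a v0.x path. Paths that did
// not move (for example .../inference) are returned unchanged.
func rewriteImport(path string) string {
	if !strings.HasPrefix(path, oldRoot) {
		return path
	}
	rel := strings.TrimPrefix(path, oldRoot)
	for old, repl := range moved {
		if rel == old {
			return repl
		}
		// Assumption: sub-packages keep their relative path under the new root.
		if strings.HasPrefix(rel, old+"/") {
			return repl + strings.TrimPrefix(rel, old)
		}
	}
	return path
}

func main() {
	for _, p := range []string{
		"github.com/zerfoo/zerfoo/tensor",
		"github.com/zerfoo/zerfoo/pkg/tokenizer",
		"github.com/zerfoo/zerfoo/inference",
	} {
		fmt.Printf("%s -> %s\n", p, rewriteImport(p))
	}
}
```

Matching on a trailing "/" boundary avoids rewriting unrelated packages that merely share a prefix, such as a hypothetical .../typesx.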
3. Engine[T] Interface Frozen (ADR-058)#
The compute.Engine[T] interface is frozen for v1.x. New GPU capabilities are
exposed via optional extension interfaces checked with type assertions.
| Extension Interface | Purpose |
|---|---|
| EngineWithFP8 | FP8 E4M3 compute operations |
| EngineWithPagedKV | Paged KV cache memory management |
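The type-assertion check mentioned above can be sketched with local stand-in types. Nothing below is Zerfoo's actual API: `engine` and `engineWithFP8` are simplified placeholders, and the `FP8MatMul` signature is shortened for illustration. The point is only the pattern: callers probe for the optional interface and fall back when it is absent.

```go
package main

import "fmt"

// engine stands in for compute.Engine[T]; its real method set is not
// reproduced here.
type engine interface {
	Name() string
}

// engineWithFP8 stands in for EngineWithFP8. This simplified method
// signature is an assumption made for the example.
type engineWithFP8 interface {
	FP8MatMul(m, n, k int) error
}

// cpuEngine implements only the base interface.
type cpuEngine struct{}

func (cpuEngine) Name() string { return "cpu" }

// gpuEngine additionally opts into the FP8 extension.
type gpuEngine struct{}

func (gpuEngine) Name() string                { return "gpu" }
func (gpuEngine) FP8MatMul(m, n, k int) error { return nil }

// matMulPath reports which code path a caller would take for e.
func matMulPath(e engine) string {
	if fp8, ok := e.(engineWithFP8); ok {
		if err := fp8.FP8MatMul(1, 1, 1); err != nil {
			return "fp8 failed: " + err.Error()
		}
		return "fp8"
	}
	return "fallback"
}

func main() {
	fmt.Println(matMulPath(cpuEngine{})) // fallback
	fmt.Println(matMulPath(gpuEngine{})) // fp8
}
```

Because the base interface never grows, an engine written against v1.0 keeps compiling, and new capabilities stay strictly opt-in.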
Migration steps:
If you have a custom Engine[T] implementation, it will continue to compile
unchanged. To opt into new capabilities, implement the relevant extension
interface:
import "unsafe"
// Before: a new capability would have required adding methods to Engine[T].
// After: implement the relevant extension interface instead.
type MyEngine struct { /* ... */ }
// Existing Engine[T] methods remain unchanged.
// FP8MatMul opts MyEngine into FP8 by implementing EngineWithFP8.
func (e *MyEngine) FP8MatMul(a, b, out unsafe.Pointer, m, n, k int) error {
	// ... perform the FP8 matrix multiplication ...
	return nil
}
4. CGo Build Tags Removed#
All GPU bindings (CUDA, ROCm, OpenCL, cuBLAS, cuDNN) now use purego/dlopen
exclusively. The cuda, rocm, and opencl build tags are no longer
recognized.
| Old build command | New build command |
|---|---|
| go build -tags cuda ./... | go build ./... |
| go test -tags cuda ./... | go test ./... |
| go build -tags rocm ./... | go build ./... |
Migration steps:
- Remove -tags cuda, -tags rocm, and -tags opencl from all build scripts, CI pipelines, Dockerfiles, and Makefiles.
- GPU acceleration is now detected at runtime. If the CUDA/ROCm/OpenCL shared libraries are present on the system, they are loaded automatically via dlopen.
- Remove any CGo toolchain dependencies (gcc, nvcc for bindings) from your build environment. The CUDA kernel shared library (libzerfoo_kernels.so) is still compiled separately but is loaded at runtime.
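For build scripts with many invocations, the first step can be automated. The sketch below is a hypothetical helper, not a Zerfoo tool: `stripObsoleteTags` drops cuda, rocm, and opencl from a go command line while keeping any other tags. It handles the `-tags list` and `-tags=list` forms with comma-separated tags; quoted, space-separated tag lists are out of scope.

```go
package main

import (
	"fmt"
	"strings"
)

// obsolete lists the build tags that v1.0 no longer recognizes.
var obsolete = map[string]bool{"cuda": true, "rocm": true, "opencl": true}

// stripObsoleteTags removes obsolete tags from a whitespace-separated go
// command line. If no tags remain, the -tags flag is dropped entirely.
func stripObsoleteTags(cmd string) string {
	args := strings.Fields(cmd)
	var out []string
	for i := 0; i < len(args); i++ {
		arg := args[i]
		var list string
		switch {
		case arg == "-tags" && i+1 < len(args):
			i++ // consume the tag list that follows the flag
			list = args[i]
		case strings.HasPrefix(arg, "-tags="):
			list = strings.TrimPrefix(arg, "-tags=")
		default:
			out = append(out, arg)
			continue
		}
		var keep []string
		for _, tag := range strings.Split(list, ",") {
			if tag != "" && !obsolete[tag] {
				keep = append(keep, tag)
			}
		}
		if len(keep) > 0 {
			out = append(out, "-tags", strings.Join(keep, ","))
		}
	}
	return strings.Join(out, " ")
}

func main() {
	fmt.Println(stripObsoleteTags("go build -tags cuda ./..."))
	fmt.Println(stripObsoleteTags("go test -tags cuda,integration ./..."))
}
```

Unrelated tags such as a project-specific `integration` tag are preserved, so the helper is safe to run over commands that mix GPU tags with others.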
5. High-Level API Changes#
The top-level zerfoo package now provides a stable convenience API.
| Old API (v0.x) | New API (v1.0) |
|---|---|
| zerfoo.Load(pathOrID) returning (*zerfoo.Model, error) | Same signature, now marked Stable |
| Manual GGUF load + graph build + generator creation | inference.Load(modelID, opts...) or inference.LoadFile(path, opts...) |
The inference.Load function now accepts HuggingFace model IDs in addition
to local file paths. Short aliases like "gemma-3-1b-q4" are supported.
Before (v0.x):
gguf, err := inference.LoadGGUF("/path/to/model.gguf")
// manually build graph, create engine, create generator...
gen := generate.NewGenerator[float32](g, tok, engine, cfg)
text, _ := gen.Generate(ctx, "Hello", generate.DefaultSamplingConfig())
After (v1.0):
m, err := inference.Load("gemma-3-1b-q4",
inference.WithDevice("cuda"),
inference.WithMaxSeqLen(4096),
)
if err != nil {
log.Fatal(err)
}
defer m.Close()
text, err := m.Generate(ctx, "Hello",
inference.WithMaxTokens(256),
inference.WithTemperature(0.7),
)
The low-level generate.Generator API remains available for users who need
fine-grained control over the generation loop.
6. KV Cache Context Helpers#
The concrete-type KV cache context helpers are deprecated in favor of the
interface-based CacheProvider[T] versions.
| Deprecated | Replacement |
|---|---|
| generate.WithKVCache[T](ctx, *KVCache[T]) | generate.WithCache[T](ctx, CacheProvider[T]) |
| generate.GetKVCache[T](ctx) | generate.GetCache[T](ctx) |
The deprecated functions still work in v1.x but will be removed in v2.0.
Migration steps:
- Replace generate.WithKVCache with generate.WithCache.
- Replace generate.GetKVCache with generate.GetCache.
- Both *KVCache[T] and *TensorCache[T] implement CacheProvider[T], so no other changes are needed.
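To see why an interface-based helper accepts both cache types, consider this self-contained sketch. All types and functions here are local stand-ins, not the generate package: the real CacheProvider[T] method set and the return signature of generate.GetCache are not shown in this guide, so `Len` and the `(value, ok)` return are assumptions.

```go
package main

import (
	"context"
	"fmt"
)

// cacheProvider stands in for CacheProvider[T]. Len is a hypothetical
// placeholder for the real method set.
type cacheProvider[T any] interface {
	Len() int
}

// kvCache and tensorCache stand in for the two concrete cache types.
type kvCache[T any] struct{ entries []T }

func (c *kvCache[T]) Len() int { return len(c.entries) }

type tensorCache[T any] struct{ n int }

func (c *tensorCache[T]) Len() int { return c.n }

// cacheKey is a generic context key; each element type T gets its own key.
type cacheKey[T any] struct{}

// withCache stores any cacheProvider[T] in the context.
func withCache[T any](ctx context.Context, c cacheProvider[T]) context.Context {
	return context.WithValue(ctx, cacheKey[T]{}, c)
}

// getCache retrieves the provider stored for element type T, if any.
func getCache[T any](ctx context.Context) (cacheProvider[T], bool) {
	c, ok := ctx.Value(cacheKey[T]{}).(cacheProvider[T])
	return c, ok
}

// roundTrip stores c in a fresh context and reads it back, returning the
// length reported by the retrieved provider.
func roundTrip(c cacheProvider[float32]) (int, bool) {
	ctx := withCache[float32](context.Background(), c)
	got, ok := getCache[float32](ctx)
	if !ok {
		return 0, false
	}
	return got.Len(), true
}

func main() {
	fmt.Println(roundTrip(&kvCache[float32]{entries: []float32{1, 2, 3}}))
	fmt.Println(roundTrip(&tensorCache[float32]{n: 8}))
}
```

Because the helper takes the interface rather than a concrete pointer type, call sites that pass either cache type need no change beyond the function name.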
7. Package Rename: timeseries#
The internal time-series inference package was renamed.
| Old import path | New import path |
|---|---|
| inference/ts | inference/timeseries |
| cmd/ts_train | cmd/ts_train |
Migration steps:
Update import paths accordingly.
Sub-Package Maturity Labels#
Packages are now labeled by maturity level, which determines their compatibility guarantee:
| Level | Guarantee | Packages |
|---|---|---|
| Stable | Full v1.x backwards compatibility | inference/, generate/, serve/, model/, layers/ |
| Beta | Schema preserved, behavior may change | training/, distributed/ |
| Alpha | May be restructured | training/nas/, training/automl/ |
Deprecation Policy#
Starting with v1.0, all deprecations follow this protocol:
- A // Deprecated: comment is added to the symbol.
- The deprecated symbol coexists with its replacement for at least 2 minor releases.
- Removal happens only in v2.0.
A deprecation linter (cmd/deprecation-check) is available to scan your code
for usage of deprecated symbols.
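The `// Deprecated:` marker follows the standard Go doc-comment convention, so it can be detected with the standard library. The sketch below is not cmd/deprecation-check, and its flags and behavior are not documented here. The hypothetical `deprecatedFuncs` helper only lists top-level functions whose doc comment carries the marker; finding usages of those symbols is a separate step.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

// deprecatedFuncs parses Go source and returns the names of function
// declarations whose doc comment has a line starting with "Deprecated:".
func deprecatedFuncs(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "src.go", src, parser.ParseComments)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Doc == nil {
			continue
		}
		// Doc.Text() strips the comment markers, leaving plain text lines.
		for _, line := range strings.Split(fn.Doc.Text(), "\n") {
			if strings.HasPrefix(line, "Deprecated:") {
				names = append(names, fn.Name.Name)
				break
			}
		}
	}
	return names, nil
}

func main() {
	const sample = `package demo

// Old does things.
//
// Deprecated: use New instead.
func Old() {}

// New does things.
func New() {}
`
	names, err := deprecatedFuncs(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(names)
}
```

Editors and linters that understand the same convention will typically flag calls to such symbols, which gives a second way to find code that needs updating before v2.0.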
New Features in v1.0#
These are additive and do not require migration, but are worth knowing about:
- Architecture registry – inference.RegisterArchitecture / inference.ListArchitectures for pluggable model support.
- 28 architectures (16 model families) – Llama 3/4, Gemma 3/3n, Mistral, Qwen 2, Phi 3/4, DeepSeek V3, GPT-2, Nemotron-H, MiniMax M2, GLM-4, Kimi K2, LFM2, OLMo 2, EXAONE, StarCoder 2, InternLM 2, DBRX, Falcon, Command R, Mixtral, RWKV, Jamba, Mamba 3, Whisper, and more.
- Speculative decoding – inference.Model.SpeculativeGenerate and generate.WithSpeculativeDraft.
- Paged KV cache – generate.WithPagedKV for memory-efficient serving.
- Prefix caching – generate.WithPrefixCache for shared system prompt reuse.
- FP16 KV cache – generate.WithGeneratorKVDtype("fp16") for 2x bandwidth reduction.
- Grammar-constrained decoding – inference.WithGrammar and serve.ResponseFormat{Type: "json_schema"}.
- Tool calling – serve.Tool / serve.ToolChoice in the OpenAI-compatible API.
- Vision and audio – multimodal inference with LLaVA, SigLIP, and Whisper.
- Batch generation – inference.Model.GenerateBatch and serve.BatchScheduler.
- EAGLE speculative decoding – generate.WithEAGLE for speculative draft-and-verify with built-in head training.
- Q4_K fused GEMV – 14x faster dequantize-and-multiply kernel for Q4_K quantized weights.
- TransMLA – MHA-to-MLA conversion for DeepSeek-style multi-head latent attention.
- Multi-LoRA per-request serving – serve.WithLoRA routes each request to a different LoRA adapter.
- BitNet ternary inference – native 1.58-bit ternary weight support for BitNet models.
- Native Sparse Attention (NSA) – sparse attention patterns for long-context efficiency.
- Hybrid CPU/GPU MoE – expert routing across CPU and GPU for memory-constrained deployments.
- Quantized KV cache – generate.WithKVQuant("q4") and generate.WithKVQuant("q3") for reduced KV memory.
- Time-series inference – Granite TTM/FlowState models, 21x faster than Python granite-tsfm.
- Continuous batching – serve.NewBatchScheduler for high-throughput serving.
- LoRA/QLoRA fine-tuning – training/lora/ and cmd/finetune.
- FSDP distributed training – distributed/fsdp/ with NCCL AllGather/ReduceScatter.
- HuggingFace model downloads – zerfoo pull CLI with resume and SHA256 verification.
- SSM support – Mamba block, SSM state management, Jamba hybrid architecture.
Import Path Reference#
The v1.0 import path remains github.com/zerfoo/zerfoo (implicit v1 per Go
convention). When v2.0 is released, it will use github.com/zerfoo/zerfoo/v2.
import (
"github.com/zerfoo/zerfoo" // top-level convenience API
"github.com/zerfoo/zerfoo/inference" // model loading and generation
"github.com/zerfoo/zerfoo/generate" // low-level generation control
"github.com/zerfoo/zerfoo/serve" // OpenAI-compatible server
"github.com/zerfoo/zerfoo/training" // training framework (Beta)
"github.com/zerfoo/ztensor/tensor" // tensor types
"github.com/zerfoo/ztensor/compute" // compute engine interface
"github.com/zerfoo/ztensor/graph" // computation graph
"github.com/zerfoo/ztoken" // BPE tokenizer
)