Migrating from Ollama to Zerfoo#
Ollama is a popular tool for running LLMs locally. If you’re using Ollama today and want to switch to Zerfoo — whether for the 25% throughput improvement, Go-native embedding, or OpenAI API compatibility — this guide walks you through the migration step by step.
Why Migrate?#
Before diving into the how, here’s what Zerfoo offers over Ollama:
| Feature | Ollama | Zerfoo |
|---|---|---|
| Decode throughput (Gemma 3 1B Q4_K_M) | 188 tok/s | 241 tok/s (+25%) |
| Language | Go + CGo (wraps llama.cpp) | Pure Go (zero CGo) |
| Embeddable as library | No (separate process) | Yes (go get and import) |
| OpenAI-compatible API | Yes | Yes |
| CUDA graph capture | No | Yes (99.5% coverage) |
| Build requirements | C/C++ toolchain | go build only |
| Model format | GGUF | GGUF |
| Deployment | Daemon process | Single binary or library |
The biggest difference is architectural: Ollama runs as a separate daemon process that you communicate with via HTTP. Zerfoo can run as a library embedded directly in your Go application, eliminating the process boundary entirely.
Step 1: Install Zerfoo#
As a CLI#
go install github.com/zerfoo/zerfoo/cmd/zerfoo@latestThis gives you the zerfoo command, analogous to the ollama command.
As a Library#
go get github.com/zerfoo/zerfoo@latestThis adds Zerfoo to your Go module for use as an embedded library.
Step 2: Pull Your Models#
Zerfoo uses GGUF model files, the same format Ollama uses under the hood. You can either download models through Zerfoo’s CLI or use GGUF files you already have.
Using the CLI#
# Ollama:
ollama pull gemma3:1b
# Zerfoo equivalent:
zerfoo pull gemma-3-1b-q4Using Existing GGUF Files#
If you already have GGUF files (from Ollama, llama.cpp, or HuggingFace), Zerfoo can load them directly:
zerfoo run ./models/gemma-3-1b-it-Q4_K_M.ggufOllama stores its models in ~/.ollama/models/. The GGUF files within that directory work with Zerfoo without modification.
Step 3: Migrate Your API Calls#
Both Ollama and Zerfoo support the OpenAI chat completions API, so migration is often as simple as changing the base URL.
Starting the Server#
# Ollama:
ollama serve
# Listens on http://localhost:11434
# Zerfoo:
zerfoo serve gemma-3-1b-q4 --port 8080
# Listens on http://localhost:8080curl#
# Ollama (OpenAI-compatible endpoint):
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gemma3:1b","messages":[{"role":"user","content":"Hello!"}]}'
# Zerfoo:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gemma-3-1b-q4","messages":[{"role":"user","content":"Hello!"}]}'The request and response formats are identical — both follow the OpenAI specification.
Python (OpenAI SDK)#
from openai import OpenAI
# Ollama:
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
# Zerfoo — just change the URL:
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
# Same code from here on:
response = client.chat.completions.create(
model="gemma-3-1b-q4",
messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)Streaming#
# Works identically with both Ollama and Zerfoo:
stream = client.chat.completions.create(
model="gemma-3-1b-q4",
messages=[{"role": "user", "content": "Tell me a story."}],
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="")Step 4: Embed in Your Go Application (Optional)#
This is where Zerfoo diverges most from Ollama. Instead of running a separate server process, you can embed inference directly in your Go application:
package main
import (
"fmt"
"log"
"net/http"
"github.com/zerfoo/zerfoo"
)
var model *zerfoo.Model
func main() {
var err error
model, err = zerfoo.Load("google/gemma-3-1b")
if err != nil {
log.Fatal(err)
}
defer model.Close()
http.HandleFunc("/chat", handleChat)
log.Fatal(http.ListenAndServe(":9090", nil))
}
func handleChat(w http.ResponseWriter, r *http.Request) {
reply, err := model.Chat("What is the capital of France?")
if err != nil {
http.Error(w, err.Error(), 500)
return
}
fmt.Fprint(w, reply)
}No sidecar process. No HTTP calls to a local inference server. The model runs in-process with your application.
With Streaming#
func handleStreamChat(w http.ResponseWriter, r *http.Request) {
flusher, ok := w.(http.Flusher)
if !ok {
http.Error(w, "streaming not supported", 500)
return
}
stream, err := model.ChatStream(r.Context(), "Tell me a joke.")
if err != nil {
http.Error(w, err.Error(), 500)
return
}
for tok := range stream {
if tok.Done {
break
}
fmt.Fprint(w, tok.Text)
flusher.Flush()
}
}With Generation Options#
result, err := model.Generate(ctx, "Explain quicksort.",
zerfoo.WithGenTemperature(0.7),
zerfoo.WithGenMaxTokens(256),
zerfoo.WithGenTopP(0.9),
)Step 5: GPU Acceleration#
Both Ollama and Zerfoo support GPU acceleration, but the mechanism differs.
Ollama bundles llama.cpp with CGo bindings. GPU support requires a C/C++ toolchain at build time and the appropriate CUDA/ROCm libraries.
Zerfoo loads GPU libraries dynamically at runtime via purego/dlopen. No C toolchain is needed at build time. If CUDA is available at runtime, Zerfoo uses it automatically:
import "github.com/zerfoo/zerfoo/inference"
m, err := inference.LoadFile("model.gguf",
inference.WithDevice("cuda"),
inference.WithDType("fp16"),
)To verify GPU detection:
zerfoo serve gemma-3-1b-q4 --port 8080
# Logs will show: "using CUDA device 0" or "using CPU"Step 6: Production Configuration#
For production deployments, Zerfoo supports features that go beyond what Ollama offers out of the box:
Prometheus Metrics#
zerfoo serve gemma-3-1b-q4 --port 8080 --metrics
# Metrics available at http://localhost:8080/metricsExposes request counts, latency histograms, tokens generated, GPU memory usage, and more.
Speculative Decoding#
Use a smaller draft model to accelerate generation:
zerfoo serve gemma-3-4b-q4 --port 8080 \
--draft-model gemma-3-1b-q4The draft model proposes tokens greedily, and the target model verifies them in a single batched forward pass. When the acceptance rate is high, this can significantly increase throughput.
Structured Output (JSON Mode)#
Zerfoo supports grammar-constrained decoding for structured output:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-3-1b-q4",
"messages": [{"role": "user", "content": "List 3 countries as JSON"}],
"response_format": {"type": "json_object"}
}'Model Compatibility#
Both frameworks use GGUF, so model compatibility is straightforward:
| Architecture | Ollama | Zerfoo |
|---|---|---|
| Llama 3 | Yes | Yes |
| Gemma 3 | Yes | Yes |
| Mistral | Yes | Yes |
| Qwen 2 | Yes | Yes |
| Phi 3/4 | Yes | Yes |
| DeepSeek V3 | Yes | Yes |
Any GGUF model that works with Ollama should work with Zerfoo. Both frameworks support the same quantization formats (Q4_K_M, Q4_0, Q8_0, FP16, etc.).
Quick Reference: Command Mapping#
| Task | Ollama | Zerfoo |
|---|---|---|
| Install | curl -fsSL ollama.com/install.sh | sh | go install github.com/zerfoo/zerfoo/cmd/zerfoo@latest |
| Pull model | ollama pull gemma3:1b | zerfoo pull gemma-3-1b-q4 |
| Run interactively | ollama run gemma3:1b | zerfoo run gemma-3-1b-q4 |
| Start server | ollama serve | zerfoo serve gemma-3-1b-q4 --port 8080 |
| List models | ollama list | zerfoo list |
| API base URL | http://localhost:11434/v1 | http://localhost:8080/v1 |
| Embed in Go | Not supported | go get github.com/zerfoo/zerfoo |
Summary#
Migrating from Ollama to Zerfoo is straightforward because both use GGUF models and implement the OpenAI API. The main steps are:
- Install the Zerfoo CLI or add the library to your Go module
- Pull or reuse your existing GGUF model files
- Change the base URL in your API clients
- Optionally, embed inference directly in your Go application
The reward: 25% faster decode throughput, zero-CGo builds, in-process inference, and a single-binary deployment model.
go install github.com/zerfoo/zerfoo/cmd/zerfoo@latest
zerfoo pull gemma-3-1b-q4
zerfoo serve gemma-3-1b-q4 --port 8080