Migrating from Ollama to Zerfoo

Ollama is a popular tool for running LLMs locally. If you use Ollama today and want to switch to Zerfoo, whether for the 25% higher decode throughput reported below, Go-native embedding, or OpenAI API compatibility, this guide walks you through the migration step by step.

Why Migrate?

Before diving into the how, here’s what Zerfoo offers over Ollama:

| Feature | Ollama | Zerfoo |
|---|---|---|
| Decode throughput (Gemma 3 1B Q4_K_M) | 188 tok/s | 241 tok/s (+25%) |
| Language | Go + CGo (wraps llama.cpp) | Pure Go (zero CGo) |
| Embeddable as library | No (separate process) | Yes (go get and import) |
| OpenAI-compatible API | Yes | Yes |
| CUDA graph capture | No | Yes (99.5% coverage) |
| Build requirements | C/C++ toolchain | go build only |
| Model format | GGUF | GGUF |
| Deployment | Daemon process | Single binary or library |

The biggest difference is architectural: Ollama runs as a separate daemon process that you communicate with via HTTP. Zerfoo can run as a library embedded directly in your Go application, eliminating the process boundary entirely.

Step 1: Install Zerfoo

As a CLI

go install github.com/zerfoo/zerfoo/cmd/zerfoo@latest

This gives you the zerfoo command, analogous to the ollama command.

As a Library

go get github.com/zerfoo/zerfoo@latest

This adds Zerfoo to your Go module for use as an embedded library.

Step 2: Pull Your Models

Zerfoo uses GGUF model files, the same format Ollama uses under the hood. You can either download models through Zerfoo’s CLI or use GGUF files you already have.

Using the CLI

# Ollama:
ollama pull gemma3:1b

# Zerfoo equivalent:
zerfoo pull gemma-3-1b-q4

Using Existing GGUF Files

If you already have GGUF files (from Ollama, llama.cpp, or HuggingFace), Zerfoo can load them directly:

zerfoo run ./models/gemma-3-1b-it-Q4_K_M.gguf

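Ollama stores its models under ~/.ollama/models/. The weights are kept as content-addressed blobs (files named sha256-<digest> in the blobs/ subdirectory, with no .gguf extension). The model blob is itself a GGUF file, and it works with Zerfoo without modification.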
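
Because Ollama's blob files carry no .gguf extension, one way to find the weights among them is to check for the 4-byte "GGUF" magic that every GGUF file starts with. The following is a minimal sketch under that assumption; the helper names are illustrative, and the blobs path assumes a default per-user Ollama install.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// hasGGUFMagic reports whether header begins with the 4-byte GGUF magic.
func hasGGUFMagic(header []byte) bool {
	return len(header) >= 4 && string(header[:4]) == "GGUF"
}

// isGGUFFile reads the first four bytes of path and checks them for the magic.
func isGGUFFile(path string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	header := make([]byte, 4)
	if _, err := io.ReadFull(f, header); err != nil {
		// A file shorter than four bytes cannot be GGUF.
		if errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) {
			return false, nil
		}
		return false, err
	}
	return hasGGUFMagic(header), nil
}

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	// Assumed location of Ollama's blob store for a per-user install.
	paths, err := filepath.Glob(filepath.Join(home, ".ollama", "models", "blobs", "*"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for _, p := range paths {
		// Unreadable entries are skipped rather than treated as fatal.
		if ok, err := isGGUFFile(p); err == nil && ok {
			fmt.Println(p)
		}
	}
}
```

Each printed path is a candidate to pass to zerfoo run in place of a .gguf filename.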

Step 3: Migrate Your API Calls

Both Ollama and Zerfoo support the OpenAI chat completions API, so migration is often as simple as changing the base URL and the model name.

Starting the Server

# Ollama:
ollama serve
# Listens on http://localhost:11434

# Zerfoo:
zerfoo serve gemma-3-1b-q4 --port 8080
# Listens on http://localhost:8080

curl

# Ollama (OpenAI-compatible endpoint):
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma3:1b","messages":[{"role":"user","content":"Hello!"}]}'

# Zerfoo:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma-3-1b-q4","messages":[{"role":"user","content":"Hello!"}]}'

The request and response formats are identical because both servers follow the OpenAI specification. Only the port and the model name differ.
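
For Go clients that talk to the HTTP API, the same switch can be made by keeping the base URL and model name configurable. The sketch below uses only the standard library and builds the same request as the curl commands above; the LLM_BASE_URL and LLM_MODEL environment variable names and the helper names are hypothetical choices made for this example.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"strings"
)

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string    `json:"model"`
	Messages []message `json:"messages"`
}

// chatURL joins a base URL such as http://localhost:8080/v1 with the
// chat completions path, tolerating a trailing slash.
func chatURL(baseURL string) string {
	return strings.TrimRight(baseURL, "/") + "/chat/completions"
}

// newChatRequest builds a single-message chat completions request.
func newChatRequest(baseURL, model, prompt string) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model:    model,
		Messages: []message{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, chatURL(baseURL), bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// envOr returns the environment variable key, or fallback when it is unset.
func envOr(key, fallback string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return fallback
}

func main() {
	// Hypothetical variable names; point them at Ollama or Zerfoo.
	baseURL := envOr("LLM_BASE_URL", "http://localhost:8080/v1")
	model := envOr("LLM_MODEL", "gemma-3-1b-q4")

	req, err := newChatRequest(baseURL, model, "Hello!")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Send with http.DefaultClient.Do(req) once a server is listening.
	fmt.Println(req.Method, req.URL)
}
```

Switching back to Ollama then means setting the base URL to http://localhost:11434/v1 and the model to gemma3:1b, with no code change.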

Python (OpenAI SDK)

from openai import OpenAI

# Ollama:
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

# Zerfoo — just change the URL:
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

# Same code from here on; only the model name differs ("gemma3:1b" on Ollama):
response = client.chat.completions.create(
    model="gemma-3-1b-q4",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Streaming

# Works the same way with Ollama and Zerfoo; only the model name differs:
stream = client.chat.completions.create(
    model="gemma-3-1b-q4",
    messages=[{"role": "user", "content": "Tell me a story."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
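
Without an SDK, a streamed response arrives as server-sent events: one "data: {json}" line per chunk, ending with "data: [DONE]", as in the OpenAI streaming format. The following Go sketch shows how a client might reassemble the text from such a stream. It runs against a canned sample rather than a live server, and the function names are illustrative.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// streamChunk models only the fields of a streamed chunk that this sketch reads.
type streamChunk struct {
	Choices []struct {
		Delta struct {
			Content string `json:"content"`
		} `json:"delta"`
	} `json:"choices"`
}

// parseSSELine extracts the delta text from one line of the stream.
// done is true when the line is the terminating "data: [DONE]" marker.
func parseSSELine(line string) (text string, done bool, err error) {
	if !strings.HasPrefix(line, "data:") {
		return "", false, nil // blank separator lines and other SSE fields
	}
	payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
	if payload == "" {
		return "", false, nil
	}
	if payload == "[DONE]" {
		return "", true, nil
	}
	var chunk streamChunk
	if err := json.Unmarshal([]byte(payload), &chunk); err != nil {
		return "", false, err
	}
	if len(chunk.Choices) == 0 {
		return "", false, nil
	}
	return chunk.Choices[0].Delta.Content, false, nil
}

// collectStream reads an event stream and concatenates the delta text.
func collectStream(r io.Reader) (string, error) {
	var out strings.Builder
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		text, done, err := parseSSELine(sc.Text())
		if err != nil {
			return out.String(), err
		}
		if done {
			break
		}
		out.WriteString(text)
	}
	return out.String(), sc.Err()
}

func main() {
	// Canned sample standing in for an HTTP response body.
	sample := "data: {\"choices\":[{\"delta\":{\"content\":\"Once \"}}]}\n\n" +
		"data: {\"choices\":[{\"delta\":{\"content\":\"upon a time\"}}]}\n\n" +
		"data: [DONE]\n"
	text, err := collectStream(strings.NewReader(sample))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(text)
}
```

In a real client, the argument to collectStream would be the response body of a chat completions request sent with "stream": true.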

Step 4: Embed in Your Go Application (Optional)

This is where Zerfoo diverges most from Ollama. Instead of running a separate server process, you can embed inference directly in your Go application:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/zerfoo/zerfoo"
)

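// model is loaded once at startup and used by the request handlers.
var model *zerfoo.Model

func main() {
    var err error
    // Load the model before serving; exit if it cannot be loaded.
    model, err = zerfoo.Load("google/gemma-3-1b")
    if err != nil {
        log.Fatal(err)
    }
    defer model.Close()

    http.HandleFunc("/chat", handleChat)
    log.Fatal(http.ListenAndServe(":9090", nil))
}

// handleChat answers every request with the reply to a fixed prompt.
func handleChat(w http.ResponseWriter, r *http.Request) {
    reply, err := model.Chat("What is the capital of France?")
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    fmt.Fprint(w, reply)
}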

No sidecar process. No HTTP calls to a local inference server. The model runs in-process with your application.
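
The handler above always sends the same prompt. A slightly more realistic variant reads the prompt from the request body. The sketch below is an illustration under stated assumptions: the chatter interface, the chatBody JSON shape, and the echoModel stand-in are hypothetical names introduced here, and it assumes model.Chat has exactly the signature used in the example above. The stand-in lets the handler run in-process without loading a model.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// chatter is a hypothetical interface covering the one method the handler needs.
type chatter interface {
	Chat(prompt string) (string, error)
}

// chatBody is an assumed request shape: {"prompt": "..."}.
type chatBody struct {
	Prompt string `json:"prompt"`
}

// newChatHandler returns a handler that forwards the posted prompt to m.
func newChatHandler(m chatter) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "POST required", http.StatusMethodNotAllowed)
			return
		}
		var body chatBody
		if err := json.NewDecoder(r.Body).Decode(&body); err != nil || body.Prompt == "" {
			http.Error(w, `expected JSON body {"prompt": "..."}`, http.StatusBadRequest)
			return
		}
		reply, err := m.Chat(body.Prompt)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		fmt.Fprint(w, reply)
	}
}

// echoModel is a stand-in for a loaded model so the sketch runs on its own.
type echoModel struct{}

func (echoModel) Chat(prompt string) (string, error) {
	return "echo: " + prompt, nil
}

// post sends body to h in-process and returns the status code and response text.
func post(h http.Handler, body string) (int, string) {
	req := httptest.NewRequest(http.MethodPost, "/chat", strings.NewReader(body))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	status, reply := post(newChatHandler(echoModel{}), `{"prompt":"What is the capital of France?"}`)
	fmt.Println(status, reply)
}
```

If the assumption about the Chat signature holds, the loaded model could be passed in directly, for example http.Handle("/chat", newChatHandler(model)).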

With Streaming

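func handleStreamChat(w http.ResponseWriter, r *http.Request) {
    // Streaming needs a ResponseWriter that can flush partial output.
    flusher, ok := w.(http.Flusher)
    if !ok {
        http.Error(w, "streaming not supported", http.StatusInternalServerError)
        return
    }

    // The request context is cancelled when the client disconnects.
    stream, err := model.ChatStream(r.Context(), "Tell me a joke.")
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    // Write each token as it arrives and flush so the client sees it right away.
    for tok := range stream {
        if tok.Done {
            break
        }
        fmt.Fprint(w, tok.Text)
        flusher.Flush()
    }
}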

With Generation Options

// ctx is a context.Context, for example r.Context() inside an HTTP handler.
result, err := model.Generate(ctx, "Explain quicksort.",
    zerfoo.WithGenTemperature(0.7),
    zerfoo.WithGenMaxTokens(256),
    zerfoo.WithGenTopP(0.9),
)
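
The WithGen... arguments have the shape of Go's functional options pattern: each option is a function that adjusts a configuration struct, and unspecified settings keep their defaults. The sketch below shows how such options typically work in general. It is not Zerfoo's implementation; every name and default value in it is made up for illustration.

```go
package main

import "fmt"

// genConfig holds illustrative generation settings.
type genConfig struct {
	temperature float64
	maxTokens   int
	topP        float64
}

// genOption mutates a genConfig; callers pass any number of them.
type genOption func(*genConfig)

func withTemperature(t float64) genOption {
	return func(c *genConfig) { c.temperature = t }
}

func withMaxTokens(n int) genOption {
	return func(c *genConfig) { c.maxTokens = n }
}

func withTopP(p float64) genOption {
	return func(c *genConfig) { c.topP = p }
}

// buildConfig starts from defaults and applies the options in order,
// so a later option overrides an earlier one for the same field.
func buildConfig(opts ...genOption) genConfig {
	cfg := genConfig{temperature: 1.0, maxTokens: 128, topP: 1.0} // made-up defaults
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

func main() {
	cfg := buildConfig(withTemperature(0.7), withMaxTokens(256))
	fmt.Printf("%+v\n", cfg)
}
```

The practical consequence for callers is that options can be omitted or added without changing the call's other arguments.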

Step 5: GPU Acceleration

Both Ollama and Zerfoo support GPU acceleration, but the mechanism differs.

Ollama bundles llama.cpp with CGo bindings. GPU support requires a C/C++ toolchain at build time and the appropriate CUDA/ROCm libraries.

Zerfoo loads GPU libraries dynamically at runtime via purego/dlopen. No C toolchain is needed at build time. If CUDA is available at runtime, Zerfoo uses it automatically:

import "github.com/zerfoo/zerfoo/inference"

m, err := inference.LoadFile("model.gguf",
    inference.WithDevice("cuda"),
    inference.WithDType("fp16"),
)

To verify GPU detection:

zerfoo serve gemma-3-1b-q4 --port 8080
# Logs will show: "using CUDA device 0" or "using CPU"

Step 6: Production Configuration

For production deployments, Zerfoo supports features that go beyond what Ollama offers out of the box:

Prometheus Metrics

zerfoo serve gemma-3-1b-q4 --port 8080 --metrics
# Metrics available at http://localhost:8080/metrics

Exposes request counts, latency histograms, tokens generated, GPU memory usage, and more.
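
A Prometheus server normally scrapes this endpoint, but for a quick check the text exposition format is simple enough to read by hand: each sample is a line of the form name{labels} value, and lines starting with # are HELP or TYPE comments. The sketch below parses a canned sample in that format. The metric names in it are placeholders, not names Zerfoo is known to export.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseSample splits one exposition-format line into its series
// (metric name plus labels) and value. ok is false for blank lines,
// comments, and lines that do not parse.
func parseSample(line string) (series string, value float64, ok bool) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return "", 0, false
	}
	// Label values may contain spaces, so split after the closing brace
	// when labels are present, otherwise at the first whitespace.
	end := strings.LastIndex(line, "}") + 1 // 0 when there are no labels
	if end == 0 {
		end = strings.IndexAny(line, " \t")
	}
	if end <= 0 || end >= len(line) {
		return "", 0, false
	}
	fields := strings.Fields(line[end:])
	if len(fields) == 0 {
		return "", 0, false
	}
	v, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return "", 0, false
	}
	return line[:end], v, true
}

func main() {
	// Placeholder metric names, standing in for a real /metrics response.
	sample := `# HELP requests_total Total requests served.
# TYPE requests_total counter
requests_total{model="gemma-3-1b-q4"} 42
tokens_generated_total 1234
`
	sc := bufio.NewScanner(strings.NewReader(sample))
	for sc.Scan() {
		if series, value, ok := parseSample(sc.Text()); ok {
			fmt.Printf("%s = %g\n", series, value)
		}
	}
}
```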

Speculative Decoding

Use a smaller draft model to accelerate generation:

zerfoo serve gemma-3-4b-q4 --port 8080 \
  --draft-model gemma-3-1b-q4

The draft model proposes tokens greedily, and the target model verifies them in a single batched forward pass. When the acceptance rate is high, this can significantly increase throughput.
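
The verification step can be illustrated with a toy example on integer token IDs. This is a sketch of greedy speculative decoding in general, not Zerfoo's code: it assumes the target model's batched pass yields one greedy prediction per draft position plus one extra, and the function name is made up.

```go
package main

import "fmt"

// acceptDraft performs the verification step of greedy speculative decoding.
//
// draft holds the k tokens proposed by the draft model. target holds k+1
// tokens from one batched pass of the target model: target[i] is the target's
// greedy choice after the current prefix followed by draft[:i].
//
// The result is the list of tokens to append to the output: the longest
// prefix of draft that the target agrees with, followed by the target's own
// token at the first disagreement, or the extra token if all drafts matched.
func acceptDraft(draft, target []int) []int {
	if len(target) != len(draft)+1 {
		return nil // malformed input for this sketch
	}
	accepted := make([]int, 0, len(draft)+1)
	for i, tok := range draft {
		if target[i] != tok {
			// First mismatch: keep the target's token, drop the rest of the draft.
			return append(accepted, target[i])
		}
		accepted = append(accepted, tok)
	}
	// Every draft token matched, so the extra target token comes for free.
	return append(accepted, target[len(draft)])
}

func main() {
	draft := []int{5, 9, 2}
	target := []int{5, 9, 7, 1}
	fmt.Println(acceptDraft(draft, target))
}
```

Every call yields at least one token, which is what the target model alone would have produced in one pass, and the output is the same as plain greedy decoding with the target. The gain comes from the calls where several draft tokens are accepted at once.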

Structured Output (JSON Mode)

Zerfoo supports grammar-constrained decoding for structured output:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-q4",
    "messages": [{"role": "user", "content": "List 3 countries as JSON"}],
    "response_format": {"type": "json_object"}
  }'
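
On the client side it is still worth decoding the reply defensively, since a generation cut off by a token limit can leave the JSON incomplete. A minimal Go sketch of that check follows, assuming the reply text has already been extracted from choices[0].message.content; the function name is illustrative.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// decodeJSONObject parses content and insists that it is a JSON object,
// which is what the json_object response format asks the model to produce.
func decodeJSONObject(content string) (map[string]any, error) {
	var obj map[string]any
	if err := json.Unmarshal([]byte(content), &obj); err != nil {
		return nil, fmt.Errorf("reply is not a JSON object: %w", err)
	}
	if obj == nil {
		// "null" unmarshals without error but is not an object.
		return nil, errors.New("reply is JSON null, not an object")
	}
	return obj, nil
}

func main() {
	obj, err := decodeJSONObject(`{"countries":["France","Japan","Brazil"]}`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(obj["countries"])
}
```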

Model Compatibility

Both frameworks use GGUF, so model compatibility is straightforward:

| Architecture | Ollama | Zerfoo |
|---|---|---|
| Llama 3 | Yes | Yes |
| Gemma 3 | Yes | Yes |
| Mistral | Yes | Yes |
| Qwen 2 | Yes | Yes |
| Phi 3/4 | Yes | Yes |
| DeepSeek V3 | Yes | Yes |

Any GGUF model that works with Ollama should work with Zerfoo. Both frameworks support the same quantization formats (Q4_K_M, Q4_0, Q8_0, FP16, etc.).

Quick Reference: Command Mapping

| Task | Ollama | Zerfoo |
|---|---|---|
| Install | curl -fsSL ollama.com/install.sh \| sh | go install github.com/zerfoo/zerfoo/cmd/zerfoo@latest |
| Pull model | ollama pull gemma3:1b | zerfoo pull gemma-3-1b-q4 |
| Run interactively | ollama run gemma3:1b | zerfoo run gemma-3-1b-q4 |
| Start server | ollama serve | zerfoo serve gemma-3-1b-q4 --port 8080 |
| List models | ollama list | zerfoo list |
| API base URL | http://localhost:11434/v1 | http://localhost:8080/v1 |
| Embed in Go | Not supported | go get github.com/zerfoo/zerfoo |

Summary

Migrating from Ollama to Zerfoo is straightforward because both use GGUF models and implement the OpenAI API. The main steps are:

  1. Install the Zerfoo CLI or add the library to your Go module
  2. Pull or reuse your existing GGUF model files
  3. Change the base URL in your API clients
  4. Optionally, embed inference directly in your Go application

The reward: 25% faster decode throughput, zero-CGo builds, in-process inference, and a single-binary deployment model.

go install github.com/zerfoo/zerfoo/cmd/zerfoo@latest
zerfoo pull gemma-3-1b-q4
zerfoo serve gemma-3-1b-q4 --port 8080