Running the OpenAI-Compatible API Server#

This tutorial shows how to serve a model over HTTP using the Zerfoo API server, which implements the OpenAI API specification. Any client library or tool that works with the OpenAI API can connect to Zerfoo with a one-line base URL change.

Starting the Server#

The simplest way to start serving is with the serve CLI command:

zerfoo serve gemma-3-1b-q4

This downloads the model (if not already cached), loads it, and starts an HTTP server on localhost:8080. You can customize the host and port:

zerfoo serve gemma-3-1b-q4 --host 0.0.0.0 --port 3000

For GPU inference:

zerfoo serve gemma-3-1b-q4 --device cuda

Available Endpoints#

The server exposes these OpenAI-compatible endpoints:

| Method | Path | Description |
| ------ | ---- | ----------- |
| POST | /v1/chat/completions | Chat completion (multi-turn conversation) |
| POST | /v1/completions | Text completion (single prompt) |
| POST | /v1/embeddings | Text embeddings |
| POST | /v1/audio/transcriptions | Audio transcription (when a transcriber is configured) |
| GET | /v1/models | List loaded models |
| GET | /v1/models/{id} | Get model information |
| DELETE | /v1/models/{id} | Unload a model |
| GET | /metrics | Prometheus metrics |
| GET | /openapi.yaml | OpenAPI specification |

Making Requests with curl#

Chat Completion#

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-q4",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_tokens": 64
  }'

Streaming#

Add "stream": true to receive server-sent events (SSE) as tokens are generated:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-q4",
    "messages": [
      {"role": "user", "content": "Write a poem about Go."}
    ],
    "stream": true,
    "max_tokens": 128
  }'

Each SSE event contains a JSON chunk with the delta token. The stream ends with data: [DONE].
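A streaming client only needs to read the body line by line, pick out the data: lines, and stop at the [DONE] sentinel. The sketch below does this over a hard-coded sample stream (a real client would scan resp.Body instead); the chunk struct mirrors the delta field of the streaming schema:

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// chunk mirrors the delta portion of a streaming chat completion event.
type chunk struct {
	Choices []struct {
		Delta struct {
			Content string `json:"content"`
		} `json:"delta"`
	} `json:"choices"`
}

// collectSSE reads an SSE body line by line, concatenating delta
// tokens until the [DONE] sentinel.
func collectSSE(body string) string {
	var b strings.Builder
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "data: ") {
			continue // skip blank separator lines between events
		}
		payload := strings.TrimPrefix(line, "data: ")
		if payload == "[DONE]" {
			break
		}
		var c chunk
		if err := json.Unmarshal([]byte(payload), &c); err != nil {
			continue
		}
		if len(c.Choices) > 0 {
			b.WriteString(c.Choices[0].Delta.Content)
		}
	}
	return b.String()
}

func main() {
	// Illustrative stream body with two delta chunks.
	stream := "data: {\"choices\":[{\"delta\":{\"content\":\"Go \"}}]}\n\n" +
		"data: {\"choices\":[{\"delta\":{\"content\":\"rocks\"}}]}\n\n" +
		"data: [DONE]\n"
	fmt.Println(collectSSE(stream)) // prints "Go rocks"
}
```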

Text Completion#

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-q4",
    "prompt": "The Go programming language is",
    "max_tokens": 64,
    "temperature": 0.5
  }'

List Models#

curl http://localhost:8080/v1/models

Using with the OpenAI Python Client#

Any OpenAI-compatible client library works. Here is an example with the official Python client:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",  # Zerfoo does not require an API key by default
)

response = client.chat.completions.create(
    model="gemma-3-1b-q4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain transformers in ML."},
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)

For streaming:

stream = client.chat.completions.create(
    model="gemma-3-1b-q4",
    messages=[{"role": "user", "content": "Write a haiku."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Starting the Server from Go Code#

You can embed the server directly in your Go application:

package main

import (
	"log"
	"net/http"

	"github.com/zerfoo/zerfoo/inference"
	"github.com/zerfoo/zerfoo/serve"
)

func main() {
	model, err := inference.LoadFile("gemma-3-1b-it-q4_0.gguf",
		inference.WithDevice("cuda"),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer model.Close()

	srv := serve.NewServer(model)

	log.Println("Listening on :8080")
	log.Fatal(http.ListenAndServe(":8080", srv.Handler()))
}

Server Options#

The serve.NewServer function accepts options for logging, metrics, batch scheduling, speculative decoding, and multi-GPU distribution:

srv := serve.NewServer(model,
	serve.WithLogger(logger),
	serve.WithMetrics(metricsCollector),
	serve.WithBatchScheduler(batchScheduler),
	serve.WithDraftModel(draftModel),
	serve.WithGPUs([]int{0, 1}),
)

Speculative decoding: When a draft model is set with WithDraftModel, the server uses speculative decoding for all completion requests. A smaller, faster model proposes tokens and the target model verifies them in a single batched forward pass, improving decode throughput.
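The accept/reject step at the heart of speculative decoding can be illustrated with a toy example. Here both "models" are stand-ins that return fixed token runs; the function accepts the longest prefix on which draft and target agree, then takes the target's own token at the first mismatch. This is only a sketch of the idea, not Zerfoo's actual implementation:

```go
package main

import "fmt"

// verifyDraft mimics one round of speculative decoding: the draft model
// proposes a run of tokens, the target model scores the same positions
// in one batched pass, and the longest agreeing prefix is accepted,
// followed by the target's correction at the first mismatch.
func verifyDraft(draft, target []string) []string {
	accepted := []string{}
	for i := range draft {
		if i < len(target) && draft[i] == target[i] {
			accepted = append(accepted, draft[i])
			continue
		}
		if i < len(target) {
			accepted = append(accepted, target[i]) // target's correction
		}
		break
	}
	return accepted
}

func main() {
	draft := []string{"the", "quick", "brown", "cat"}
	target := []string{"the", "quick", "brown", "fox"}
	// Three draft tokens are accepted for the cost of one target pass.
	fmt.Println(verifyDraft(draft, target)) // [the quick brown fox]
}
```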

Batch scheduling: When a BatchScheduler is attached with WithBatchScheduler, incoming non-streaming requests are grouped into batches for higher throughput under load.

Prometheus Metrics#

The server exposes a /metrics endpoint in Prometheus format. Key metrics include:

  • Request count and latency per endpoint
  • Token generation rate (tokens per second)
  • Speculative decoding acceptance rate (when enabled)

Point your Prometheus scrape config at http://localhost:8080/metrics to collect these metrics.
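A minimal scrape job for this, assuming Prometheus defaults for the scrape interval and the standard /metrics path, might look like:

```yaml
scrape_configs:
  - job_name: zerfoo
    static_configs:
      - targets: ["localhost:8080"]
```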

Monitoring and Health#

The /v1/models endpoint serves as a lightweight health check. If the model is loaded and ready, it returns model metadata. After a DELETE /v1/models/{id} call, the model is unloaded and subsequent inference requests return an error.

The server includes built-in recovery middleware that catches panics during request handling and returns a 500 response instead of crashing the process.

Calling the Server from Go#

You do not need a dedicated SDK on the Go side either; any HTTP client works. Here is an example using only the standard library:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	body := map[string]interface{}{
		"model": "gemma-3-1b-q4",
		"messages": []map[string]string{
			{"role": "user", "content": "What is Go?"},
		},
		"max_tokens": 64,
	}
	data, _ := json.Marshal(body)

	resp, err := http.Post("http://localhost:8080/v1/chat/completions",
		"application/json", bytes.NewReader(data))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}

This works because the server speaks the same JSON schema as the OpenAI API. Any HTTP client in any language can send requests to Zerfoo without a dedicated SDK.

What's Next#