Production Deployment Guide

This guide covers deploying zerfoo serve in production with TLS, monitoring, and reverse proxy configuration.

Quick Start

zerfoo serve google/gemma-3-1b --port 8080

This starts an OpenAI-compatible API server on port 8080 with the following endpoints:

| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | Chat completion (streaming/non-streaming) |
| /v1/completions | POST | Text completion (streaming/non-streaming) |
| /v1/embeddings | POST | Text embeddings |
| /v1/models | GET | List loaded models |
| /v1/models/{id} | GET, DELETE | Get or unload a model |
| /metrics | GET | Prometheus metrics |
| /openapi.yaml | GET | OpenAPI specification |
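
As a quick way to exercise the chat endpoint from Go, the sketch below builds and sends a non-streaming request. The request fields (model, messages, stream) follow the OpenAI chat completion schema; that zerfoo accepts them unchanged is an assumption based on the "OpenAI-compatible" description, and the type and function names are illustrative.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

// chatMessage and chatRequest mirror the OpenAI chat completion schema.
// Assumption: the server accepts these standard fields as-is.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
	Stream   bool          `json:"stream"`
}

// newChatRequest builds a non-streaming POST to /v1/chat/completions.
func newChatRequest(baseURL, model, prompt string) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model:    model,
		Messages: []chatMessage{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newChatRequest("http://localhost:8080", "google/gemma-3-1b", "Say hello.")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	// Print the raw JSON response; decoding is omitted to keep the sketch short.
	io.Copy(os.Stdout, resp.Body)
}
```

The authoritative request and response shapes are in the server's own /openapi.yaml.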

CLI Flags

| Flag | Default | Description |
|---|---|---|
| --port | 8080 | Listen port |
| --cache-dir | System default | Override model cache directory |

TLS / mTLS

The serve package returns a standard http.Handler. To enable TLS, either serve that handler from an http.Server configured with a tls.Config, or terminate TLS at your reverse proxy (see the nginx section below).

For direct TLS termination at the application level, embed the server in a custom main.go:

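// Imports needed: crypto/tls, errors, log, net/http, and the zerfoo serve package.
srv := serve.NewServer(model,
    serve.WithLogger(logger),
    serve.WithMetrics(collector),
)
httpServer := &http.Server{
    Addr:    ":8443",
    Handler: srv.Handler(),
    TLSConfig: &tls.Config{
        MinVersion: tls.VersionTLS13, // reject TLS 1.2 and older
    },
}
// ListenAndServeTLS blocks until the server stops; report any real failure.
if err := httpServer.ListenAndServeTLS("server.crt", "server.key"); err != nil && !errors.Is(err, http.ErrServerClosed) {
    log.Fatal(err)
}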

For mTLS (mutual TLS), add client certificate verification:

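// Load the CA that signed the client certificates.
caCert, err := os.ReadFile("ca.crt")
if err != nil {
    log.Fatalf("read CA certificate: %v", err)
}
caCertPool := x509.NewCertPool()
if !caCertPool.AppendCertsFromPEM(caCert) {
    log.Fatal("ca.crt contains no valid PEM certificates")
}

// Require a client certificate signed by a CA in the pool.
// Assign this to the TLSConfig field of the http.Server.
tlsConfig := &tls.Config{
    MinVersion: tls.VersionTLS13,
    ClientAuth: tls.RequireAndVerifyClientCert,
    ClientCAs:  caCertPool,
}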
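
With RequireAndVerifyClientCert set, callers must present a certificate of their own. A minimal sketch of the client side, using only the Go standard library, might look like the following. The file names, the port, and the newMTLSClient helper are placeholders for illustration.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
	"os"
)

// newMTLSClient returns an HTTP client that presents a client certificate
// and trusts only the given CA when verifying the server.
func newMTLSClient(certFile, keyFile, caFile string) (*http.Client, error) {
	clientCert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, fmt.Errorf("load client key pair: %w", err)
	}
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, fmt.Errorf("read CA certificate: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no valid PEM certificates in CA file")
	}
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				MinVersion:   tls.VersionTLS13,
				Certificates: []tls.Certificate{clientCert},
				RootCAs:      pool,
			},
		},
	}, nil
}

func main() {
	// File names and the address are placeholders.
	client, err := newMTLSClient("client.crt", "client.key", "ca.crt")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	resp, err := client.Get("https://localhost:8443/v1/models")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```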

Prometheus Metrics

The GET /metrics endpoint exposes metrics in Prometheus text exposition format. Available metrics:

| Metric | Type | Description |
|---|---|---|
| requests_total | Counter | Total completed requests |
| tokens_generated_total | Counter | Total tokens generated |
| tokens_per_second | Gauge | Rolling token generation rate |
| request_latency_ms | Histogram | Request latency (buckets: 10, 50, 100, 250, 500, 1000, 2500, 5000, 10000 ms) |

Prometheus scrape config:

scrape_configs:
  - job_name: zerfoo
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: /metrics

Graceful Shutdown

The server uses a shutdown.Coordinator that closes components in reverse registration order when the process receives SIGINT or SIGTERM:

  1. The HTTP server stops accepting new connections.
  2. In-flight requests are allowed to complete.
  3. The batch scheduler (if attached) is drained and stopped.
  4. The model is closed and GPU memory is released.

No special configuration is needed – the CLI wires this automatically.

systemd Unit File

[Unit]
Description=Zerfoo Inference Server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=zerfoo
Group=zerfoo
ExecStart=/usr/local/bin/zerfoo serve google/gemma-3-1b --port 8080
Restart=on-failure
RestartSec=5
LimitNOFILE=65536
LimitMEMLOCK=infinity
Environment=HOME=/var/lib/zerfoo
WorkingDirectory=/var/lib/zerfoo

# GPU access
SupplementaryGroups=video render

# Hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/zerfoo

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable --now zerfoo
sudo journalctl -u zerfoo -f

Reverse Proxy (nginx)

upstream zerfoo {
    server 127.0.0.1:8080;
    keepalive 32;
}

server {
    listen 443 ssl http2;
    server_name api.example.com;

    ssl_certificate     /etc/ssl/certs/api.example.com.crt;
    ssl_certificate_key /etc/ssl/private/api.example.com.key;
    ssl_protocols       TLSv1.3;

    # Streaming support -- disable buffering for SSE
    proxy_buffering off;
    proxy_cache off;

    # Timeouts for long-running inference requests
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;

    location / {
        proxy_pass http://zerfoo;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header Connection "";
    }

    # Metrics -- restrict to internal network
    location /metrics {
        allow 10.0.0.0/8;
        allow 172.16.0.0/12;
        allow 192.168.0.0/16;
        deny all;
        proxy_pass http://zerfoo;
    }
}

Resource Sizing

CPU-Only

| Model Size | RAM | CPU Cores | Notes |
|---|---|---|---|
| 1B (Q4_K_M) | 2 GB | 4+ | Suitable for development and light traffic |
| 3B (Q4_K_M) | 4 GB | 8+ | Good for moderate throughput |
| 7B (Q4_K_M) | 8 GB | 8+ | Recommended minimum for production |

GPU (CUDA)

| Model Size | VRAM | System RAM | Notes |
|---|---|---|---|
| 1B (Q4_K_M) | 1 GB | 4 GB | Single consumer GPU |
| 7B (Q4_K_M) | 6 GB | 8 GB | RTX 3060 or better |
| 13B (Q4_K_M) | 10 GB | 16 GB | RTX 3080/4080 or better |
| 70B (Q4_K_M) | 40 GB | 64 GB | A100/H100 or multi-GPU |

General Guidelines

  • Memory: Model weights are memory-mapped. RSS will be close to the GGUF file size plus KV cache overhead. Set LimitMEMLOCK=infinity in systemd so that memory locking is not capped by the default limit; locked pages cannot be swapped out.
  • File descriptors: Set LimitNOFILE=65536 for high-concurrency workloads.
  • GPU: Ensure the video and render groups are assigned for GPU access. Zerfoo loads GPU backends dynamically via purego – no CGo or special build flags are needed.
  • Batch scheduling: For throughput-oriented workloads (non-streaming), attach a BatchScheduler to group requests and improve GPU utilization.
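
As a rough worked example of the memory guideline above, the KV cache for one sequence in a standard transformer can be estimated as 2 x layers x KV heads x head dimension x context length x bytes per element. The sketch below applies that formula to a hypothetical model shape and a hypothetical 4 GiB GGUF file. Whether zerfoo lays out its cache exactly this way is an assumption, so treat the result as an order-of-magnitude estimate.

```go
package main

import "fmt"

// kvCacheBytes estimates the KV cache size for a single sequence:
// 2 (keys and values) x layers x KV heads x head dim x context x bytes per element.
func kvCacheBytes(layers, kvHeads, headDim, contextLen, bytesPerElem int64) int64 {
	return 2 * layers * kvHeads * headDim * contextLen * bytesPerElem
}

func main() {
	// Hypothetical model shape with a 16-bit cache; not taken from a real GGUF file.
	kv := kvCacheBytes(32, 8, 128, 4096, 2)
	const ggufBytes = 4 << 30 // hypothetical 4 GiB weight file

	fmt.Printf("KV cache: %d MiB\n", kv>>20)
	fmt.Printf("estimated RSS: %.1f GiB\n", float64(ggufBytes+kv)/float64(1<<30))
	// Output:
	// KV cache: 512 MiB
	// estimated RSS: 4.5 GiB
}
```

The cache grows linearly with context length and with the number of concurrent sequences, so long contexts or heavy batching can outweigh the weights themselves.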

For enterprise deployments with Kubernetes, multi-GPU inference, auto-scaling, and advanced security hardening, see the Enterprise Deployment Guide.