Enterprise Deployment Guide#
This guide covers deploying Zerfoo in production-grade enterprise environments: Kubernetes with Helm and the ZerfooInferenceService operator, multi-GPU inference, TLS/mTLS, Prometheus monitoring, adaptive batching, auto-scaling, model repositories, multi-model serving with LRU eviction, security hardening, and troubleshooting.
For single-node deployments with systemd and nginx, see the Production Deployment Guide.
1. Prerequisites and System Requirements#
Software#
| Requirement | Minimum | Recommended | Notes |
|---|---|---|---|
| Go | 1.25+ | latest stable | Required only for building from source |
| Kubernetes | 1.28+ | 1.30+ | autoscaling/v2 API required for HPA |
| Helm | 3.12+ | latest stable | For chart-based deployment |
| Linux kernel | 5.15+ | 6.1+ | 6.1+ recommended for NVIDIA open driver |
| NVIDIA GPU Operator | 24.3+ | latest | Installs drivers, device plugin, container toolkit |
| NVIDIA driver | 525+ | 550+ | CUDA 12.0+ support |
| CUDA | 12.0 | 12.4+ | Loaded dynamically at runtime via purego – no CGo needed |
Hardware – CPU-Only#
| Model Size | RAM | CPU Cores | Notes |
|---|---|---|---|
| 1B Q4_K_M | 2 GB | 4+ | Development and light traffic |
| 3B Q4_K_M | 4 GB | 8+ | Moderate throughput |
| 7B Q4_K_M | 8 GB | 8+ | Recommended minimum for production |
| 13B Q4_K_M | 16 GB | 16+ |
Hardware – GPU (CUDA / ROCm)#
| Model Size | VRAM | System RAM | GPU Examples |
|---|---|---|---|
| 1B Q4_K_M | 1 GB | 4 GB | RTX 3060 |
| 7B Q4_K_M | 6 GB | 8 GB | RTX 3080, A10 |
| 13B Q4_K_M | 10 GB | 16 GB | RTX 4080, A30 |
| 70B Q4_K_M | 40 GB | 64 GB | A100 80GB, H100, or multi-GPU |
Cluster Requirements#
- GPU nodes: NVIDIA GPU Operator installed, nodes labeled with
nvidia.com/gpu.present=true. Zerfoo loads CUDA at runtime viadlopen– no special build flags are needed. - Storage: A
PersistentVolumeprovisioner (e.g.,local-path, EBS CSI, GCE PD) for model weight storage. - Container registry access: Images are published to
ghcr.io/zerfoo/zerfoo. ConfigureimagePullSecretsif your cluster requires authentication.
Key Notes#
- No CGo required. Zerfoo loads GPU backends dynamically at runtime via
purego/dlopen. Build withgo build ./...everywhere; nocudabuild tag is needed for runtime GPU acceleration. - GGUF is the only supported model format. Ensure all models are in GGUF format
before deployment. Use
zonnxto convert ONNX models. - Model weights are memory-mapped. RSS will be close to the GGUF file size plus KV
cache overhead. Set
LimitMEMLOCK=infinityin systemd for non-Kubernetes deployments.
2. Installation#
Pre-built Container Images#
# Pull the latest release image
docker pull ghcr.io/zerfoo/zerfoo:latest
# Or pin to a specific version
docker pull ghcr.io/zerfoo/zerfoo:0.1.0The container image includes the zerfoo binary with all CLI commands
(serve, run, pull, predict, tokenize, worker).
Building from Source#
# CPU-only build (zero CGo, compiles everywhere)
go build -o zerfoo ./cmd/zerfoo
# Build container image
docker build -t ghcr.io/zerfoo/zerfoo:custom .No build tags are required for CPU-only operation. GPU acceleration is loaded dynamically at runtime when CUDA libraries are available on the host.
3. Kubernetes Deployment#
Helm Chart#
Zerfoo ships a Helm chart at deploy/helm/zerfoo/.
Install#
helm install zerfoo deploy/helm/zerfoo/ \
--namespace zerfoo \
--create-namespace \
--set model.name="google/gemma-3-1b" \
--set model.quantization="Q4_K_M"Key Values#
| Value | Default | Description |
|---|---|---|
replicaCount | 1 | Number of inference pods |
image.repository | ghcr.io/zerfoo/zerfoo | Container image |
image.tag | Chart appVersion | Image tag |
model.name | "" | Model ID (e.g., google/gemma-3-1b) |
model.quantization | Q4_K_M | Quantization level |
model.cacheDir | /models | Model cache directory inside the container |
server.port | 8080 | Inference server listen port |
resources.requests.cpu | 2 | CPU request |
resources.limits.memory | 8Gi | Memory limit |
autoscaling.enabled | false | Enable HPA |
autoscaling.minReplicas | 1 | Minimum replicas |
autoscaling.maxReplicas | 10 | Maximum replicas |
ingress.enabled | false | Enable Ingress resource |
GPU-Enabled Deployment#
# values-gpu.yaml
resources:
requests:
cpu: "4"
memory: 16Gi
limits:
cpu: "8"
memory: 32Gi
nvidia.com/gpu: "1"
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
nodeSelector:
nvidia.com/gpu.present: "true"
volumes:
- name: model-storage
persistentVolumeClaim:
claimName: zerfoo-models
volumeMounts:
- name: model-storage
mountPath: /modelshelm install zerfoo deploy/helm/zerfoo/ \
--namespace zerfoo \
--create-namespace \
-f values-gpu.yaml \
--set model.name="meta-llama/Llama-3-8B"ZerfooInferenceService Operator#
The ZerfooInferenceService operator manages the lifecycle of Zerfoo inference
servers on Kubernetes. It reconciles custom resources into Deployments with
health probes, Prometheus scraping annotations, GPU resource requests, Services,
and HorizontalPodAutoscalers.
The health endpoints are provided by the health package:
| Endpoint | Description |
|---|---|
GET /healthz | Liveness probe – process is alive |
GET /readyz | Readiness probe – model is loaded and serving |
GET /health | Combined health check |
GET /debug/pprof/ | Runtime profiling (restrict to internal network) |
Custom Resource Definition#
apiVersion: zerfoo.feza.ai/v1
kind: ZerfooInferenceService
metadata:
name: llama3-8b
namespace: zerfoo
spec:
modelRef: "meta-llama/Llama-3-8B-Q4_K_M"
replicas: 3
minReplicas: 2
maxReplicas: 10
resources:
cpu: "4"
memory: "16Gi"
gpuMemory: "24Gi"
healthCheck:
path: "/health"
interval: 10s
timeout: 5sThe operator creates the following Kubernetes resources:
| Resource | Naming Convention | Purpose |
|---|---|---|
| Deployment | <name>-primary | Primary inference pods |
| Service | <name>-svc | ClusterIP service with selector app: <name> |
| HPA | <name>-hpa | Autoscaler (when minReplicas and maxReplicas are set) |
Canary Deployments#
The operator supports canary deployments with weighted traffic splitting:
apiVersion: zerfoo.feza.ai/v1
kind: ZerfooInferenceService
metadata:
name: llama3-8b
namespace: zerfoo
spec:
modelRef: "meta-llama/Llama-3-8B-Q4_K_M"
replicas: 3
minReplicas: 2
maxReplicas: 10
resources:
cpu: "4"
memory: "16Gi"
gpuMemory: "24Gi"
healthCheck:
path: "/health"
interval: 10s
timeout: 5s
canary:
modelRef: "meta-llama/Llama-3-8B-Q8_0"
weight: 10 # 10% traffic to canaryThis creates a <name>-canary Deployment alongside the primary, with the
Service distributing traffic according to the configured weights (90/10 in
this example).
Namespace and RBAC#
apiVersion: v1
kind: Namespace
metadata:
name: zerfoo-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: zerfoo-server
namespace: zerfoo-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: zerfoo-server
rules:
- apiGroups: [""]
resources: ["configmaps", "secrets"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: zerfoo-server
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: zerfoo-server
subjects:
- kind: ServiceAccount
name: zerfoo-server
namespace: zerfoo-systemThe Helm chart creates a dedicated ServiceAccount with
automountServiceAccountToken: false by default.
For the operator, create a scoped Role:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: zerfoo-operator
namespace: zerfoo
rules:
- apiGroups: ["apps"]
resources: ["deployments"]
verbs: ["get", "list", "create", "update", "delete"]
- apiGroups: [""]
resources: ["services"]
verbs: ["get", "list", "create", "update"]
- apiGroups: ["autoscaling"]
resources: ["horizontalpodautoscalers"]
verbs: ["get", "list", "create", "update"]
- apiGroups: ["zerfoo.feza.ai"]
resources: ["zerfooinferenceservices"]
verbs: ["get", "list", "watch", "update"]Deployment Manifest#
apiVersion: apps/v1
kind: Deployment
metadata:
name: zerfoo-gemma-3-7b
namespace: zerfoo-system
labels:
app: zerfoo
model: gemma-3-7b
spec:
replicas: 2
selector:
matchLabels:
app: zerfoo
model: gemma-3-7b
template:
metadata:
labels:
app: zerfoo
model: gemma-3-7b
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
serviceAccountName: zerfoo-server
runtimeClassName: nvidia # omit for CPU-only
containers:
- name: zerfoo
image: ghcr.io/zerfoo/zerfoo:v1.0.0
command: ["zerfoo", "serve"]
args:
- "google/gemma-3-7b-it-q4_k_m"
- "--port"
- "8080"
- "--cache-dir"
- "/models"
# For multi-GPU, add: --gpus 0,1
ports:
- name: http
containerPort: 8080
protocol: TCP
- name: health
containerPort: 8081
protocol: TCP
env:
- name: GOMAXPROCS
valueFrom:
resourceFieldRef:
resource: limits.cpu
divisor: "1"
resources:
requests:
cpu: "4"
memory: "16Gi"
nvidia.com/gpu: "1"
limits:
cpu: "8"
memory: "24Gi"
nvidia.com/gpu: "1"
volumeMounts:
- name: models
mountPath: /models
readOnly: true
- name: tls-certs
mountPath: /etc/zerfoo/tls
readOnly: true
livenessProbe:
httpGet:
path: /healthz
port: health
initialDelaySeconds: 30
periodSeconds: 10
failureThreshold: 3
readinessProbe:
httpGet:
path: /readyz
port: health
initialDelaySeconds: 60
periodSeconds: 5
failureThreshold: 6
volumes:
- name: models
persistentVolumeClaim:
claimName: zerfoo-models
- name: tls-certs
secret:
secretName: zerfoo-tls
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: zerfoo
model: gemma-3-7bAdjust livenessProbe.initialDelaySeconds based on model size – larger models
take longer to load into GPU memory. A 70B model on a single GPU may need
60-120 seconds.
Service and Ingress#
apiVersion: v1
kind: Service
metadata:
name: zerfoo-gemma-3-7b
namespace: zerfoo-system
spec:
selector:
app: zerfoo
model: gemma-3-7b
ports:
- name: http
port: 80
targetPort: http
- name: https
port: 443
targetPort: http
type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: zerfoo
namespace: zerfoo-system
annotations:
nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
nginx.ingress.kubernetes.io/proxy-buffering: "off" # required for SSE streaming
nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
ingressClassName: nginx
tls:
- hosts:
- api.example.com
secretName: zerfoo-tls-ingress
rules:
- host: api.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: zerfoo-gemma-3-7b
port:
name: httpFor streaming (SSE) support, ensure your load balancer or Ingress controller
disables response buffering. The proxy-buffering: "off" annotation above
handles this for nginx.
4. Scaling#
Horizontal Pod Autoscaling#
Enable HPA via the Helm chart:
# values-autoscaling.yaml
autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 10
targetCPUUtilizationPercentage: 80This creates a HorizontalPodAutoscaler using the autoscaling/v2 API:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: zerfoo
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: zerfoo
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 80Custom Metrics HPA#
For GPU workloads, scale on custom metrics instead of CPU. Use the
Prometheus Adapter to
expose tokens_per_second or request_latency_ms as Kubernetes custom metrics:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: zerfoo-gemma-3-7b
namespace: zerfoo-system
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: zerfoo-gemma-3-7b
minReplicas: 2
maxReplicas: 8
metrics:
- type: Pods
pods:
metric:
name: tokens_per_second
target:
type: AverageValue
averageValue: "150" # scale out when avg TPS per pod drops below 150
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70You can also scale on request latency:
metrics:
- type: Pods
pods:
metric:
name: request_latency_ms_p99
target:
type: AverageValue
averageValue: "500"Disaggregated Prefill/Decode#
For high-throughput deployments, Zerfoo supports disaggregated serving where prefill and decode phases run on separate worker pools:
- Gateway: Routes requests, distributes KV blocks between prefill and decode workers using least-loaded scheduling.
- Prefill workers: Handle prompt processing (compute-intensive, benefits from high-bandwidth GPUs).
- Decode workers: Handle autoregressive token generation (memory-bandwidth bound).
Configure the gateway with separate worker pools:
# Prefill workers (compute-optimized GPU nodes)
apiVersion: apps/v1
kind: Deployment
metadata:
name: zerfoo-prefill
spec:
replicas: 2
template:
spec:
containers:
- name: zerfoo
image: ghcr.io/zerfoo/zerfoo:latest
args: ["worker", "--role", "prefill", "--port", "50051"]
resources:
limits:
nvidia.com/gpu: "1"
---
# Decode workers (memory-optimized GPU nodes)
apiVersion: apps/v1
kind: Deployment
metadata:
name: zerfoo-decode
spec:
replicas: 4
template:
spec:
containers:
- name: zerfoo
image: ghcr.io/zerfoo/zerfoo:latest
args: ["worker", "--role", "decode", "--port", "50052"]
resources:
limits:
nvidia.com/gpu: "1"5. Multi-GPU Inference#
Zerfoo distributes a single model across multiple GPUs using the --gpus flag,
which accepts a comma-separated list of NVIDIA device IDs.
CLI Usage#
# Single GPU (default -- uses GPU 0 if CUDA is available)
zerfoo serve meta-llama/llama-3-70b-q4_k_m
# Two-GPU tensor parallel (GPUs 0 and 1)
zerfoo serve meta-llama/llama-3-70b-q4_k_m --gpus 0,1
# Four-GPU for a 70B model at full precision
zerfoo serve meta-llama/llama-3-70b-q8_0 --gpus 0,1,2,3GPU IDs must be non-negative integers, unique, and correspond to physical device
ordinals as reported by nvidia-smi.
Go API#
import (
"github.com/zerfoo/zerfoo/inference"
"github.com/zerfoo/zerfoo/serve"
)
model, err := inference.Load("meta-llama/llama-3-70b-q4_k_m")
if err != nil {
log.Fatal(err)
}
srv := serve.NewServer(model,
serve.WithGPUs([]int{0, 1, 2, 3}), // distribute across 4 GPUs
serve.WithLogger(logger),
serve.WithMetrics(collector),
)Kubernetes Multi-GPU Pod#
resources:
limits:
nvidia.com/gpu: "4" # request 4 GPUs
args:
- "meta-llama/llama-3-70b-q4_k_m"
- "--port"
- "8080"
- "--gpus"
- "0,1,2,3"Set CUDA_VISIBLE_DEVICES in the environment when you need explicit device
mapping within a shared node:
env:
- name: CUDA_VISIBLE_DEVICES
value: "0,1,2,3"Sizing Guidelines#
| Model | Quantization | GPUs | VRAM Each |
|---|---|---|---|
| 7B | Q4_K_M | 1x | 6 GB |
| 13B | Q4_K_M | 1x | 10 GB |
| 70B | Q4_K_M | 2x A100 40GB | 40 GB |
| 70B | Q8_0 | 4x A100 40GB | 40 GB |
| 405B | Q4_K_M | 8x H100 80GB | 80 GB |
6. TLS / mTLS Configuration#
The serve package returns a standard http.Handler. TLS is configured at the Go
http.Server level or terminated at the ingress/proxy layer.
Ingress TLS (Recommended)#
Use the Helm chart’s built-in Ingress with TLS:
# values-tls.yaml
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- host: inference.example.com
paths:
- path: /
pathType: Prefix
tls:
- secretName: zerfoo-tls
hosts:
- inference.example.comApplication-Level TLS (TLS 1.3)#
For end-to-end encryption without an Ingress controller, embed TLS directly in the server:
import (
"crypto/tls"
"net/http"
"github.com/zerfoo/zerfoo/serve"
)
srv := serve.NewServer(model,
serve.WithLogger(logger),
serve.WithMetrics(collector),
)
httpServer := &http.Server{
Addr: ":8443",
Handler: srv.Handler(),
TLSConfig: &tls.Config{
MinVersion: tls.VersionTLS13,
CurvePreferences: []tls.CurveID{
tls.X25519,
tls.CurveP256,
},
},
ReadHeaderTimeout: 10 * time.Second,
}
if err := httpServer.ListenAndServeTLS("server.crt", "server.key"); err != nil {
log.Fatal(err)
}Mutual TLS (mTLS)#
mTLS is required for service-to-service authentication in zero-trust environments.
For the HTTP serving layer, configure mTLS at the application level:
import (
"crypto/tls"
"crypto/x509"
"os"
)
caCert, err := os.ReadFile("/etc/zerfoo/tls/ca.crt")
if err != nil {
log.Fatal(err)
}
caCertPool := x509.NewCertPool()
caCertPool.AppendCertsFromPEM(caCert)
cert, err := tls.LoadX509KeyPair(
"/etc/zerfoo/tls/server.crt",
"/etc/zerfoo/tls/server.key",
)
if err != nil {
log.Fatal(err)
}
tlsConfig := &tls.Config{
MinVersion: tls.VersionTLS13,
Certificates: []tls.Certificate{cert},
ClientAuth: tls.RequireAndVerifyClientCert,
ClientCAs: caCertPool,
}
httpServer := &http.Server{
Addr: ":8443",
Handler: srv.Handler(),
TLSConfig: tlsConfig,
}Or use a service mesh (Istio, Linkerd) for transparent mTLS between all pods.
Certificate Management with cert-manager#
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: zerfoo-tls
namespace: zerfoo-system
spec:
secretName: zerfoo-tls
duration: 2160h # 90 days
renewBefore: 360h # renew 15 days before expiry
subject:
organizations:
- Feza Inc
dnsNames:
- api.example.com
- zerfoo-gemma-3-7b.zerfoo-system.svc.cluster.local
issuerRef:
name: letsencrypt-prod
kind: ClusterIssuerNginx TLS Termination (Reverse Proxy)#
upstream zerfoo {
server 127.0.0.1:8080;
keepalive 32;
}
server {
listen 443 ssl http2;
server_name api.example.com;
ssl_certificate /etc/ssl/certs/api.example.com.crt;
ssl_certificate_key /etc/ssl/private/api.example.com.key;
ssl_protocols TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
# Required for SSE streaming (token-by-token responses)
proxy_buffering off;
proxy_cache off;
# Long inference requests
proxy_read_timeout 300s;
proxy_send_timeout 300s;
location / {
proxy_pass http://zerfoo;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header Connection "";
}
# Restrict metrics to internal network
location /metrics {
allow 10.0.0.0/8;
allow 172.16.0.0/12;
allow 192.168.0.0/16;
deny all;
proxy_pass http://zerfoo;
}
}7. Monitoring and Observability#
Prometheus Metrics#
Every Zerfoo server exposes a GET /metrics endpoint in Prometheus text
exposition format.
| Metric | Type | Description |
|---|---|---|
requests_total | Counter | Total completed inference requests |
tokens_generated_total | Counter | Total tokens generated across all requests |
tokens_per_second | Gauge | Rolling token generation rate (tok/s) |
speculative_acceptance_rate | Gauge | Speculative decoding acceptance rate (when enabled) |
request_latency_ms | Histogram | Request latency distribution |
Histogram buckets: 10, 50, 100, 250, 500, 1000, 2500, 5000, 10000 ms.
Prometheus Scrape Configuration#
Static target:
scrape_configs:
- job_name: zerfoo
scrape_interval: 15s
static_configs:
- targets:
- "zerfoo-host:8080"
metrics_path: /metricsKubernetes pod discovery:
scrape_configs:
- job_name: zerfoo-pods
scrape_interval: 15s
kubernetes_sd_configs:
- role: pod
namespaces:
names: [zerfoo-system]
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: "true"
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__Or use a PodMonitor if you have the Prometheus Operator:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: zerfoo
namespace: zerfoo
spec:
selector:
matchLabels:
app.kubernetes.io/name: zerfoo
podMetricsEndpoints:
- port: http
path: /metrics
interval: 15sGrafana Dashboard#
Recommended panels for a Zerfoo operations dashboard:
| Panel | PromQL Query | Description |
|---|---|---|
| Request Rate | rate(requests_total[5m]) | Requests per second |
| Token Throughput | rate(tokens_generated_total[5m]) | Tokens per second (cluster-wide) |
| Tokens/s per Pod | tokens_per_second | Per-pod generation rate |
| P50 Latency | histogram_quantile(0.5, rate(request_latency_ms_bucket[5m])) | Median request latency |
| P95 Latency | histogram_quantile(0.95, rate(request_latency_ms_bucket[5m])) | 95th percentile latency |
| P99 Latency | histogram_quantile(0.99, rate(request_latency_ms_bucket[5m])) | Tail latency |
| Speculative Acceptance | speculative_acceptance_rate | Draft model acceptance rate |
| GPU Memory | nvidia_gpu_memory_used_bytes (from DCGM exporter) | GPU memory utilization |
Alerting Rules#
groups:
- name: zerfoo
rules:
- alert: ZerfooHighLatency
expr: |
histogram_quantile(0.99, rate(request_latency_ms_bucket[5m])) > 5000
for: 2m
labels:
severity: warning
annotations:
summary: "p99 request latency above 5 s"
description: "p99 latency is {{ $value }}ms"
- alert: ZerfooLowThroughput
expr: tokens_per_second < 10
for: 5m
labels:
severity: warning
annotations:
summary: "Token throughput critically low"
description: "Tokens/s: {{ $value }}"
- alert: ZerfooDown
expr: up{job="zerfoo-pods"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Zerfoo instance down"
- alert: ZerfooNoRequests
expr: rate(requests_total[5m]) == 0
for: 10m
labels:
severity: critical
annotations:
summary: "Zerfoo is not processing any requests"
- alert: ZerfooOOM
expr: increase(requests_total{status="503"}[5m]) > 0
for: 1m
labels:
severity: critical
annotations:
summary: "Zerfoo returning 503 (possible OOM)"Health Checks#
The Helm chart configures liveness and readiness probes by default:
livenessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 10
periodSeconds: 5Structured Logging#
Zerfoo logs every request with structured fields:
method=POST path=/v1/chat/completions model=gemma-3-1b prompt_tokens=0 completion_tokens=0 latency_ms=142 status_code=200Collect logs via your cluster’s logging stack (Fluentd, Loki, CloudWatch).
Filter on status_code >= 500 for error alerting.
8. Model Management#
Model Format#
Zerfoo uses GGUF as its sole model format. GGUF files are memory-mapped for efficient loading and support quantized formats (Q4_K_M, Q8_0, F16, F32).
Model Loading#
Models are loaded at startup via the model.name Helm value. The server runs
zerfoo serve <model-id> which:
- Resolves the model ID to a GGUF file (local path or HuggingFace download).
- Memory-maps the model weights.
- Builds the computation graph for the model architecture.
- Compiles the graph (with optional CUDA graph capture on GPU).
Persistent Model Storage#
Use a PersistentVolumeClaim to avoid re-downloading models on pod restart:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: zerfoo-models
namespace: zerfoo-system
spec:
accessModes:
- ReadOnlyMany # ReadWriteOnce if single-node
storageClassName: efs-sc # replace with your StorageClass
resources:
requests:
storage: 200GiReference in Helm values:
volumes:
- name: model-storage
persistentVolumeClaim:
claimName: zerfoo-models
volumeMounts:
- name: model-storage
mountPath: /modelsModel Repository#
The serve/repository package implements a local filesystem model repository.
Models are stored as {baseDir}/{modelID}/model.gguf with a metadata.json
sidecar. SHA-256 is computed and stored on upload.
Directory Layout#
/models/
llama-3-7b-q4_k_m/
model.gguf
metadata.json
gemma-3-7b-it-q4_k_m/
model.gguf
metadata.jsonGo API#
import "github.com/zerfoo/zerfoo/serve/repository"
repo, err := repository.NewFileSystemRepository("/models")
if err != nil {
log.Fatal(err)
}
// Upload a model
f, _ := os.Open("gemma-3-7b-it-q4_k_m.gguf")
err = repo.Upload(repository.ModelMetadata{
ID: "gemma-3-7b-it-q4_k_m",
Name: "Gemma 3 7B IT Q4_K_M",
Version: "v1.0",
Format: "gguf",
}, f)
// List models
models, err := repo.List()
// Get model metadata
meta, err := repo.Get("gemma-3-7b-it-q4_k_m")
fmt.Printf("Size: %d bytes, SHA256: %s\n", meta.Size, meta.SHA256)
// Delete a model
err = repo.Delete("gemma-3-7b-it-q4_k_m")Model Pre-population (Init Container)#
initContainers:
- name: model-pull
image: ghcr.io/zerfoo/zerfoo:v1.0.0
command: ["zerfoo", "pull", "google/gemma-3-7b-it-q4_k_m", "--cache-dir", "/models"]
volumeMounts:
- name: models
mountPath: /modelsModel Version Registry#
The serve/registry/ package provides a bbolt-backed model version registry
for tracking, activating, and managing model versions. It supports:
- Registering model versions with metadata and performance metrics
- A/B testing between model versions
- Canary deployments with traffic splitting
- Shadow mode for comparing model outputs without affecting production traffic
Supported Architectures#
| Architecture | Status | Notes |
|---|---|---|
| Llama 3 | Production | RoPE theta=500K |
| Gemma 3 | Production | Tied embeddings, QK norms, logit softcap |
| Mistral | Production | Sliding window attention |
| Qwen 2 | Production | Attention bias, RoPE theta=1M |
| Phi 3/4 | Production | Partial rotary factor |
| DeepSeek V3 | Production | MLA, MoE |
9. Multi-Model Serving with LRU Eviction#
The serve/multimodel package provides a ModelManager that loads multiple models
on-demand within a GPU memory budget, evicting the least-recently-used model when
a new load would exceed the budget.
Architecture#
Request -> ModelManager.Get("model-id")
|
+-- Already loaded? -> promote to MRU, return handle
|
+-- Not loaded:
Evict LRU models until usedBytes + newSize <= MaxGPUMemoryBytes
Load model via ModelLoader.Load()
Track in LRU listConfiguration#
import "github.com/zerfoo/zerfoo/serve/multimodel"
manager, err := multimodel.NewModelManager(loader, multimodel.Config{
MaxGPUMemoryBytes: 40 * 1024 * 1024 * 1024, // 40 GB VRAM budget
PreloadModels: []string{
"gemma-3-7b-it-q4_k_m", // preloaded at startup
"llama-3-1b-q4_k_m",
},
})
if err != nil {
log.Fatal(err)
}
defer manager.Close()| Field | Description |
|---|---|
MaxGPUMemoryBytes | Total VRAM budget. LRU eviction triggers when exceeded. |
PreloadModels | Model IDs loaded eagerly at startup. Any failure aborts init. |
Implementing ModelLoader#
type GGUFLoader struct {
cacheDir string
}
func (l *GGUFLoader) Load(ctx context.Context, modelID string) (io.Closer, int64, error) {
path := filepath.Join(l.cacheDir, modelID, "model.gguf")
info, err := os.Stat(path)
if err != nil {
return nil, 0, err
}
model, err := inference.Load(modelID, inference.WithCacheDir(l.cacheDir))
if err != nil {
return nil, 0, err
}
return model, info.Size(), nil
}Runtime Operations#
// Get a model (loads if not cached, evicts LRU if needed)
model, err := manager.Get(ctx, "deepseek-v3-q4_k_m")
// Explicit eviction
err = manager.Unload("llama-3-1b-q4_k_m")
// Inspect state
loadedIDs := manager.Loaded()
usedBytes := manager.UsedBytes()Kubernetes Multi-Model Deployment#
For deployments serving many models from a single pod, increase the memory budget and mount a larger model store:
resources:
limits:
nvidia.com/gpu: "2"
memory: "128Gi"
env:
- name: ZERFOO_MAX_GPU_MEMORY_GB
value: "80" # 2x A100 40GB10. Performance Tuning#
Quantization#
Choose quantization based on your latency/quality trade-off:
| Quantization | Memory | Quality | Speed |
|---|---|---|---|
| F32 | 4x | Baseline | Slowest |
| F16 | 2x | Near-lossless | Moderate |
| Q8_0 | 1x | Minimal degradation | Fast |
| Q4_K_M | 0.5x | Good quality/size ratio | Fastest |
Set via Helm:
model:
quantization: "Q4_K_M"Batch Scheduling#
Continuous Batching#
The batcher.Scheduler implements continuous batching – variable-length (ragged)
batches with zero padding, immediate eviction of completed sequences, and slot
back-fill without stalling active requests. This typically achieves 2x throughput
over fixed batching.
import "github.com/zerfoo/zerfoo/serve/batcher"
scheduler := batcher.New(
16, // maxBatchSize -- max concurrent active sequences
func(ctx context.Context, batch *batcher.StepBatch) {
// Run one forward pass; append one token to each Slot.GeneratedToks
// and set Slot.Done = true when EOS or max tokens reached.
runForwardPass(ctx, batch)
},
batcher.WithPollInterval(1*time.Millisecond),
)
scheduler.Start()
defer scheduler.Stop()Adaptive Batching#
The adaptive.Batcher dynamically adjusts batch size based on queue depth and
latency EMA (exponential moving average, alpha=0.3):
- Scale up: queue depth >= current batch size AND latency EMA <= target -> double
batch size (capped at
MaxBatchSize). - Scale down: latency EMA > target -> reduce batch size by 25%.
- Hold: otherwise.
import "github.com/zerfoo/zerfoo/serve/adaptive"
batcher := adaptive.New(adaptive.Config{
MinBatchSize: 1,
MaxBatchSize: 32,
TargetLatencyMS: 200.0, // target p50 latency SLO in ms
QueueTimeoutMS: 50.0, // max wait to fill a batch before dispatching
}, handler)
batcher.Start()
defer batcher.Stop()| Field | Default | Description |
|---|---|---|
MinBatchSize | 1 | Minimum batch size |
MaxBatchSize | 32 | Maximum batch size |
TargetLatencyMS | 100 | Latency SLO in ms; controls scale-down |
QueueTimeoutMS | 50 | Max wait time (ms) to fill a batch |
Wiring into the HTTP Server#
import "github.com/zerfoo/zerfoo/serve"
bs := serve.NewBatchScheduler(serve.BatchConfig{
MaxBatchSize: 8,
BatchTimeout: 10 * time.Millisecond,
// Handler is auto-wired to model.GenerateBatch when nil
})
bs.Start()
srv := serve.NewServer(model,
serve.WithBatchScheduler(bs),
serve.WithGPUs([]int{0, 1}),
serve.WithLogger(logger),
serve.WithMetrics(collector),
)CUDA Graph Capture#
On GPU, Zerfoo captures the inference computation graph as a CUDA graph on first execution, then replays it for subsequent requests. This eliminates kernel launch overhead and achieves up to 99.5% instruction coverage on the GGUF inference path.
No configuration is needed – CUDA graph capture is automatic when a GPU is available.
Speculative Decoding#
Enable speculative decoding with a small draft model to increase throughput for large models:
srv := serve.NewServer(targetModel,
serve.WithDraftModel(draftModel),
)Monitor the speculative_acceptance_rate metric to verify the draft model
is effective. Acceptance rates above 70% typically yield significant speedups.
Memory#
Model weights are memory-mapped. Pod RSS will be close to the GGUF file size plus KV cache overhead. Set resource limits accordingly and avoid memory overcommit on GPU nodes.
11. Security Hardening#
Network#
- Terminate TLS 1.3 at ingress or application level – never serve HTTP in production.
- Restrict
GET /metricsto the internal monitoring network (Prometheus scraper IP range), not the public internet. - Restrict
GET /debug/pprof/to internal networks only. - Use mTLS for service-to-service communication (e.g., load balancer -> server, or distributed training gRPC channels).
- Apply Kubernetes
NetworkPolicyto limit pod-to-pod traffic to only required ports (8080 inference, 8081 health).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: zerfoo-inference
namespace: zerfoo-system
spec:
podSelector:
matchLabels:
app: zerfoo
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: ingress-nginx
ports:
- port: 8080
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: monitoring
ports:
- port: 8080 # metrics scrape
- port: 8081 # health probes
egress:
# Allow DNS resolution
- to:
- namespaceSelector: {}
ports:
- protocol: UDP
port: 53
# Allow model downloads (if pulling at startup)
- to:
- ipBlock:
cidr: 0.0.0.0/0
ports:
- protocol: TCP
port: 443Container Hardening#
- Run as a non-root user (UID 1000).
- Set
readOnlyRootFilesystem: true– mount/tmpasemptyDirif needed. - Set
allowPrivilegeEscalation: false. - Drop all Linux capabilities; add only
IPC_LOCKif huge pages are required. - Use a minimal base image (distroless or scratch + binary).
securityContext:
runAsNonRoot: true
runAsUser: 1000
runAsGroup: 1000
readOnlyRootFilesystem: true
allowPrivilegeEscalation: false
capabilities:
drop:
- ALLSecrets Management#
- Store TLS private keys in Kubernetes
Secrets(typekubernetes.io/tls), not in ConfigMaps or container images. - Rotate certificates automatically with cert-manager.
- Use external secret stores (AWS Secrets Manager, Vault) for API keys and credentials; mount via the Secrets Store CSI driver.
- Never log request bodies that may contain sensitive user data.
Store model repository credentials as Kubernetes Secrets:
apiVersion: v1
kind: Secret
metadata:
name: zerfoo-model-credentials
namespace: zerfoo
type: Opaque
stringData:
HF_TOKEN: "hf_xxxxxxxxxxxxxxxxxxxxx"Reference in the Deployment:
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: zerfoo-model-credentials
key: HF_TOKENFor production, use an external secrets operator (e.g., External Secrets Operator, HashiCorp Vault) to inject secrets from your secrets manager.
RBAC#
The Helm chart creates a dedicated ServiceAccount with
automountServiceAccountToken: false by default.
Pod Security#
- Apply a
PodDisruptionBudgetto ensure at least one replica stays available during node maintenance.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: zerfoo-gemma-3-7b
namespace: zerfoo-system
spec:
minAvailable: 1
selector:
matchLabels:
app: zerfoo
model: gemma-3-7b- Enable Kubernetes Audit Logging for all API server requests in the
zerfoo-systemnamespace. - Use
AppArmororSeccompprofiles on production nodes.
Model Integrity#
- Verify the SHA-256 of GGUF files before loading using the checksum stored
in
repository.ModelMetadata.SHA256(computed on upload). - Sign model artifacts and verify signatures in CI before publishing to the model repository.
12. Troubleshooting#
Server Does Not Start#
Symptom: zerfoo serve exits immediately or the readiness probe fails.
Checks:
- Verify the GGUF file exists and is readable:
ls -lh /models/gemma-3-7b-it-q4_k_m/model.gguf - Check the model ID matches the directory name and
metadata.json. - Confirm sufficient RAM/VRAM is available:
free -h nvidia-smi --query-gpu=memory.free --format=csv - Ensure the
videoandrendergroups are assigned (Linux GPU access):groups zerfoo
CUDA / GPU Not Detected#
Symptom: Server starts but runs on CPU; GPU utilization stays at 0%.
Checks:
- Confirm
libcuda.sois on the library path:ldconfig -p | grep libcuda - Verify driver version supports CUDA 12.0+:
nvidia-smi - In Kubernetes, confirm the NVIDIA device plugin is running and the pod has
nvidia.com/gpu: "1"in its resource limits. - Check
CUDA_VISIBLE_DEVICESis not set to an empty string orNoDevFiles.
Out-of-Memory (OOM) Errors#
Symptom: HTTP 503 responses with out of memory in the body.
Remediation:
- Reduce model quantization (e.g., switch from Q8 to Q4_K_M).
- Decrease
MaxBatchSizeon theBatchScheduleroradaptive.Batcher. - Add more GPUs (
--gpus 0,1,2,3). - For multi-model: reduce
MaxGPUMemoryBytesinmultimodel.Configto leave headroom for KV cache.
High Latency / Low Throughput#
Symptom: tokens_per_second gauge is below 50; p99 latency exceeds SLO.
Checks:
- Inspect adaptive batcher state:
log.Printf("batch_size=%d latency_ema=%.1fms", batcher.BatchSize(), batcher.LatencyEMA()) - If
LatencyEMAis high, reduceTargetLatencyMSorMaxBatchSizeto shed load. - Check for CUDA graph capture misses – examine startup logs for graph compilation warnings.
- Profile with pprof:
go tool pprof http://localhost:8081/debug/pprof/profile?seconds=30
Pod Stuck in Pending#
- Check GPU availability:
kubectl describe node <node> | grep nvidia.com/gpu - Verify NVIDIA GPU Operator is running:
kubectl get pods -n gpu-operator - Check PVC binding:
kubectl get pvc -n zerfoo
Pod in CrashLoopBackOff#
- Check logs:
kubectl logs -n zerfoo deploy/zerfoo --previous - Common causes:
- Model not found (invalid
model.name) - Insufficient memory (increase
resources.limits.memory) - GPU out of memory (use a smaller quantization or more GPUs)
- Model not found (invalid
Streaming (SSE) Broken at Proxy#
Symptom: Streaming responses are buffered and delivered all at once.
Fix: Disable proxy buffering. For nginx:
proxy_buffering off;
proxy_cache off;For the Kubernetes ingress-nginx controller:
annotations:
nginx.ingress.kubernetes.io/proxy-buffering: "off"Model Not Found (Multi-Model)#
Symptom: ModelManager.Get() returns load model "foo": ... error.
Checks:
- Confirm the model exists in the repository:
meta, err := repo.Get("foo") - Verify
MaxGPUMemoryBytesis large enough to hold at least one model. - Check
PreloadModelslist – a failed preload abortsNewModelManager.
Certificate / TLS Errors#
Symptom: Clients get certificate signed by unknown authority or handshake failure.
Checks:
- Confirm the CA certificate is correct in the
ClientCAspool (mTLS) or system trust store (one-way TLS). - Verify the certificate’s
dnsNamesincludes the hostname clients connect to. - Check certificate expiry:
openssl s_client -connect api.example.com:443 < /dev/null 2>/dev/null \ | openssl x509 -noout -dates - With cert-manager, inspect
Certificatestatus:kubectl describe certificate zerfoo-tls -n zerfoo-system
Graceful Shutdown Hangs#
Symptom: Pod takes longer than terminationGracePeriodSeconds to stop.
Checks:
- Increase the Kubernetes
terminationGracePeriodSeconds(default 30 s) to accommodate the longest expected in-flight request. - Ensure the
shutdown.Coordinatoris registered with both the HTTP server and theBatchScheduler. - Check for goroutine leaks with pprof:
curl http://localhost:8081/debug/pprof/goroutine?debug=1
Debug Logging#
Set the log level via environment variable:
env:
- name: ZERFOO_LOG_LEVEL
value: "debug"Useful kubectl Commands#
# Check pod status and events
kubectl get pods -n zerfoo -o wide
kubectl describe pod -n zerfoo <pod-name>
# Stream logs
kubectl logs -n zerfoo deploy/zerfoo -f
# Check metrics endpoint
kubectl port-forward -n zerfoo svc/zerfoo 8080:80
curl localhost:8080/metrics
# Check model info
curl localhost:8080/v1/models
# Test inference
curl localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "gemma-3-1b", "messages": [{"role": "user", "content": "Hello"}]}'
# Check HPA status
kubectl get hpa -n zerfoo