Granite Guardian#
Granite Guardian is an AI safety and content moderation system built on IBM’s Granite Guardian model family. It evaluates text for safety risks across 13 predefined categories, covering harmful user messages, problematic assistant responses, and RAG pipeline quality.
Model Variants#
| Model | Parameters | Notes |
|---|---|---|
| Granite Guardian 3.0 | 2B, 8B | Plain Yes/No output, logprob-based confidence |
| Granite Guardian 3.2 | 5B | Yes/No with <confidence>High/Low</confidence> tags |
| Granite Guardian 3.3 | 8B | Optional <think> reasoning traces, <score> tags |
All variants use GGUF format and are loaded through the standard inference.LoadFile pipeline.
Risk Categories#
Guardian evaluates content against 13 risk categories organized into three groups.
Harm Categories (9)#
These categories evaluate user messages and assistant responses for harmful content:
| Category | Description |
|---|---|
| harm | Harmful, offensive, or inappropriate content |
| social_bias | Prejudice based on race, gender, religion, or other protected characteristics |
| jailbreaking | Attempts to bypass AI safety guidelines or manipulate the system |
| violence | Promotion, glorification, or incitement of violence or physical harm |
| profanity | Vulgar language, obscenities, or crude expressions |
| sexual_content | Sexually explicit content or sexualized references |
| unethical_behavior | Instructions or encouragement for fraud, deception, or manipulation |
| harm_engagement | Assistant responses that engage with harmful content instead of refusing |
| evasiveness | Assistant responses that unnecessarily refuse legitimate questions |
RAG Categories (3)#
These categories evaluate retrieval-augmented generation pipeline quality:
| Category | Description |
|---|---|
| context_relevance | Whether the retrieved context is relevant to the user’s question |
| groundedness | Whether the assistant’s response is supported by the provided context |
| answer_relevance | Whether the assistant’s response addresses the user’s question |
Function Calling (1)#
| Category | Description |
|---|---|
| function_call_hallucination | Whether the assistant invoked a function that does not exist or used incorrect parameters |
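The three groups above can be captured in a small lookup table, which is handy for rejecting misspelled risk names before they reach the model. This is a minimal sketch: the group labels and the helper names `UnknownRisks` and `CountByGroup` are illustrative assumptions and are not part of the guardian package.

```go
package main

import (
	"fmt"
	"sort"
)

// Group labels are illustrative; they are not identifiers from the guardian package.
const (
	GroupHarm         = "harm"
	GroupRAG          = "rag"
	GroupFunctionCall = "function_calling"
)

// riskGroups maps each documented risk category to its group.
var riskGroups = map[string]string{
	"harm":                        GroupHarm,
	"social_bias":                 GroupHarm,
	"jailbreaking":                GroupHarm,
	"violence":                    GroupHarm,
	"profanity":                   GroupHarm,
	"sexual_content":              GroupHarm,
	"unethical_behavior":          GroupHarm,
	"harm_engagement":             GroupHarm,
	"evasiveness":                 GroupHarm,
	"context_relevance":           GroupRAG,
	"groundedness":                GroupRAG,
	"answer_relevance":            GroupRAG,
	"function_call_hallucination": GroupFunctionCall,
}

// UnknownRisks returns the names in risks that are not documented categories.
func UnknownRisks(risks []string) []string {
	var unknown []string
	for _, r := range risks {
		if _, ok := riskGroups[r]; !ok {
			unknown = append(unknown, r)
		}
	}
	return unknown
}

// CountByGroup tallies how many categories each group contains.
func CountByGroup() map[string]int {
	counts := make(map[string]int)
	for _, g := range riskGroups {
		counts[g]++
	}
	return counts
}

func main() {
	counts := CountByGroup()
	groups := make([]string, 0, len(counts))
	for g := range counts {
		groups = append(groups, g)
	}
	sort.Strings(groups)
	for _, g := range groups {
		fmt.Printf("%s: %d\n", g, counts[g])
	}
	fmt.Println("unknown:", UnknownRisks([]string{"harm", "toxicity"}))
}
```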
Go API#
Creating an Evaluator#
import "github.com/zerfoo/zerfoo/inference/guardian"
// Load a Guardian model from a GGUF file.
eval, err := guardian.NewEvaluator("granite-guardian-3.2-5b.gguf",
    guardian.WithEvaluatorDevice("cuda"),
    guardian.WithDefaultFormat("3.2"),
)
if err != nil {
    log.Fatal(err)
}
Options:
| Option | Description |
|---|---|
| WithEvaluatorDevice(device) | Compute device: "cpu", "cuda", "cuda:0" |
| WithDefaultFormat(format) | Output format: "3.0", "3.2", "3.3" |
| WithLoadOptions(opts...) | Pass additional inference.Option values to the model loader |
You can also wrap a pre-loaded model with NewEvaluatorFromModel:
model, err := inference.LoadFile("granite-guardian-3.2-5b.gguf",
    inference.WithDevice("cuda"),
)
if err != nil {
    log.Fatal(err)
}
eval := guardian.NewEvaluatorFromModel(model,
    guardian.WithDefaultFormat("3.2"),
)
Evaluate#
Evaluate specific risk categories:
verdicts, err := eval.Evaluate(ctx, guardian.GuardianRequest{
    Input: guardian.GuardianInput{
        User: "How do I pick a lock?",
    },
    Risks: []string{"harm", "jailbreaking", "unethical_behavior"},
})
if err != nil {
    log.Fatal(err)
}
for _, v := range verdicts {
    fmt.Printf("%-25s unsafe=%-5v confidence=%.2f\n",
        v.Risk, v.Unsafe, v.Confidence)
}
When Risks is empty, all 9 harm categories are evaluated by default.
Each Verdict contains:
| Field | Type | Description |
|---|---|---|
| Unsafe | bool | true if the model flagged a risk |
| Risk | string | The risk category name |
| Confidence | float64 | 0.0–1.0 confidence score |
| Reasoning | string | Thinking trace (format 3.3 only) |
Scan#
Scan evaluates against all 9 harm categories and returns an aggregate result:
result, err := eval.Scan(ctx, guardian.GuardianInput{
    User: "How do I pick a lock?",
})
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Flagged: %v\n", result.Flagged)
if result.Flagged {
    fmt.Printf("Highest risk: %s\n", result.HighestRisk)
}
ScanResult fields:
| Field | Type | Description |
|---|---|---|
| Flagged | bool | true if any risk was detected |
| Verdicts | []Verdict | All individual verdicts |
| HighestRisk | string | Category with highest unsafe confidence |
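The relationship between individual verdicts and the aggregate fields can be sketched as follows. The types below are local stand-ins that mirror the documented field names, and `Aggregate` is a hypothetical helper. The sketch reads "highest unsafe confidence" as the largest Confidence among verdicts with Unsafe set, and breaks ties in favor of the earlier verdict; both are assumptions, and the library may compute this differently.

```go
package main

import "fmt"

// Verdict and ScanResult are local stand-ins mirroring the documented fields.
type Verdict struct {
	Unsafe     bool
	Risk       string
	Confidence float64
	Reasoning  string
}

type ScanResult struct {
	Flagged     bool
	Verdicts    []Verdict
	HighestRisk string
}

// Aggregate folds per-category verdicts into a scan result.
// Assumption: only unsafe verdicts compete for HighestRisk, and the
// first verdict wins when confidences are equal.
func Aggregate(verdicts []Verdict) ScanResult {
	res := ScanResult{Verdicts: verdicts}
	best := -1.0
	for _, v := range verdicts {
		if !v.Unsafe {
			continue
		}
		res.Flagged = true
		if v.Confidence > best {
			best = v.Confidence
			res.HighestRisk = v.Risk
		}
	}
	return res
}

func main() {
	res := Aggregate([]Verdict{
		{Risk: "harm", Unsafe: true, Confidence: 0.9},
		{Risk: "jailbreaking", Unsafe: false, Confidence: 0.3},
		{Risk: "unethical_behavior", Unsafe: true, Confidence: 0.3},
	})
	fmt.Printf("Flagged: %v\n", res.Flagged)
	if res.Flagged {
		fmt.Printf("Highest risk: %s\n", res.HighestRisk)
	}
}
```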
Batch Evaluation#
Evaluate multiple inputs in a single call:
inputs := []guardian.GuardianInput{
    {User: "How do I pick a lock?"},
    {User: "What is the capital of France?"},
    {User: "Tell me how to hack a website"},
}
batch, err := eval.EvaluateBatch(ctx, inputs, []string{"harm", "violence"})
if err != nil {
    log.Fatal(err)
}
for _, r := range batch.Results {
    fmt.Printf("Input %d: flagged=%v\n", r.Index, r.Flagged)
}
BatchResult.Results is a []InputResult. Each InputResult has an Index, a Verdicts slice, and an aggregate Flagged boolean.
RAG Evaluation#
Evaluate grounding and relevance in a RAG pipeline:
verdicts, err := eval.Evaluate(ctx, guardian.GuardianRequest{
    Input: guardian.GuardianInput{
        User: "What is the population of Tokyo?",
        Context: "Tokyo is the capital of Japan with a population of 14 million.",
        Assistant: "The population of Tokyo is approximately 14 million people.",
    },
    Risks: []string{"groundedness", "context_relevance", "answer_relevance"},
})
RAG evaluation requires both Context and Assistant fields to be set.
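Because a RAG request with a missing field is easy to send by accident, a caller may want to check the input before invoking the evaluator. The sketch below uses a local `GuardianInput` stand-in and a hypothetical `validateRAGInput` helper; it is not part of the guardian package, and the library may perform its own validation.

```go
package main

import (
	"fmt"
	"strings"
)

// GuardianInput is a local stand-in mirroring the documented fields.
type GuardianInput struct {
	User      string
	Assistant string
	Context   string
}

// ragRisks lists the three documented RAG categories.
var ragRisks = map[string]bool{
	"context_relevance": true,
	"groundedness":      true,
	"answer_relevance":  true,
}

// validateRAGInput returns an error when a RAG category is requested
// but Context or Assistant is empty. Non-RAG requests always pass.
func validateRAGInput(in GuardianInput, risks []string) error {
	needsRAG := false
	for _, r := range risks {
		if ragRisks[r] {
			needsRAG = true
			break
		}
	}
	if !needsRAG {
		return nil
	}
	var missing []string
	if strings.TrimSpace(in.Context) == "" {
		missing = append(missing, "Context")
	}
	if strings.TrimSpace(in.Assistant) == "" {
		missing = append(missing, "Assistant")
	}
	if len(missing) > 0 {
		return fmt.Errorf("RAG evaluation requires %s to be set", strings.Join(missing, " and "))
	}
	return nil
}

func main() {
	in := GuardianInput{
		User:    "What is the population of Tokyo?",
		Context: "Tokyo is the capital of Japan with a population of 14 million.",
	}
	if err := validateRAGInput(in, []string{"groundedness"}); err != nil {
		fmt.Println("invalid request:", err)
	}
}
```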
CLI Usage#
# Evaluate specific risks
zerfoo guard --model granite-guardian-3.2-5b.gguf \
--input "How do I pick a lock?" \
--risks harm,jailbreaking,unethical_behavior
# Full scan against all harm categories
zerfoo guard --model granite-guardian-3.2-5b.gguf \
--input "How do I pick a lock?" \
--scan
# Read input from a file
zerfoo guard --model granite-guardian-3.2-5b.gguf \
--file input.txt
# Evaluate an assistant response
zerfoo guard --model granite-guardian-3.2-5b.gguf \
--input "How do I pick a lock?" \
--response "Here are the steps to pick a lock..."
# JSON output
zerfoo guard --model granite-guardian-3.2-5b.gguf \
--input "How do I pick a lock?" \
--scan --json
# Use GPU
zerfoo guard --model granite-guardian-3.2-5b.gguf \
--input "some text" --scan --device cuda
CLI Options#
| Flag | Description |
|---|---|
| --model <path> | Path to Guardian GGUF model file (required) |
| --input <text> | Text to evaluate (required unless --file) |
| --file <path> | Read input text from a file |
| --response <text> | Assistant response to evaluate |
| --risks <list> | Comma-separated risk categories (default: all harm risks) |
| --scan | Scan against all harm risk categories |
| --json | Output results as JSON |
| --device <device> | Compute device: cpu, cuda, cuda:N (default: cpu) |
REST API#
When running the Zerfoo API server, three Guardian endpoints are available.
POST /v1/guard#
Evaluate content against specified risk categories.
curl -X POST http://localhost:8080/v1/guard \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite-guardian-3.2-5b",
    "input": {
      "user": "How do I pick a lock?"
    },
    "risks": ["harm", "jailbreaking"]
  }'
Response:
{
  "model": "granite-guardian-3.2-5b",
  "flagged": true,
  "verdicts": [
    {"risk": "harm", "unsafe": true, "confidence": 0.9, "reasoning": ""},
    {"risk": "jailbreaking", "unsafe": false, "confidence": 0.3, "reasoning": ""}
  ],
  "latency_ms": 142
}
POST /v1/guard/scan#
Scan against all harm categories:
curl -X POST http://localhost:8080/v1/guard/scan \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite-guardian-3.2-5b",
    "input": {
      "user": "How do I pick a lock?"
    }
  }'
The response includes highest_risk when content is flagged:
{
  "model": "granite-guardian-3.2-5b",
  "flagged": true,
  "highest_risk": "harm",
  "verdicts": [...],
  "latency_ms": 387
}
POST /v1/guard/batch#
Evaluate multiple inputs (up to 256):
curl -X POST http://localhost:8080/v1/guard/batch \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite-guardian-3.2-5b",
    "inputs": [
      {"user": "How do I pick a lock?"},
      {"user": "What is the capital of France?"}
    ],
    "risks": ["harm"]
  }'
Response:
{
  "model": "granite-guardian-3.2-5b",
  "results": [
    {"index": 0, "flagged": true, "verdicts": [...]},
    {"index": 1, "flagged": false, "verdicts": [...]}
  ],
  "latency_ms": 256
}
Request Fields#
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model identifier |
| input | object | Yes | Content to evaluate (for /v1/guard/batch, send an inputs array of such objects instead) |
| input.user | string | Yes | User message |
| input.assistant | string | No | Assistant response |
| input.context | string | No | RAG context document |
| risks | string[] | No | Risk categories (default: all harm risks) |
| format | string | No | Output format: "3.0", "3.2", "3.3" |
| think | bool | No | Enable thinking mode (3.3 only) |
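For callers that prefer Go over curl, the request and response shapes above map onto plain structs. The following is a minimal sketch of a client for POST /v1/guard using only the standard library. The struct and function names are illustrative assumptions, and the stub server in `demo` simply replays the documented example response so the sketch runs without a Zerfoo server.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Request and response shapes follow the documented JSON fields.
type GuardInput struct {
	User      string `json:"user"`
	Assistant string `json:"assistant,omitempty"`
	Context   string `json:"context,omitempty"`
}

type GuardRequest struct {
	Model  string     `json:"model"`
	Input  GuardInput `json:"input"`
	Risks  []string   `json:"risks,omitempty"`
	Format string     `json:"format,omitempty"`
	Think  bool       `json:"think,omitempty"`
}

type GuardVerdict struct {
	Risk       string  `json:"risk"`
	Unsafe     bool    `json:"unsafe"`
	Confidence float64 `json:"confidence"`
	Reasoning  string  `json:"reasoning"`
}

type GuardResponse struct {
	Model     string         `json:"model"`
	Flagged   bool           `json:"flagged"`
	Verdicts  []GuardVerdict `json:"verdicts"`
	LatencyMS int64          `json:"latency_ms"`
}

// Guard posts req to baseURL + "/v1/guard" and decodes the JSON reply.
func Guard(ctx context.Context, client *http.Client, baseURL string, req GuardRequest) (*GuardResponse, error) {
	body, err := json.Marshal(req)
	if err != nil {
		return nil, fmt.Errorf("encode request: %w", err)
	}
	httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, baseURL+"/v1/guard", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	httpReq.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(httpReq)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("guard: unexpected status %s", resp.Status)
	}
	var out GuardResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, fmt.Errorf("decode response: %w", err)
	}
	return &out, nil
}

// demo runs Guard against a stub server that replays the documented response.
func demo() (*GuardResponse, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/v1/guard" {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"model":"granite-guardian-3.2-5b","flagged":true,"verdicts":[`+
			`{"risk":"harm","unsafe":true,"confidence":0.9,"reasoning":""},`+
			`{"risk":"jailbreaking","unsafe":false,"confidence":0.3,"reasoning":""}],"latency_ms":142}`)
	}))
	defer srv.Close()
	return Guard(context.Background(), srv.Client(), srv.URL, GuardRequest{
		Model: "granite-guardian-3.2-5b",
		Input: GuardInput{User: "How do I pick a lock?"},
		Risks: []string{"harm", "jailbreaking"},
	})
}

func main() {
	resp, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("flagged=%v verdicts=%d\n", resp.Flagged, len(resp.Verdicts))
}
```

Against a real server, replace the stub with the server's base URL, for example http://localhost:8080 as in the curl examples.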
Guardrails Middleware#
To add safety guardrails to the chat completions endpoint, configure the server with a Guardian evaluator:
import (
    "log"
    "github.com/zerfoo/zerfoo/inference/guardian"
    "github.com/zerfoo/zerfoo/serve"
)
eval, err := guardian.NewEvaluator("granite-guardian-3.2-5b.gguf",
    guardian.WithEvaluatorDevice("cuda"),
)
if err != nil {
    log.Fatal(err)
}
srv := serve.NewServer(
    serve.WithGuardEvaluator(eval),
    // ... other options
)
When a Guardian evaluator is configured, the server exposes the /v1/guard, /v1/guard/scan, and /v1/guard/batch endpoints. You can also build pre-request and post-response guardrail pipelines by calling Scan before and after chat completions in your application code.
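A pre-request and post-response pipeline of the kind described above might look like the sketch below. Everything here is an assumption made for illustration: the `Scanner` interface, the `ChatFunc` type, `ErrBlocked`, and the keyword-based fake scanner are hypothetical, and the exact return type of the real Scan method may require a thin adapter to satisfy the interface.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

// Local stand-ins mirroring the documented field names.
type GuardianInput struct {
	User      string
	Assistant string
}

type ScanResult struct {
	Flagged     bool
	HighestRisk string
}

// Scanner is the slice of evaluator behavior this sketch depends on (hypothetical).
type Scanner interface {
	Scan(ctx context.Context, in GuardianInput) (ScanResult, error)
}

// ChatFunc stands in for a chat completion call (hypothetical).
type ChatFunc func(ctx context.Context, prompt string) (string, error)

// ErrBlocked marks requests or responses rejected by a guardrail.
var ErrBlocked = errors.New("blocked by guardrail")

// GuardedChat scans the prompt, calls the model, then scans the reply.
func GuardedChat(ctx context.Context, s Scanner, chat ChatFunc, prompt string) (string, error) {
	pre, err := s.Scan(ctx, GuardianInput{User: prompt})
	if err != nil {
		return "", fmt.Errorf("pre-request scan: %w", err)
	}
	if pre.Flagged {
		return "", fmt.Errorf("%w: prompt flagged for %s", ErrBlocked, pre.HighestRisk)
	}
	reply, err := chat(ctx, prompt)
	if err != nil {
		return "", fmt.Errorf("chat completion: %w", err)
	}
	post, err := s.Scan(ctx, GuardianInput{User: prompt, Assistant: reply})
	if err != nil {
		return "", fmt.Errorf("post-response scan: %w", err)
	}
	if post.Flagged {
		return "", fmt.Errorf("%w: response flagged for %s", ErrBlocked, post.HighestRisk)
	}
	return reply, nil
}

// keywordScanner is a fake used only to make the sketch runnable.
type keywordScanner struct{ keyword string }

func (k keywordScanner) Scan(_ context.Context, in GuardianInput) (ScanResult, error) {
	text := strings.ToLower(in.User + " " + in.Assistant)
	if strings.Contains(text, k.keyword) {
		return ScanResult{Flagged: true, HighestRisk: "harm"}, nil
	}
	return ScanResult{}, nil
}

// isBlocked reports whether err originated from a guardrail rejection.
func isBlocked(err error) bool { return errors.Is(err, ErrBlocked) }

// demo wires the fake scanner to an echo-style chat function.
func demo(prompt string) (string, error) {
	chat := func(_ context.Context, p string) (string, error) {
		return "You asked: " + p, nil
	}
	return GuardedChat(context.Background(), keywordScanner{keyword: "lock"}, chat, prompt)
}

func main() {
	for _, p := range []string{"What is the capital of France?", "How do I pick a lock?"} {
		reply, err := demo(p)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("reply:", reply)
	}
}
```

Scanning the reply together with the original prompt gives the evaluator the context it needs for response-side categories such as harm_engagement.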
Prometheus Metrics#
When Guardian is enabled, the server exports:
| Metric | Type | Description |
|---|---|---|
| guard_requests_total | Counter | Total guard evaluation requests |
| guard_latency_ms | Histogram | Guard evaluation latency in milliseconds |
Output Format Versions#
| Format | Output Style | Confidence Source |
|---|---|---|
| 3.0 | Plain Yes / No | Softmax over first two logprobs |
| 3.2 | Yes / No + <confidence>High/Low</confidence> | High = 0.9, Low = 0.3 |
| 3.3 | Optional <think>...</think> + <score>yes/no</score> | 1.0 (deterministic from score tag) |
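The table above implies a different parsing rule for each format. The sketch below shows one way those rules could be written; it is not the library's parser. It assumes that Yes means a risk was detected, that the 3.0 logprobs are those of the Yes and No tokens, and that the reported confidence is the probability of the emitted answer. The function names and the error behavior on missing tags are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strings"
)

// Verdict is a local stand-in holding the fields a parser would fill.
type Verdict struct {
	Unsafe     bool
	Confidence float64
	Reasoning  string
}

var (
	confidenceTag = regexp.MustCompile(`(?i)<confidence>\s*(high|low)\s*</confidence>`)
	scoreTag      = regexp.MustCompile(`(?i)<score>\s*(yes|no)\s*</score>`)
	thinkTag      = regexp.MustCompile(`(?is)<think>(.*?)</think>`)
)

// isYes reports whether the output starts with "yes" (assumed to mean unsafe).
func isYes(text string) bool {
	return strings.HasPrefix(strings.ToLower(strings.TrimSpace(text)), "yes")
}

// Parse30 handles format 3.0: a two-way softmax over the Yes and No logprobs.
func Parse30(text string, yesLogprob, noLogprob float64) Verdict {
	m := math.Max(yesLogprob, noLogprob) // subtract the max for numerical stability
	ey, en := math.Exp(yesLogprob-m), math.Exp(noLogprob-m)
	pYes := ey / (ey + en)
	if isYes(text) {
		return Verdict{Unsafe: true, Confidence: pYes}
	}
	return Verdict{Unsafe: false, Confidence: 1 - pYes}
}

// Parse32 handles format 3.2: High maps to 0.9 and Low maps to 0.3.
func Parse32(text string) (Verdict, error) {
	m := confidenceTag.FindStringSubmatch(text)
	if m == nil {
		return Verdict{}, fmt.Errorf("no <confidence> tag in %q", text)
	}
	conf := 0.3
	if strings.EqualFold(m[1], "high") {
		conf = 0.9
	}
	return Verdict{Unsafe: isYes(text), Confidence: conf}, nil
}

// Parse33 handles format 3.3: the score tag decides, confidence is fixed at 1.0.
func Parse33(text string) (Verdict, error) {
	s := scoreTag.FindStringSubmatch(text)
	if s == nil {
		return Verdict{}, fmt.Errorf("no <score> tag in %q", text)
	}
	v := Verdict{Unsafe: strings.EqualFold(s[1], "yes"), Confidence: 1.0}
	if t := thinkTag.FindStringSubmatch(text); t != nil {
		v.Reasoning = strings.TrimSpace(t[1])
	}
	return v, nil
}

func main() {
	fmt.Printf("%+v\n", Parse30("Yes", -0.2, -1.7))
	if v, err := Parse32("Yes\n<confidence>High</confidence>"); err == nil {
		fmt.Printf("%+v\n", v)
	}
	if v, err := Parse33("<think>May be a locksmith.</think>\n<score>no</score>"); err == nil {
		fmt.Printf("%+v\n", v)
	}
}
```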