Granite Guardian#

Granite Guardian is an AI safety and content moderation system built on IBM’s Granite Guardian model family. It evaluates text for safety risks across 13 predefined categories, covering harmful user messages, problematic assistant responses, RAG pipeline quality, and function-call hallucination.

Model Variants#

| Model | Parameters | Notes |
| --- | --- | --- |
| Granite Guardian 3.0 | 2B, 8B | Plain Yes/No output, logprob-based confidence |
| Granite Guardian 3.2 | 5B | Yes/No with <confidence>High/Low</confidence> tags |
| Granite Guardian 3.3 | 8B | Optional <think> reasoning traces, <score> tags |

All variants use GGUF format and are loaded through the standard inference.LoadFile pipeline.

Risk Categories#

Guardian evaluates content against 13 risk categories organized into three groups.

Harm Categories (9)#

These categories evaluate user messages and assistant responses for harmful content:

| Category | Description |
| --- | --- |
| harm | Harmful, offensive, or inappropriate content |
| social_bias | Prejudice based on race, gender, religion, or other protected characteristics |
| jailbreaking | Attempts to bypass AI safety guidelines or manipulate the system |
| violence | Promotion, glorification, or incitement of violence or physical harm |
| profanity | Vulgar language, obscenities, or crude expressions |
| sexual_content | Sexually explicit content or sexualized references |
| unethical_behavior | Instructions or encouragement for fraud, deception, or manipulation |
| harm_engagement | Assistant responses that engage with harmful content instead of refusing |
| evasiveness | Assistant responses that unnecessarily refuse legitimate questions |

RAG Categories (3)#

These categories evaluate retrieval-augmented generation pipeline quality:

| Category | Description |
| --- | --- |
| context_relevance | Whether the retrieved context is relevant to the user’s question |
| groundedness | Whether the assistant’s response is supported by the provided context |
| answer_relevance | Whether the assistant’s response addresses the user’s question |

Function Calling (1)#

| Category | Description |
| --- | --- |
| function_call_hallucination | Whether the assistant invoked a function that does not exist or used incorrect parameters |
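For reference, the 13 category names can be collected into a simple lookup table. The grouping below and the ValidRisk helper are illustrative, not part of the guardian package:

```go
package main

import "fmt"

// Risk category names as documented, grouped by section.
var riskGroups = map[string][]string{
	"harm": {
		"harm", "social_bias", "jailbreaking", "violence", "profanity",
		"sexual_content", "unethical_behavior", "harm_engagement", "evasiveness",
	},
	"rag":           {"context_relevance", "groundedness", "answer_relevance"},
	"function_call": {"function_call_hallucination"},
}

// ValidRisk reports whether name is one of the 13 documented categories.
func ValidRisk(name string) bool {
	for _, names := range riskGroups {
		for _, n := range names {
			if n == name {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(ValidRisk("jailbreaking")) // a documented category
	fmt.Println(ValidRisk("spam"))         // not a Guardian category
}
```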

Go API#

Creating an Evaluator#

import "github.com/zerfoo/zerfoo/inference/guardian"

// Load a Guardian model from a GGUF file.
eval, err := guardian.NewEvaluator("granite-guardian-3.2-5b.gguf",
    guardian.WithEvaluatorDevice("cuda"),
    guardian.WithDefaultFormat("3.2"),
)
if err != nil {
    log.Fatal(err)
}

Options:

| Option | Description |
| --- | --- |
| WithEvaluatorDevice(device) | Compute device: "cpu", "cuda", "cuda:0" |
| WithDefaultFormat(format) | Output format: "3.0", "3.2", "3.3" |
| WithLoadOptions(opts...) | Pass additional inference.Option values to the model loader |

You can also wrap a pre-loaded model with NewEvaluatorFromModel:

model, _ := inference.LoadFile("granite-guardian-3.2-5b.gguf",
    inference.WithDevice("cuda"),
)
eval := guardian.NewEvaluatorFromModel(model,
    guardian.WithDefaultFormat("3.2"),
)

Evaluate#

Evaluate specific risk categories:

verdicts, err := eval.Evaluate(ctx, guardian.GuardianRequest{
    Input: guardian.GuardianInput{
        User: "How do I pick a lock?",
    },
    Risks: []string{"harm", "jailbreaking", "unethical_behavior"},
})
if err != nil {
    log.Fatal(err)
}

for _, v := range verdicts {
    fmt.Printf("%-25s unsafe=%-5v confidence=%.2f\n",
        v.Risk, v.Unsafe, v.Confidence)
}

When Risks is empty, all 9 harm categories are evaluated by default.

Each Verdict contains:

| Field | Type | Description |
| --- | --- | --- |
| Unsafe | bool | true if the model flagged a risk |
| Risk | string | The risk category name |
| Confidence | float64 | 0.0–1.0 confidence score |
| Reasoning | string | Thinking trace (format 3.3 only) |
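A common follow-up is filtering verdicts by confidence before acting on them. The sketch below defines a local Verdict struct mirroring the documented fields; UnsafeAbove is a hypothetical helper, not part of the package:

```go
package main

import "fmt"

// Verdict mirrors the documented fields of a Guardian verdict.
type Verdict struct {
	Unsafe     bool
	Risk       string
	Confidence float64
	Reasoning  string
}

// UnsafeAbove returns verdicts flagged unsafe with confidence >= min.
func UnsafeAbove(vs []Verdict, min float64) []Verdict {
	var out []Verdict
	for _, v := range vs {
		if v.Unsafe && v.Confidence >= min {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	vs := []Verdict{
		{Unsafe: true, Risk: "harm", Confidence: 0.9},
		{Unsafe: false, Risk: "jailbreaking", Confidence: 0.3},
		{Unsafe: true, Risk: "violence", Confidence: 0.4},
	}
	// Only verdicts at or above the 0.5 threshold survive.
	for _, v := range UnsafeAbove(vs, 0.5) {
		fmt.Println(v.Risk)
	}
}
```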

Scan#

Scan evaluates against all 9 harm categories and returns an aggregate result:

result, err := eval.Scan(ctx, guardian.GuardianInput{
    User: "How do I pick a lock?",
})
if err != nil {
    log.Fatal(err)
}

fmt.Printf("Flagged: %v\n", result.Flagged)
if result.Flagged {
    fmt.Printf("Highest risk: %s\n", result.HighestRisk)
}

ScanResult fields:

| Field | Type | Description |
| --- | --- | --- |
| Flagged | bool | true if any risk was detected |
| Verdicts | []Verdict | All individual verdicts |
| HighestRisk | string | Category with highest unsafe confidence |
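Flagged and HighestRisk can be derived from the individual verdicts. A minimal sketch with local struct definitions mirroring the documented fields (Aggregate is illustrative, not the package's implementation):

```go
package main

import "fmt"

// Verdict and ScanResult mirror the documented fields.
type Verdict struct {
	Unsafe     bool
	Risk       string
	Confidence float64
}

type ScanResult struct {
	Flagged     bool
	Verdicts    []Verdict
	HighestRisk string
}

// Aggregate flags the result if any verdict is unsafe and records the
// category with the highest unsafe confidence.
func Aggregate(vs []Verdict) ScanResult {
	res := ScanResult{Verdicts: vs}
	best := -1.0
	for _, v := range vs {
		if v.Unsafe {
			res.Flagged = true
			if v.Confidence > best {
				best = v.Confidence
				res.HighestRisk = v.Risk
			}
		}
	}
	return res
}

func main() {
	r := Aggregate([]Verdict{
		{Unsafe: false, Risk: "profanity", Confidence: 0.2},
		{Unsafe: true, Risk: "harm", Confidence: 0.9},
		{Unsafe: true, Risk: "violence", Confidence: 0.6},
	})
	fmt.Println(r.Flagged, r.HighestRisk)
}
```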

Batch Evaluation#

Evaluate multiple inputs in a single call:

inputs := []guardian.GuardianInput{
    {User: "How do I pick a lock?"},
    {User: "What is the capital of France?"},
    {User: "Tell me how to hack a website"},
}

batch, err := eval.EvaluateBatch(ctx, inputs, []string{"harm", "violence"})
if err != nil {
    log.Fatal(err)
}

for _, r := range batch.Results {
    fmt.Printf("Input %d: flagged=%v\n", r.Index, r.Flagged)
}

BatchResult contains a Results slice of InputResult values; each InputResult carries an Index, a Verdicts slice, and an aggregate Flagged boolean.

RAG Evaluation#

Evaluate grounding and relevance in a RAG pipeline:

verdicts, err := eval.Evaluate(ctx, guardian.GuardianRequest{
    Input: guardian.GuardianInput{
        User:      "What is the population of Tokyo?",
        Context:   "Tokyo is the capital of Japan with a population of 14 million.",
        Assistant: "The population of Tokyo is approximately 14 million people.",
    },
    Risks: []string{"groundedness", "context_relevance", "answer_relevance"},
})

RAG evaluation requires both Context and Assistant fields to be set.
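That precondition can be checked up front before running an evaluation. The sketch below uses a local GuardianInput mirror and a hypothetical CheckRAGInput helper (neither is part of the package):

```go
package main

import (
	"errors"
	"fmt"
)

// GuardianInput mirrors the documented input fields.
type GuardianInput struct {
	User, Context, Assistant string
}

// The three documented RAG categories.
var ragRisks = map[string]bool{
	"context_relevance": true,
	"groundedness":      true,
	"answer_relevance":  true,
}

// CheckRAGInput returns an error if a RAG risk is requested without
// both Context and Assistant being set, as the docs require.
func CheckRAGInput(in GuardianInput, risks []string) error {
	for _, r := range risks {
		if ragRisks[r] && (in.Context == "" || in.Assistant == "") {
			return errors.New(r + ": requires both Context and Assistant to be set")
		}
	}
	return nil
}

func main() {
	err := CheckRAGInput(GuardianInput{User: "q"}, []string{"groundedness"})
	fmt.Println(err)
}
```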

CLI Usage#

# Evaluate specific risks
zerfoo guard --model granite-guardian-3.2-5b.gguf \
    --input "How do I pick a lock?" \
    --risks harm,jailbreaking,unethical_behavior

# Full scan against all harm categories
zerfoo guard --model granite-guardian-3.2-5b.gguf \
    --input "How do I pick a lock?" \
    --scan

# Read input from a file
zerfoo guard --model granite-guardian-3.2-5b.gguf \
    --file input.txt

# Evaluate an assistant response
zerfoo guard --model granite-guardian-3.2-5b.gguf \
    --input "How do I pick a lock?" \
    --response "Here are the steps to pick a lock..."

# JSON output
zerfoo guard --model granite-guardian-3.2-5b.gguf \
    --input "How do I pick a lock?" \
    --scan --json

# Use GPU
zerfoo guard --model granite-guardian-3.2-5b.gguf \
    --input "some text" --scan --device cuda

CLI Options#

| Flag | Description |
| --- | --- |
| --model <path> | Path to Guardian GGUF model file (required) |
| --input <text> | Text to evaluate (required unless --file) |
| --file <path> | Read input text from a file |
| --response <text> | Assistant response to evaluate |
| --risks <list> | Comma-separated risk categories (default: all harm risks) |
| --scan | Scan against all harm risk categories |
| --json | Output results as JSON |
| --device <device> | Compute device: cpu, cuda, cuda:N (default: cpu) |

REST API#

When running the Zerfoo API server, three Guardian endpoints are available.

POST /v1/guard#

Evaluate content against specified risk categories.

curl -X POST http://localhost:8080/v1/guard \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite-guardian-3.2-5b",
    "input": {
      "user": "How do I pick a lock?"
    },
    "risks": ["harm", "jailbreaking"]
  }'

Response:

{
  "model": "granite-guardian-3.2-5b",
  "flagged": true,
  "verdicts": [
    {"risk": "harm", "unsafe": true, "confidence": 0.9, "reasoning": ""},
    {"risk": "jailbreaking", "unsafe": false, "confidence": 0.3, "reasoning": ""}
  ],
  "latency_ms": 142
}

POST /v1/guard/scan#

Scan against all harm categories:

curl -X POST http://localhost:8080/v1/guard/scan \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite-guardian-3.2-5b",
    "input": {
      "user": "How do I pick a lock?"
    }
  }'

Response includes highest_risk when content is flagged:

{
  "model": "granite-guardian-3.2-5b",
  "flagged": true,
  "highest_risk": "harm",
  "verdicts": [...],
  "latency_ms": 387
}

POST /v1/guard/batch#

Evaluate multiple inputs (up to 256):

curl -X POST http://localhost:8080/v1/guard/batch \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite-guardian-3.2-5b",
    "inputs": [
      {"user": "How do I pick a lock?"},
      {"user": "What is the capital of France?"}
    ],
    "risks": ["harm"]
  }'

Response:

{
  "model": "granite-guardian-3.2-5b",
  "results": [
    {"index": 0, "flagged": true, "verdicts": [...]},
    {"index": 1, "flagged": false, "verdicts": [...]}
  ],
  "latency_ms": 256
}

Request Fields#

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model identifier |
| input | object | Yes | Content to evaluate |
| input.user | string | Yes | User message |
| input.assistant | string | No | Assistant response |
| input.context | string | No | RAG context document |
| risks | string[] | No | Risk categories (default: all harm risks) |
| format | string | No | Output format: "3.0", "3.2", "3.3" |
| think | bool | No | Enable thinking mode (3.3 only) |

Guardrails Middleware#

To add safety guardrails to the chat completions endpoint, configure the server with a Guardian evaluator:

import (
    "github.com/zerfoo/zerfoo/inference/guardian"
    "github.com/zerfoo/zerfoo/serve"
)

eval, _ := guardian.NewEvaluator("granite-guardian-3.2-5b.gguf",
    guardian.WithEvaluatorDevice("cuda"),
)

srv := serve.NewServer(
    serve.WithGuardEvaluator(eval),
    // ... other options
)

When a Guardian evaluator is configured, the server exposes the /v1/guard, /v1/guard/scan, and /v1/guard/batch endpoints. You can also build pre-request and post-response guardrail pipelines by calling Scan before and after chat completions in your application code.

Prometheus Metrics#

When Guardian is enabled, the server exports:

| Metric | Type | Description |
| --- | --- | --- |
| guard_requests_total | Counter | Total guard evaluation requests |
| guard_latency_ms | Histogram | Guard evaluation latency in milliseconds |

Output Format Versions#

| Format | Output Style | Confidence Source |
| --- | --- | --- |
| 3.0 | Plain Yes / No | Softmax over first two logprobs |
| 3.2 | Yes / No + <confidence>High/Low</confidence> | High = 0.9, Low = 0.3 |
| 3.3 | Optional <think>...</think> + <score>yes/no</score> | 1.0 (deterministic from score tag) |
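To make the format differences concrete, here is a sketch of how the three output styles could be parsed into (unsafe, confidence) pairs, using the confidence mappings from the table above. The real package's parsing may differ in details:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	confRe  = regexp.MustCompile(`<confidence>\s*(High|Low)\s*</confidence>`)
	scoreRe = regexp.MustCompile(`<score>\s*(yes|no)\s*</score>`)
)

// ParseVerdict interprets raw model output for the three documented
// format versions, returning (unsafe, confidence).
func ParseVerdict(format, raw string) (bool, float64) {
	switch format {
	case "3.2":
		unsafe := strings.HasPrefix(strings.TrimSpace(raw), "Yes")
		conf := 0.3 // Low, used as the fallback when no tag is found
		if m := confRe.FindStringSubmatch(raw); m != nil && m[1] == "High" {
			conf = 0.9
		}
		return unsafe, conf
	case "3.3":
		// Deterministic: the score tag alone decides the verdict.
		if m := scoreRe.FindStringSubmatch(raw); m != nil {
			return m[1] == "yes", 1.0
		}
		return false, 0
	default:
		// "3.0": plain Yes/No; confidence comes from logprobs upstream,
		// which raw text alone cannot supply.
		return strings.HasPrefix(strings.TrimSpace(raw), "Yes"), 0
	}
}

func main() {
	u, c := ParseVerdict("3.2", "Yes <confidence>High</confidence>")
	fmt.Println(u, c)
}
```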