Vision / Multimodal Inference#
Analyze images using a vision-capable GGUF model. The image is passed alongside a text prompt using the inference.Message API, the same format used by the OpenAI-compatible /v1/chat/completions endpoint.
Requirements#
- A vision-capable GGUF model (e.g. LLaVA, Gemma 3 with vision encoder)
Full Example#
```go
// Recipe 12: Vision / Multimodal Inference
//
// Analyze images using a vision-capable GGUF model. The image is passed
// alongside a text prompt using the inference.Message API, the same format
// used by the OpenAI-compatible /v1/chat/completions endpoint.
//
// Requirements:
//   - A vision-capable GGUF model (e.g. LLaVA, Gemma 3 with vision encoder)
//
// Usage:
//
//	go run ./docs/cookbook/12-vision-multimodal/ --model path/to/vision-model.gguf --image photo.jpg
//	go run ./docs/cookbook/12-vision-multimodal/ --model path/to/vision-model.gguf --image photo.jpg --prompt "Count the objects"
package main

import (
	"context"
	"flag"
	"fmt"
	"os"

	"github.com/zerfoo/zerfoo/inference"
)

func main() {
	modelPath := flag.String("model", "", "path to a vision-capable GGUF model")
	device := flag.String("device", "cpu", `compute device: "cpu", "cuda"`)
	imagePath := flag.String("image", "", "path to an image file (JPEG or PNG)")
	prompt := flag.String("prompt", "Describe this image in detail.", "question about the image")
	maxTokens := flag.Int("max-tokens", 512, "maximum tokens to generate")
	flag.Parse()

	if *modelPath == "" || *imagePath == "" {
		fmt.Fprintln(os.Stderr, "usage: vision-multimodal --model <model.gguf> --image <image.jpg>")
		os.Exit(1)
	}

	// Read the image file into memory.
	imageData, err := os.ReadFile(*imagePath)
	if err != nil {
		fmt.Fprintf(os.Stderr, "read image: %v\n", err)
		os.Exit(1)
	}
	fmt.Fprintf(os.Stderr, "Image: %s (%d bytes)\n", *imagePath, len(imageData))

	// Load the vision-capable model.
	model, err := inference.LoadFile(*modelPath, inference.WithDevice(*device))
	if err != nil {
		fmt.Fprintf(os.Stderr, "load: %v\n", err)
		os.Exit(1)
	}
	defer model.Close()

	cfg := model.Config()
	fmt.Fprintf(os.Stderr, "Model: %s (%d layers)\n", cfg.Architecture, cfg.NumLayers)

	// Build a chat message with the image embedded.
	// The Images field carries raw image bytes, matching the OpenAI vision API.
	messages := []inference.Message{
		{
			Role:    "user",
			Content: *prompt,
			Images:  [][]byte{imageData},
		},
	}

	resp, err := model.Chat(context.Background(), messages,
		inference.WithMaxTokens(*maxTokens),
		inference.WithTemperature(0.5),
	)
	if err != nil {
		fmt.Fprintf(os.Stderr, "generate: %v\n", err)
		os.Exit(1)
	}
	fmt.Println(resp.Content)
}
```
How It Works#
1. Read the image – The image file (JPEG or PNG) is read into a byte slice with os.ReadFile. No preprocessing is needed – the model handles decoding and resizing internally.
2. Load a vision model – Use inference.LoadFile with a vision-capable GGUF model. Models like LLaVA and Gemma 3 with vision encoders include both a vision tower (image encoder) and a language model in a single GGUF file.
3. Build a multimodal message – The inference.Message struct accepts both text (Content) and images (Images, as raw bytes). This mirrors the OpenAI chat completions API, where images are passed inline.
4. Generate – model.Chat processes the image through the vision encoder, projects the image embeddings into the language model's space, and generates a text response conditioned on both the image and the text prompt.
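
Because the model decodes the image itself, an unsupported file would only fail once it reaches the model. The sketch below shows one optional way to reject non-JPEG/PNG input before loading a model, using the standard library's http.DetectContentType to sniff the file signature. The helper name checkImageFormat is hypothetical and not part of the inference package; in the recipe it would be called on the bytes returned by os.ReadFile.

```go
package main

import (
	"fmt"
	"net/http"
)

// checkImageFormat sniffs the leading bytes of data and returns the detected
// MIME type. It returns an error unless the data looks like JPEG or PNG.
// Hypothetical helper: not part of the zerfoo inference package.
func checkImageFormat(data []byte) (string, error) {
	// http.DetectContentType inspects at most the first 512 bytes.
	mime := http.DetectContentType(data)
	switch mime {
	case "image/jpeg", "image/png":
		return mime, nil
	default:
		return mime, fmt.Errorf("unsupported image type %q: want JPEG or PNG", mime)
	}
}

func main() {
	// Stand-in inputs; real code would pass the bytes read from the image file.
	samples := []struct {
		name string
		data []byte
	}{
		{"png header", []byte("\x89PNG\r\n\x1a\n")},
		{"jpeg header", []byte{0xFF, 0xD8, 0xFF, 0xE0}},
		{"plain text", []byte("not an image")},
	}
	for _, s := range samples {
		mime, err := checkImageFormat(s.data)
		if err != nil {
			fmt.Printf("%s: rejected (%v)\n", s.name, err)
			continue
		}
		fmt.Printf("%s: ok (%s)\n", s.name, mime)
	}
}
```

The check only looks at the file signature, so it catches a wrong file type early but does not guarantee that the image decodes cleanly.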
Supported Vision Models#
| Model | Architecture | Notes |
|---|---|---|
| LLaVA | LLaVA | CLIP vision encoder + Llama language model |
| Gemma 3 (with vision) | Gemma 3 | SigLIP vision encoder + Gemma language model |
Use Cases#
- Image captioning: “Describe this image in detail.”
- Visual question answering: “How many people are in this photo?”
- Document understanding: “Extract the text from this receipt.”
- Object detection: “List all objects visible in this image.”
Related API Reference#
- Inference API – inference.LoadFile, inference.Message, and model.Chat
- Serve API – the /v1/chat/completions endpoint accepts images via the OpenAI vision API format
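
For the Serve API route, the request body in the OpenAI vision format can be sketched as follows. In that format the user message's content is a list of parts, and the image travels as an image_url part, here as a base64 data URL. The struct and function names are hypothetical, the model name "vision-model" is a placeholder, and whether the Zerfoo server accepts data URLs in exactly this shape is an assumption to check against the Serve API reference.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// The types below mirror the OpenAI chat completions request shape for
// image input. They are illustrative and not taken from the zerfoo codebase.
type imageURL struct {
	URL string `json:"url"`
}

type contentPart struct {
	Type     string    `json:"type"`
	Text     string    `json:"text,omitempty"`
	ImageURL *imageURL `json:"image_url,omitempty"`
}

type chatMessage struct {
	Role    string        `json:"role"`
	Content []contentPart `json:"content"`
}

type chatRequest struct {
	Model     string        `json:"model"`
	Messages  []chatMessage `json:"messages"`
	MaxTokens int           `json:"max_tokens,omitempty"`
}

// imageDataURL encodes raw image bytes as a base64 data URL.
func imageDataURL(mimeType string, image []byte) string {
	return "data:" + mimeType + ";base64," + base64.StdEncoding.EncodeToString(image)
}

// buildVisionRequest returns the JSON body for a single user message that
// carries one text part and one image part.
func buildVisionRequest(model, prompt, mimeType string, image []byte) ([]byte, error) {
	req := chatRequest{
		Model: model,
		Messages: []chatMessage{{
			Role: "user",
			Content: []contentPart{
				{Type: "text", Text: prompt},
				{Type: "image_url", ImageURL: &imageURL{URL: imageDataURL(mimeType, image)}},
			},
		}},
		MaxTokens: 512,
	}
	return json.MarshalIndent(req, "", "  ")
}

func main() {
	// Stand-in bytes; real code would pass the contents of the image file.
	image := []byte{0xFF, 0xD8, 0xFF, 0xE0}
	body, err := buildVisionRequest("vision-model", "Describe this image in detail.", "image/jpeg", image)
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(string(body))
}
```

The resulting body would be sent as a POST to /v1/chat/completions with a Content-Type of application/json.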