ztensor#
GPU-accelerated tensor, compute engine, and computation graph library for Go. Current version: v0.15.0.
go get github.com/zerfoo/ztensor
Overview#
ztensor is the foundational tensor and compute library in the Zerfoo ecosystem. It provides multi-type tensor storage, a unified compute engine interface across CPU and GPU backends, a computation graph compiler with operator fusion, and GPU memory management – all without CGo.
If you are building an ML inference engine, need GPU compute from Go, or want a typed tensor library, ztensor is the package to import.
When to Use ztensor Directly#
| Use case | Import |
|---|---|
| Tensor math, GPU compute, custom ML operators | github.com/zerfoo/ztensor directly |
| Transformer inference, model serving, training | github.com/zerfoo/zerfoo (imports ztensor internally) |
Import ztensor directly when you need tensor operations or GPU compute without the full inference/serving stack. If you are running transformer models, use zerfoo – it builds on ztensor for you.
Tensor Creation#
Tensors are generic over all numeric types via the tensor.Numeric constraint:
import (
	"fmt"
	"github.com/zerfoo/ztensor/tensor"
)
// Create a 2x3 float32 tensor (error ignored for brevity)
a, _ := tensor.New[float32]([]int{2, 3}, []float32{1, 2, 3, 4, 5, 6})
fmt.Println(a.Shape()) // [2 3]
fmt.Println(a.Data())  // [1 2 3 4 5 6]
Supported element types include float32, float64, float16.Float16, float16.BFloat16, float8.Float8, and all Go integer types.
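To illustrate how a shape and a flat data slice fit together, here is a minimal standard-library sketch. The `Numeric` constraint, the `Dense` type, and its `At` method are hypothetical stand-ins written for this example, not ztensor's actual definitions, and the row-major layout is an assumption (it is consistent with the MatMul result shown in the CPU Engine section).

```go
package main

import "fmt"

// Numeric is a hypothetical stand-in for tensor.Numeric; the real
// constraint also covers half-precision and FP8 types.
type Numeric interface {
	~int | ~int8 | ~int16 | ~int32 | ~int64 |
		~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64 |
		~float32 | ~float64
}

// Dense is an illustrative row-major tensor, not ztensor's own type.
type Dense[T Numeric] struct {
	shape []int
	data  []T
}

// NewDense checks that len(data) equals the product of the dimensions.
func NewDense[T Numeric](shape []int, data []T) (*Dense[T], error) {
	n := 1
	for _, d := range shape {
		if d < 0 {
			return nil, fmt.Errorf("negative dimension %d", d)
		}
		n *= d
	}
	if n != len(data) {
		return nil, fmt.Errorf("shape %v needs %d elements, got %d", shape, n, len(data))
	}
	return &Dense[T]{shape: shape, data: data}, nil
}

// At maps a multi-dimensional index to a flat row-major offset.
func (t *Dense[T]) At(idx ...int) T {
	off := 0
	for i, d := range t.shape {
		off = off*d + idx[i]
	}
	return t.data[off]
}

func main() {
	a, err := NewDense[float32]([]int{2, 3}, []float32{1, 2, 3, 4, 5, 6})
	if err != nil {
		panic(err)
	}
	fmt.Println(a.At(1, 0)) // row 1, column 0 is flat offset 3: prints 4
}
```

The length check is the reason a constructor of this kind returns an error: a shape of [2 2] cannot describe three elements.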
Compute Engine#
All arithmetic flows through the compute.Engine[T] interface. This enables transparent CPU/GPU switching and CUDA graph capture.
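The idea of routing arithmetic through one interface can be sketched with the standard library alone. Everything below is hypothetical: the reduced `Engine` interface, the naive `cpuEngine`, and the always-failing `newGPUEngine` are illustrations of the pattern, not ztensor's real types or signatures.

```go
package main

import (
	"errors"
	"fmt"
)

// Engine is a hypothetical, much-reduced stand-in for compute.Engine[T].
type Engine[T ~float32 | ~float64] interface {
	Name() string
	// MatMul multiplies a row-major (m x k) matrix by a (k x n) matrix.
	MatMul(a, b []T, m, k, n int) ([]T, error)
}

// cpuEngine is a naive reference backend.
type cpuEngine[T ~float32 | ~float64] struct{}

func (cpuEngine[T]) Name() string { return "cpu" }

func (cpuEngine[T]) MatMul(a, b []T, m, k, n int) ([]T, error) {
	if len(a) != m*k || len(b) != k*n {
		return nil, errors.New("matmul: data length does not match shape")
	}
	c := make([]T, m*n)
	for i := 0; i < m; i++ {
		for j := 0; j < n; j++ {
			var sum T
			for p := 0; p < k; p++ {
				sum += a[i*k+p] * b[p*n+j]
			}
			c[i*n+j] = sum
		}
	}
	return c, nil
}

// newGPUEngine simulates a host without a GPU runtime: it always fails.
func newGPUEngine[T ~float32 | ~float64]() (Engine[T], error) {
	return nil, errors.New("gpu runtime not available")
}

// pickEngine tries the GPU backend first and falls back to the CPU.
func pickEngine[T ~float32 | ~float64]() Engine[T] {
	if eng, err := newGPUEngine[T](); err == nil {
		return eng
	}
	return cpuEngine[T]{}
}

func main() {
	eng := pickEngine[float32]()
	a := []float32{1, 2, 3, 4, 5, 6} // shape [2 3]
	b := []float32{1, 2, 3, 4, 5, 6} // shape [3 2]
	c, err := eng.MatMul(a, b, 2, 3, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(eng.Name(), c) // cpu [22 28 49 64]
}
```

Because the caller only sees the interface, swapping the backend does not change the calling code, which is the property the engine abstraction relies on.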
CPU Engine#
import (
	"context"
	"fmt"
	"github.com/zerfoo/ztensor/compute"
	"github.com/zerfoo/ztensor/numeric"
	"github.com/zerfoo/ztensor/tensor"
)
ctx := context.Background()
eng := compute.NewCPUEngine[float32](numeric.Float32Ops{})
// Errors are ignored for brevity; check them in real code.
a, _ := tensor.New[float32]([]int{2, 3}, []float32{1, 2, 3, 4, 5, 6})
b, _ := tensor.New[float32]([]int{3, 2}, []float32{1, 2, 3, 4, 5, 6})
c, _ := eng.MatMul(ctx, a, b) // [2 3] x [3 2] -> [2 2]
fmt.Println(c.Shape()) // [2 2]
fmt.Println(c.Data())  // [22 28 49 64]
GPU Engine#
GPU libraries are loaded at runtime via purego – no CGo, no build tags, no linking. If the GPU runtime is not available, the constructor returns an error and you fall back to CPU.
// Pick one constructor; each returns an error if its runtime cannot be loaded.
// CUDA (NVIDIA GPUs)
eng, err := compute.NewGPUEngine[float32](numeric.Float32Ops{})
// ROCm (AMD GPUs)
eng, err := compute.NewROCmEngine[float32](numeric.Float32Ops{})
// OpenCL (cross-vendor)
eng, err := compute.NewOpenCLEngine[float32](numeric.Float32Ops{})
A common pattern is to try GPU first with a CPU fallback:
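// Declare eng with the interface type so either backend can be assigned to it.
var eng compute.Engine[float32] = compute.NewCPUEngine[float32](numeric.Float32Ops{})
if gpu, err := compute.NewGPUEngine[float32](numeric.Float32Ops{}); err == nil {
	eng = gpu
}
Type-Safe Generics#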
Write functions that work across any numeric type:
func dotProduct[T tensor.Numeric](
	eng compute.Engine[T],
	a, b *tensor.TensorNumeric[T],
) (*tensor.TensorNumeric[T], error) {
	return eng.MatMul(context.Background(), a, b)
}
Computation Graph#
The graph package provides a computation graph compiler with operator fusion passes and CUDA graph capture for optimized inference:
| Feature | Description |
|---|---|
| Operator fusion | Combines adjacent operations to reduce kernel launches |
| CUDA graph capture | Records and replays GPU execution for minimal launch overhead |
| Megakernel codegen | Generates fused GPU kernels at compile time |
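As a rough illustration of what operator fusion buys, the sketch below compares an unfused add-then-ReLU with a fused single pass. It is plain CPU Go with hypothetical function names, and it does not reflect how the graph package actually represents or rewrites operations; on a GPU, each unfused pass would correspond to a separate kernel launch.

```go
package main

import "fmt"

// addThenReLU runs two separate passes and materializes an intermediate slice.
// x and bias are assumed to have the same length.
func addThenReLU(x, bias []float32) []float32 {
	tmp := make([]float32, len(x))
	for i := range x {
		tmp[i] = x[i] + bias[i]
	}
	out := make([]float32, len(x))
	for i, v := range tmp {
		if v > 0 {
			out[i] = v
		}
	}
	return out
}

// fusedAddReLU computes the same result in one pass with no intermediate.
func fusedAddReLU(x, bias []float32) []float32 {
	out := make([]float32, len(x))
	for i := range x {
		if v := x[i] + bias[i]; v > 0 {
			out[i] = v
		}
	}
	return out
}

func main() {
	x := []float32{-2, -1, 0, 1, 2}
	bias := []float32{0.5, 0.5, 0.5, 0.5, 0.5}
	fmt.Println(addThenReLU(x, bias))  // [0 0 0.5 1.5 2.5]
	fmt.Println(fusedAddReLU(x, bias)) // [0 0 0.5 1.5 2.5]
}
```

The fused version reads each input once and writes each output once, which is the memory-traffic saving that fusion passes aim for.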
Package Reference#
| Package | Description |
|---|---|
| tensor/ | Multi-type tensor storage (CPU, GPU, quantized) |
| compute/ | Engine interface with CPU, CUDA, ROCm, and OpenCL implementations |
| graph/ | Computation graph compiler with fusion and CUDA graph capture |
| numeric/ | Type-safe Arithmetic[T] interface for all numeric types |
| device/ | Device abstraction and memory allocators |
| internal/cuda/ | Zero-CGo CUDA runtime bindings via purego, 25+ custom kernels |
| internal/xblas/ | ARM NEON and x86 AVX2 SIMD assembly |
| internal/gpuapi/ | GPU Runtime Abstraction Layer (CUDA/ROCm/OpenCL) |
| internal/codegen/ | Megakernel code generator |
What’s New in v0.15.0#
MmapStorage.SliceElements#
MmapStorage.SliceElements provides zero-copy slicing of mmap’d tensor elements. It returns a view into the memory-mapped region without copying data, making expert weight extraction in mixture-of-experts models efficient:
// Extract expert weights directly from the mmap'd file; no allocation
expertWeights, err := mmapStorage.SliceElements(expertOffset, expertSize)
This replaces the previous pattern of copying expert weights into a new tensor before each forward pass.
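The zero-copy idea can be shown with ordinary Go slices, which already behave as views over a shared backing array. The `sliceElements` helper below is a hypothetical stand-in, with an in-memory slice playing the role of the mapped file; treating the second argument as an element count is an assumption about the real method.

```go
package main

import (
	"errors"
	"fmt"
)

// sliceElements returns a view of n elements starting at offset.
// The result aliases data; nothing is copied.
func sliceElements(data []float32, offset, n int) ([]float32, error) {
	if offset < 0 || n < 0 || offset+n > len(data) {
		return nil, errors.New("slice out of range")
	}
	// The three-index form clamps capacity so an append on the view
	// cannot overwrite the neighbouring expert's weights.
	return data[offset : offset+n : offset+n], nil
}

func main() {
	// Stand-in for a mapped weight file: 3 experts with 4 weights each.
	weights := []float32{0, 1, 2, 3, 10, 11, 12, 13, 20, 21, 22, 23}
	const expertSize = 4

	expert1, err := sliceElements(weights, 1*expertSize, expertSize)
	if err != nil {
		panic(err)
	}
	fmt.Println(expert1)                    // [10 11 12 13]
	fmt.Println(&expert1[0] == &weights[4]) // true: same memory, no copy
}
```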
Streaming GEMM for mmap’d Tensors#
internal/xblas now includes a streaming GEMM path for mmap’d weight tensors. Instead of paging in the entire weight matrix before computation, the kernel tiles over the mmap region in cache-sized chunks, so the resident working set stays proportional to the active tile rather than the full matrix.
This enables over-RAM CPU inference: a model whose weights exceed physical RAM can run without a GPU, with the OS paging tensor data in from NVMe on demand. Combined with MmapStorage.SliceElements, a 229B MoE model runs on a 128 GB machine with no configuration flags.
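A minimal sketch of the tiling idea, under stated assumptions: the weight matrix is row-major, it is visited in blocks of whole rows, and each block is touched once. The `tiledMatMul` function is hypothetical and scalar; the real path in internal/xblas uses SIMD kernels and its tile shape is not described here.

```go
package main

import "fmt"

// tiledMatMul computes C = A (m x k) * W (k x n), all row-major, visiting W
// in blocks of tileRows rows so only one block of W is in use at a time.
func tiledMatMul(a, w []float32, m, k, n, tileRows int) []float32 {
	if tileRows <= 0 {
		tileRows = 1
	}
	c := make([]float32, m*n)
	for p0 := 0; p0 < k; p0 += tileRows {
		p1 := p0 + tileRows
		if p1 > k {
			p1 = k
		}
		tile := w[p0*n : p1*n] // contiguous view of rows [p0, p1) of W
		for i := 0; i < m; i++ {
			for p := p0; p < p1; p++ {
				aip := a[i*k+p]
				row := tile[(p-p0)*n : (p-p0+1)*n]
				for j, wv := range row {
					c[i*n+j] += aip * wv
				}
			}
		}
	}
	return c
}

func main() {
	a := []float32{1, 2, 3, 4, 5, 6} // activations, shape [2 3]
	w := []float32{1, 2, 3, 4, 5, 6} // weights, shape [3 2]
	fmt.Println(tiledMatMul(a, w, 2, 3, 2, 2)) // [22 28 49 64]
}
```

With a memory-mapped `w`, each tile would be paged in by the OS when first read and could be evicted once the loop moves on, so the result does not depend on the whole matrix fitting in RAM.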
Dependencies#
ztensor depends on float16 and float8 for half-precision and FP8 arithmetic.