Engine[T] API Freeze for v1.0#
Status: Frozen Date: 2026-03-18 ADR: Supplements ADR-037 (GGUF sole format) and ADR-044+ (v1.0 roadmap)
Overview#
compute.Engine[T tensor.Numeric] is the central compute abstraction in
github.com/zerfoo/ztensor/compute. Every tensor arithmetic operation in the
Zerfoo ecosystem flows through this interface, enabling transparent CPU/GPU
switching and CUDA graph capture. This document catalogs the frozen v1.0
surface and defines extension interfaces for specialized hardware capabilities.
Core Interface — Engine[T tensor.Numeric]#
All implementations must provide every method below. The interface is
defined in compute/engine.go.
Type Metadata#
| # | Method | Signature | Purpose |
|---|---|---|---|
| 1 | Ops | () numeric.Arithmetic[T] | Returns type-specific arithmetic operations |
Element-wise Arithmetic#
| # | Method | Signature | Purpose |
|---|---|---|---|
| 2 | Add | (ctx, a, b, dst...) (*Tensor, error) | Element-wise addition with broadcasting |
| 3 | Sub | (ctx, a, b, dst...) (*Tensor, error) | Element-wise subtraction with broadcasting |
| 4 | Mul | (ctx, a, b, dst...) (*Tensor, error) | Element-wise multiplication with broadcasting |
| 5 | Div | (ctx, a, b, dst...) (*Tensor, error) | Element-wise division with broadcasting |
| 6 | Pow | (ctx, base, exp, dst...) (*Tensor, error) | Element-wise power |
| 7 | MulScalar | (ctx, a, scalar, dst...) (*Tensor, error) | Multiply tensor by scalar |
| 8 | DivScalar | (ctx, a, scalar, dst...) (*Tensor, error) | Divide tensor by scalar |
| 9 | AddScalar | (ctx, a, scalar, dst...) (*Tensor, error) | Add scalar to tensor |
Unary Math#
| # | Method | Signature | Purpose |
|---|---|---|---|
| 10 | UnaryOp | (ctx, a, func(T)T, dst...) (*Tensor, error) | Apply arbitrary unary function |
| 11 | Exp | (ctx, a, dst...) (*Tensor, error) | Element-wise exponential |
| 12 | Log | (ctx, a, dst...) (*Tensor, error) | Element-wise natural logarithm |
| 13 | Sin | (ctx, a, dst...) (*Tensor, error) | Element-wise sine |
| 14 | Cos | (ctx, a, dst...) (*Tensor, error) | Element-wise cosine |
| 15 | Tanh | (ctx, a, dst...) (*Tensor, error) | Element-wise hyperbolic tangent |
| 16 | TanhPrime | (ctx, a, upstream, dst...) (*Tensor, error) | Tanh gradient for backprop |
| 17 | Sqrt | (ctx, a, dst...) (*Tensor, error) | Element-wise square root |
| 18 | Rsqrt | (ctx, a, dst...) (*Tensor, error) | Element-wise reciprocal square root |
Linear Algebra#
| # | Method | Signature | Purpose |
|---|---|---|---|
| 19 | MatMul | (ctx, a, b, dst...) (*Tensor, error) | Matrix multiplication (2D+) |
Reduction#
| # | Method | Signature | Purpose |
|---|---|---|---|
| 20 | Sum | (ctx, a, axis, keepDims, dst...) (*Tensor, error) | Sum along axis |
| 21 | ReduceSum | (ctx, a, axis, keepDims, dst...) (*Tensor, error) | Reduction sum (optimized path) |
| 22 | ReduceMean | (ctx, a, axis, keepDims, dst...) (*Tensor, error) | Mean along axis |
| 23 | Softmax | (ctx, a, axis, dst...) (*Tensor, error) | Softmax along axis |
Shape Manipulation#
| # | Method | Signature | Purpose |
|---|---|---|---|
| 24 | Transpose | (ctx, a, axes, dst...) (*Tensor, error) | Transpose along given axes |
| 25 | Reshape | (ctx, a, shape, dst...) (*Tensor, error) | Reshape without copying data |
| 26 | Split | (ctx, a, numSplits, axis) ([]*Tensor, error) | Split tensor along axis |
| 27 | Concat | (ctx, tensors, axis, dst...) (*Tensor, error) | Concatenate tensors along axis |
| 28 | Repeat | (ctx, a, axis, reps, dst...) (*Tensor, error) | Repeat tensor along axis |
Embedding / Scatter#
| # | Method | Signature | Purpose |
|---|---|---|---|
| 29 | Gather | (ctx, params, indices, output) error | Embedding-style gather (2D params, 1D/2D indices) |
| 30 | ScatterAdd | (ctx, dEmbedTable, indices, dOut) error | Scatter-add for embedding gradients |
Initialization / Memory#
| # | Method | Signature | Purpose |
|---|---|---|---|
| 31 | Zero | (ctx, a) error | Set all elements to zero |
| 32 | Zeros | (ctx, a, shape) error | Fill with zeros, optionally reallocate shape |
| 33 | Copy | (ctx, dst, src) error | Copy data between tensors |
| 34 | Fill | (ctx, t, value) error | Fill tensor with scalar |
| 35 | RandomUniform | (ctx, t, min, max) error | Fill with uniform random values |
| 36 | OneHot | (ctx, input, depth, dst...) (*Tensor, error) | One-hot encoding |
Total: 36 methods in the core interface.
Extension Interfaces#
Extension interfaces are optional. Callers use Go type assertions
(eng.(FusedRoPEProvider[T])) to check availability at runtime. All extension
interfaces are defined in the compute/ package alongside Engine[T].
Fused Kernel Providers#
These eliminate kernel-launch overhead by combining multiple operations.
| Interface | Method | Purpose |
|---|---|---|
FusedRMSNormer | FusedRMSNormGPU(input, weight, eps) | Fused RMSNorm on GPU (float32 only) |
FusedAddRMSNormProvider[T] | GPUFusedAddRMSNorm(input, residual, weight, eps) | Fused residual-add + RMSNorm |
FusedNormAddProvider[T] | GPUFusedNormAdd(input, weight, residual, eps) | Fused RMSNorm + element-wise add |
FusedScaledSoftmaxProvider[T] | GPUScaledSoftmax(input, scale, axis) | Fused scale + softmax |
FusedRoPEProvider[T] | GPUFusedRoPE(input, cos, sin, rotaryDim) | Fused rotary position embedding |
FusedSwiGLUProvider[T] | GPUFusedSwiGLU(w1, w3) | Fused SwiGLU activation |
FusedQKNormRoPEProvider[T] | GPUFusedQKNormRoPE(input, wQ, wK, cos, sin, eps, ...) | Fused QK-norm + RoPE (4 kernels -> 1) |
Mixed-Precision / Quantized#
| Interface | Method | Purpose |
|---|---|---|
FP16ToF32Converter | ConvertFP16ToF32(t) | Convert Float16Storage tensor to float32 GPU tensor |
TransposeBMatMuler[T] | MatMulTransposeB(ctx, a, b, dst...) | C = A * B^T without explicit transpose allocation |
W4A16MatMuler[T] | MatMulW4A16(ctx, a, b, dst...) | Mixed 4-bit weight / FP16 activation MatMul |
GPU Infrastructure#
| Interface | Method(s) | Purpose |
|---|---|---|
StreamProvider | Stream() unsafe.Pointer | Expose GPU stream for CUDA graph capture |
GPUStreamAccessor | GPUStream() gpuapi.Stream | Typed GPU stream for async memory ops |
PoolResetter | ResetPool() | O(1) arena reset between forward passes |
WeightUploader | UploadWeights([]*Tensor) error | Pre-upload weights to device memory |
GPUArgmaxer | GPUArgmax(t) (int, error) | GPU-side argmax (avoids ~1MB D2H copy per token) |
Paged Attention#
| Interface | Method(s) | Purpose |
|---|---|---|
PagedGQAer | PagedGQA(Q, blockPtrsK, blockPtrsV, blockIndices, ...) | Paged grouped-query attention via block-table indirection |
IsPagedGQASupported() bool | Runtime availability check |
Proposed Extension Interfaces for v1.0#
The following extension interfaces are recommended for formalization before the v1.0 freeze. They capture patterns already present in the codebase as unexported methods or ad-hoc type assertions.
EngineWithFP8#
FP8 E4M3 support is currently handled inside GPUEngine.MatMul via storage
type inspection. Formalizing it as an extension interface makes FP8 capability
discoverable and testable.
// EngineWithFP8 is an optional interface for engines that support
// FP8 E4M3FN quantized matrix multiplication.
type EngineWithFP8 interface {
// MatMulFP8 performs C = dequant(A_fp8) * B_f32 or C = A_f32 * dequant(B_fp8).
// The FP8 operand is identified by its FP8E4M3Storage.
MatMulFP8(ctx context.Context, a, b *tensor.TensorNumeric[float32],
dst ...*tensor.TensorNumeric[float32]) (*tensor.TensorNumeric[float32], error)
// IsFP8Supported returns true if the hardware supports FP8 operations.
IsFP8Supported() bool
}EngineWithPagedKV#
Paged KV cache management is tightly coupled to the attention kernel. An extension interface makes the capability explicit for serving-layer code.
// EngineWithPagedKV extends PagedGQAer with KV cache block management.
type EngineWithPagedKV interface {
PagedGQAer
// AllocKVBlocks allocates N blocks of [blockSize, headDim] on device.
AllocKVBlocks(n, blockSize, headDim, numKVHeads int) (blockPtrs unsafe.Pointer, err error)
// FreeKVBlocks releases previously allocated KV blocks.
FreeKVBlocks(blockPtrs unsafe.Pointer, n int) error
// AppendKVToken appends a single token's K and V vectors to the paged cache.
AppendKVToken(blockPtrs unsafe.Pointer, blockIndices unsafe.Pointer,
k, v *tensor.TensorNumeric[float32], seqPos, blockSize, headDim int) error
}EngineWithBF16#
BFloat16 is used in some model weights. A formal extension makes BF16 capability discoverable.
// EngineWithBF16 is an optional interface for engines that support
// BFloat16 arithmetic or mixed BF16/F32 operations.
type EngineWithBF16 interface {
// MatMulBF16 performs matrix multiplication with BF16 inputs.
MatMulBF16(ctx context.Context, a, b *tensor.TensorNumeric[float32],
dst ...*tensor.TensorNumeric[float32]) (*tensor.TensorNumeric[float32], error)
// IsBF16Supported returns true if the hardware supports BF16 operations.
IsBF16Supported() bool
}Implementation Matrix#
| Capability | CPUEngine | GPUEngine | ROCmEngine | OpenCLEngine |
|---|---|---|---|---|
| Core Engine[T] | Full | Full | Partial | Partial |
| FusedRMSNormer | – | Yes | – | – |
| FusedAddRMSNormProvider | – | Yes | – | – |
| FusedNormAddProvider | – | Yes | – | – |
| FusedScaledSoftmaxProvider | – | Yes | – | – |
| FusedRoPEProvider | – | Yes | – | – |
| FusedSwiGLUProvider | – | Yes | – | – |
| FusedQKNormRoPEProvider | – | Yes | – | – |
| FP16ToF32Converter | – | Yes | – | – |
| TransposeBMatMuler | – | Yes | – | – |
| W4A16MatMuler | – | Yes (planned) | – | – |
| StreamProvider | – | Yes | – | – |
| GPUStreamAccessor | – | Yes | – | – |
| PoolResetter | – | Yes | – | – |
| WeightUploader | – | Yes | – | – |
| GPUArgmaxer | – | Yes | – | – |
| PagedGQAer | – | Yes | – | – |
| EngineWithFP8 (proposed) | – | Implicit | – | – |
| EngineWithPagedKV (proposed) | – | Partial | – | – |
| EngineWithBF16 (proposed) | – | Implicit | – | – |
Recommendations Before v1.0 Freeze#
1. Consolidate Sum and ReduceSum#
Sum and ReduceSum have identical signatures and overlapping semantics.
For v1.0 clarity, either:
- Option A: Remove
ReduceSum, keepSum(fewer methods, simpler). - Option B: Differentiate clearly in docs.
Recommendation: Option A. One method, one name.
2. Make FusedRMSNormer Generic#
FusedRMSNormer is the only fused interface that uses concrete float32
instead of the generic [T tensor.Numeric] pattern used by all other fused
providers. This inconsistency should be resolved.
3. Formalize FP8 as an Extension Interface#
FP8 MatMul dispatch currently happens via storage-type inspection deep inside
GPUEngine.MatMul. Extracting it into EngineWithFP8 would make FP8
capability explicitly discoverable.
4. Add Close() error to Engine[T]#
GPU engines hold device memory (arena, scratchpads, BLAS handles). There is no
standard way to release these resources. Adding Close() error to the core
interface would enable deterministic cleanup.
5. Standardize dst ... Optional Output Pattern#
All 21 methods that return (*Tensor, error) accept a variadic dst parameter
for in-place output. The convention should be documented:
- If
dst[0]is non-nil, the result is written into it (shape must match). - If
dstis empty ordst[0]is nil, a new tensor is allocated. - Implementations must not read from
dst[0](it may contain stale data).
Compatibility Contract#
Once frozen at v1.0:
- No method removals from
Engine[T]. - No signature changes to existing methods.
- New methods require a new extension interface (not added to core).
- Extension interfaces may be added freely in minor versions.
- Behavioral changes (e.g., broadcasting rules) require an ADR.
EngineProxy Compatibility#
EngineProxy[T] wraps Engine[T] and forwards all 36 core methods. It also
exposes proxy methods for key extension interfaces:
FusedRMSNormGPU— delegates toFusedRMSNormerGPUFusedAddRMSNorm— delegates toFusedAddRMSNormProvider[T]MatMulTransposeB— delegates toTransposeBMatMuler[T](with fallback)ResetPool— delegates toPoolResetterArenaUsedBytes/SetArenaResetFloor— delegates via ad-hoc assertions
The proxy also implements TraceRecorder[T] integration for computation graph
tracing (used by graph.CompileTraced).
Any new extension interface added post-freeze should include an EngineProxy
delegation method if it needs to be visible through traced execution.