Numeric Types#
Zerfoo provides two libraries for reduced-precision floating-point arithmetic: float16 (IEEE 754 half-precision and BFloat16) and float8 (FP8 E4M3FN). These are used throughout ztensor for quantized tensor storage and mixed-precision compute.
At a Glance#
| Type | Package | Bits | Format | Range | Precision | Best For |
|---|---|---|---|---|---|---|
| Float16 | float16 | 16 | 1 sign + 5 exp + 10 mantissa | ~6.55 x 10^4 | ~3-4 digits | Inference weights, activations |
| BFloat16 | float16 | 16 | 1 sign + 8 exp + 7 mantissa | ~3.39 x 10^38 | ~2-3 digits | Training (same range as float32) |
| Float8 | float8 | 8 | 1 sign + 4 exp + 3 mantissa (E4M3FN) | ~448 | ~1-2 digits | Quantized inference, memory savings |
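The Range column follows directly from each format's bit layout. The sketch below uses only the Go standard library to recompute the largest finite value of each format. It is an illustration of the arithmetic, not code from the float16 or float8 packages, and the helper name `maxFinite` is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// maxFinite (hypothetical helper) returns the largest finite value of a binary
// floating-point format with expBits exponent bits and manBits mantissa bits.
//
// ieeeSpecials = true:  the all-ones exponent is reserved for Inf/NaN
//                       (Float16, BFloat16).
// ieeeSpecials = false: the all-ones exponent holds finite values and only the
//                       all-ones mantissa is NaN (E4M3FN).
func maxFinite(expBits, manBits int, ieeeSpecials bool) float64 {
	bias := (1 << (expBits - 1)) - 1
	maxExpField := (1 << expBits) - 1
	maxMan := (1 << manBits) - 1
	if ieeeSpecials {
		maxExpField-- // top exponent code is not a finite number
	} else {
		maxMan-- // top mantissa code at the top exponent is NaN
	}
	den := 1 << manBits
	significand := 1 + float64(maxMan)/float64(den)
	return math.Ldexp(significand, maxExpField-bias)
}

func main() {
	fmt.Printf("Float16:  %g\n", maxFinite(5, 10, true))
	fmt.Printf("BFloat16: %g\n", maxFinite(8, 7, true))
	fmt.Printf("Float8:   %g\n", maxFinite(4, 3, false))
}
```

The three calls give 65504 for Float16, about 3.39 x 10^38 for BFloat16, and 448 for E4M3FN, matching the table.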
float16#
go get github.com/zerfoo/float16
The float16 package provides two types in a single module: Float16 (IEEE 754 half-precision) and BFloat16 (Brain Floating Point).
Float16#
Standard IEEE 754 half-precision with 10 bits of mantissa. Good precision for inference weights and activations, but limited range.
import (
	"fmt"

	"github.com/zerfoo/float16"
)
a := float16.FromFloat32(3.14159)
b := float16.FromFloat64(2.71828)
sum := a.Add(b)
product := a.Mul(b)
fmt.Printf("Sum: %f\n", sum.ToFloat32())
fmt.Printf("Product: %f\n", product.ToFloat32())
BFloat16#
Same exponent range as float32 (8 exponent bits) with a reduced mantissa (7 bits). Preferred for training because values that fit in float32 almost never overflow or underflow when converted to BFloat16, whereas Float16 overflows above 65504 and underflows to zero below roughly 6 x 10^-8.
bf := float16.BFloat16FromFloat32(1.5)
f32 := bf.ToFloat32()
Special Values and Classification#
f := float16.FromFloat32(3.14)
f.IsInf(0) // check for infinity
f.IsNaN() // check for NaN
f.IsFinite() // check for finite
f.IsNormal() // check for normalized
f.IsSubnormal() // check for subnormal
Rounding Modes#
config := float16.GetConfig()
config.DefaultRoundingMode = float16.RoundNearestEven // default
float16.Configure(config)
// Available: RoundNearestEven, RoundTowardZero,
// RoundTowardPositive, RoundTowardNegative, RoundNearestAway
Vectorized Operations#
a := []float16.Float16{...}
b := []float16.Float16{...}
sum := float16.VectorAdd(a, b)
product := float16.VectorMul(a, b)
float8#
go get github.com/zerfoo/float8
The float8 package implements FP8 E4M3FN, an 8-bit floating-point format widely used for quantized ML inference. It has no infinity representation (the E4M3FN variant uses that encoding for additional finite values).
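To make the "no infinity" point concrete, here is a standard-library sketch that decodes a single E4M3FN byte by hand. It reflects the published E4M3FN bit layout (bias 7, one NaN pattern per sign) rather than the float8 package's internals, and `decodeE4M3FN` is a hypothetical name.

```go
package main

import (
	"fmt"
	"math"
)

// decodeE4M3FN (hypothetical) converts one FP8 E4M3FN byte to float32.
// Layout: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
func decodeE4M3FN(b uint8) float32 {
	sign := 1.0
	if b&0x80 != 0 {
		sign = -1.0
	}
	exp := int(b>>3) & 0x0F
	man := int(b & 0x07)

	switch {
	case exp == 0x0F && man == 0x07:
		// The only NaN encoding for each sign; there is no infinity.
		return float32(math.NaN())
	case exp == 0:
		// Subnormal: no implicit leading 1, fixed exponent 1 - bias = -6.
		return float32(sign * math.Ldexp(float64(man)/8, -6))
	default:
		// Normal: implicit leading 1, exponent field minus bias.
		return float32(sign * math.Ldexp(1+float64(man)/8, exp-7))
	}
}

func main() {
	for _, b := range []uint8{0x38, 0x78, 0x7E, 0x7F, 0x01} {
		fmt.Printf("0x%02X -> %g\n", b, decodeE4M3FN(b))
	}
}
```

In an IEEE-style layout the pattern 0x78 (all-ones exponent, zero mantissa) would be infinity. Here it decodes to 256, and the largest finite value, 448, sits at 0x7E.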
import "github.com/zerfoo/float8"
a := float8.FromFloat32(3.14)
b := float8.FromFloat32(2.71)
sum := a.Add(b)
product := a.Mul(b)
fmt.Printf("a + b = %f\n", sum.ToFloat32())
fmt.Printf("a * b = %f\n", product.ToFloat32())
Fast Mode#
For performance-critical paths, enable lookup-table-based arithmetic:
float8.EnableFastArithmetic()
float8.EnableFastConversion()
This trades memory for speed by using pre-computed tables.
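The memory-for-speed trade can be illustrated with a 256-entry conversion table: every possible FP8 byte is decoded once up front, and later conversions become a single array lookup. This is a sketch of the general technique using only the standard library. How the float8 package actually lays out its tables is not stated here, so the names `decodeSlow`, `decodeTable`, and `decodeFast` are assumptions for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// decodeSlow (hypothetical) decodes an E4M3FN byte arithmetically:
// 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
func decodeSlow(b uint8) float32 {
	sign := 1.0
	if b&0x80 != 0 {
		sign = -1.0
	}
	exp := int(b>>3) & 0x0F
	man := int(b & 0x07)
	switch {
	case exp == 0x0F && man == 0x07:
		return float32(math.NaN())
	case exp == 0:
		return float32(sign * math.Ldexp(float64(man)/8, -6))
	default:
		return float32(sign * math.Ldexp(1+float64(man)/8, exp-7))
	}
}

// decodeTable holds the float32 value of every FP8 byte.
// Cost: 256 entries * 4 bytes = 1 KiB.
var decodeTable = buildDecodeTable()

func buildDecodeTable() [256]float32 {
	var t [256]float32
	for i := range t {
		t[i] = decodeSlow(uint8(i))
	}
	return t
}

// decodeFast (hypothetical) converts src into dst with one lookup per element.
// dst must be at least as long as src.
func decodeFast(src []uint8, dst []float32) {
	for i, b := range src {
		dst[i] = decodeTable[b]
	}
}

func main() {
	src := []uint8{0x38, 0x7E, 0xB8}
	dst := make([]float32, len(src))
	decodeFast(src, dst)
	fmt.Println(dst)
}
```

The same idea extends to arithmetic: a table indexed by two FP8 operands needs 256 x 256 one-byte results, or 64 KiB per operation, which is why table-based modes cost memory.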
When to Use Each Type#
| Scenario | Recommended Type |
|---|---|
| Model inference weights | Float16 or BFloat16 |
| Training (mixed precision) | BFloat16 (matches float32 range) |
| Quantized inference (Q8) | Float8 E4M3FN |
| CUDA kernel intermediate values | Float16 |
| Memory-constrained deployment | Float8 |
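For the memory-constrained row, the saving is simple arithmetic: storage scales with bits per parameter. The sketch below compares weight storage at 32, 16, and 8 bits for a hypothetical 7-billion-parameter model. The parameter count and the helper `weightBytes` are illustrative assumptions, not figures from Zerfoo.

```go
package main

import "fmt"

// weightBytes (hypothetical) returns the bytes needed to store params values
// at the given width in bits, ignoring any per-tensor metadata or scales.
func weightBytes(params, bits int64) int64 {
	return params * bits / 8
}

func main() {
	const params = 7_000_000_000 // hypothetical model size

	for _, f := range []struct {
		name string
		bits int64
	}{
		{"float32", 32},
		{"Float16 / BFloat16", 16},
		{"Float8", 8},
	} {
		gib := float64(weightBytes(params, f.bits)) / float64(1<<30)
		fmt.Printf("%-20s %6.1f GiB\n", f.name, gib)
	}
}
```

Halving the bit width halves the weight footprint, so Float8 needs a quarter of the float32 storage.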
Integration with ztensor#
These types are first-class citizens in ztensor. Create tensors of any numeric type:
import (
	"github.com/zerfoo/float16"
	"github.com/zerfoo/ztensor/tensor"
	"github.com/zerfoo/ztensor/compute"
	"github.com/zerfoo/ztensor/numeric"
)
// Float16 tensor
a, _ := tensor.New[float16.Float16]([]int{2, 3}, data)
eng := compute.NewCPUEngine[float16.Float16](numeric.Float16Ops{})
The compute engine handles dequantization automatically when mixing precision levels.