PR #2112

open

Non-record: in-DRAM compute with quantized ternary + int4

val_bpb

1.2268

Architecture

Transformer

Optimizer

—

Artifact Size

—

Training Techniques

Quantization

STE QAT

bits: null

scope: Q/K/V/MLP weights and K/V activations
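
A minimal sketch of the straight-through estimator (STE) idea behind this entry, written in plain Python rather than the PR's training code. The scalar setup, the fixed threshold, and the names `ternary` and `ste_step` are assumptions made for illustration only. Rounding has zero gradient almost everywhere, so the backward pass treats the quantizer as the identity and updates the latent full-precision weight directly.

```python
def ternary(w, threshold=0.5):
    """Map a scalar to {-1, 0, 1}; the fixed threshold is illustrative."""
    if w > threshold:
        return 1
    if w < -threshold:
        return -1
    return 0


def ste_step(w, x, y, lr=0.1):
    """One SGD step on L = (ternary(w) * x - y) ** 2 using an STE."""
    q = ternary(w)                  # forward pass sees the quantized weight
    grad_q = 2 * (q * x - y) * x    # exact dL/dq
    # Straight-through estimator: pretend dq/dw == 1, so the gradient
    # w.r.t. q is applied directly to the latent full-precision weight.
    return w - lr * grad_q


w = 0.2  # latent weight that initially quantizes to 0
for _ in range(5):
    w = ste_step(w, x=1.0, y=1.0)
```

The latent weight drifts upward until it crosses the threshold and quantizes to 1, after which the loss and the gradient are both zero.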

ternary + offset

bits: 2

scope: Q/K/V/MLP weights

int4

bits: 4

scope: Key and Value activations
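
A rough sketch of int4 quantization with a per-block offset and scale, as the K/V entry describes. The min/max (asymmetric) calibration and the function names are assumptions. The PR summary states neither how the offset and scale are chosen nor the activation block size, so the functions below operate on a single block of any length.

```python
def quantize_int4(block):
    """Return (codes, scale, offset) with codes in 0..15 for one block."""
    lo, hi = min(block), max(block)
    scale = (hi - lo) / 15  # 16 levels span the block's range
    if scale == 0.0:
        return [0] * len(block), 1.0, lo  # constant block: any scale works
    codes = [max(0, min(15, round((x - lo) / scale))) for x in block]
    return codes, scale, lo


def dequantize_int4(codes, scale, offset):
    """Reconstruct approximate values from int4 codes."""
    return [offset + scale * q for q in codes]


keys = [-1.0, -0.25, 0.0, 0.6, 2.0]
codes, scale, offset = quantize_int4(keys)
restored = dequantize_int4(codes, scale, offset)
```

With this calibration the smallest value in a block maps to code 0, the largest to code 15, and the reconstruction error per element is at most half a quantization step.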

Architecture

weight tying

Uses tied embeddings.

parameters: null

GQA

Uses grouped-query attention with fewer KV heads than attention heads.

parameters: {"heads":8,"kv_heads":4}
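
For illustration, the head sharing implied by these parameters can be sketched as below. Grouping consecutive query heads onto one KV head is the common convention and is an assumption here; the PR summary does not state the grouping order. `kv_head_for` is a hypothetical helper name.

```python
HEADS = 8      # attention (query) heads, from the PR parameters
KV_HEADS = 4   # key/value heads, from the PR parameters


def kv_head_for(query_head, heads=HEADS, kv_heads=KV_HEADS):
    """Return the KV head index shared by the given query head."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # query heads per KV head
    return query_head // group_size


mapping = [kv_head_for(h) for h in range(HEADS)]
```

Here each KV head serves two query heads, which halves the number of key and value projections and cached K/V tensors relative to full multi-head attention.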

depth

Widens and deepens the model for ternary packing efficiency.

parameters: {"layers":10,"model_dim":1024,"mlp_mult":2}

Other

other

Custom BitLinear module for blockwise ternary quantization with offsets in 128-element blocks.

parameters: {"block_size":128}
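
The block scheme can be sketched in plain Python as follows. This is not the PR's BitLinear code: taking the block mean as the offset and the mean absolute deviation as the scale are assumptions borrowed from common absmean-style ternary quantizers, and all function names are hypothetical. Only the 128-element block size comes from the summary.

```python
BLOCK_SIZE = 128  # block size stated in the PR summary


def quantize_block(block):
    """Return (codes, scale, offset) with codes in {-1, 0, 1}.

    Assumed scheme (not confirmed by the PR): offset = block mean,
    scale = mean absolute deviation from that offset.
    """
    offset = sum(block) / len(block)
    centered = [w - offset for w in block]
    scale = sum(abs(c) for c in centered) / len(centered)
    if scale == 0.0:
        return [0] * len(block), 0.0, offset  # constant block
    codes = [max(-1, min(1, round(c / scale))) for c in centered]
    return codes, scale, offset


def dequantize_block(codes, scale, offset):
    """Reconstruct approximate weights from ternary codes."""
    return [offset + scale * q for q in codes]


def quantize_weights(weights, block_size=BLOCK_SIZE):
    """Quantize a flat weight list block by block."""
    return [
        quantize_block(weights[i:i + block_size])
        for i in range(0, len(weights), block_size)
    ]


# Toy weights cycling through 0.4, 0.5, 0.6.
weights = [0.5 + 0.1 * ((i % 3) - 1) for i in range(256)]
blocks = quantize_weights(weights)
```

The offset lets a block whose weights are not centered on zero still use all three ternary levels. Without it, the toy weights above would all quantize to the same code.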

Optimizes a GPT-like Transformer for in-DRAM inference on unmodified DDR4 using RowCopy and MAJ3 primitives.
Applies ternary-plus-offset blockwise quantization to Q/K/V/MLP weights.
Quantizes key and value activations to int4 with per-block offsets and scales.
Uses STE-based quantization-aware training.
Scales the model to 10 layers and a hidden dimension of 1024 while keeping attention and FFN in 4-bit arithmetic.
Targets lower energy and latency for in-DRAM matrix-vector operations.
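
As a rough illustration of the two primitives named above, the sketch below simulates DRAM rows as Python integers. MAJ3 is a bitwise majority over three rows: with a control row of all zeros it reduces to AND, and with all ones it reduces to OR. In-DRAM majority is commonly described as destructive (all three activated rows end up holding the result), so operands are first copied into scratch rows with RowCopy. The row width, row names, and helper functions are hypothetical, and how the PR maps ternary and int4 arithmetic onto these primitives is not stated in the summary.

```python
ROW_BITS = 16                    # hypothetical tiny row width
ALL_ONES = (1 << ROW_BITS) - 1


def row_copy(rows, src, dst):
    """RowCopy: copy the contents of one row into another."""
    rows[dst] = rows[src]


def maj3(rows, i, j, k):
    """MAJ3: bitwise majority of three rows, written back to all three."""
    a, b, c = rows[i], rows[j], rows[k]
    result = (a & b) | (b & c) | (a & c)
    rows[i] = rows[j] = rows[k] = result  # destructive, as assumed above
    return result


def bulk_op(rows, src_a, src_b, control):
    """AND (control = 0) or OR (control = ALL_ONES) of two rows.

    Works on copies so the source rows survive the destructive MAJ3.
    Real hardware would fill the control row by RowCopy from a reserved
    constant row; here it is simply assigned.
    """
    row_copy(rows, src_a, "T0")
    row_copy(rows, src_b, "T1")
    rows["T2"] = control
    return maj3(rows, "T0", "T1", "T2")


a = 0b1100110011001100
b = 0b1010101010101010
rows = {"A": a, "B": b, "T0": 0, "T1": 0, "T2": 0}
and_ab = bulk_op(rows, "A", "B", 0)
or_ab = bulk_op(rows, "A", "B", ALL_ONES)
```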