val_bpb
1.2268
Architecture
Transformer
Optimizer
—
Artifact Size
—
Training Techniques
Quantization
STE QAT
bits: null
scope: Q/K/V/MLP weights and K/V activations
ternary + offset
bits: 2
scope: Q/K/V/MLP weights
int4
bits: 4
scope: Key and Value activations
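As a rough illustration of the int4 entry above, the sketch below quantizes a key or value vector to unsigned 4-bit codes using one offset and one scale per block. The function names, the min/max calibration, and the block size of 128 for activations are assumptions made for illustration; the card states only that K/V activations use int4.

```python
def quantize_int4_block(values):
    """Quantize one block of floats to unsigned 4-bit codes in 0..15."""
    lo, hi = min(values), max(values)
    # Guard against a constant block, where the value range is zero.
    scale = (hi - lo) / 15 if hi > lo else 1.0
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo  # lo acts as the per-block offset


def dequantize_int4_block(codes, scale, offset):
    """Map 4-bit codes back to approximate float values."""
    return [c * scale + offset for c in codes]


def quantize_int4(vector, block_size=128):
    """Split a vector into blocks (block_size is an assumption) and quantize each."""
    return [
        quantize_int4_block(vector[start:start + block_size])
        for start in range(0, len(vector), block_size)
    ]
```

With min/max calibration, the reconstruction error of each element is bounded by half of that block's scale.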
Architecture
weight tying
Uses tied embeddings.
parameters: null
GQA
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
depth
Widens and deepens the model for ternary packing efficiency.
parameters: {"layers":10,"model_dim":1024,"mlp_mult":2}
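To make the GQA parameters above concrete, the sketch below shows how 8 query heads could share 4 key/value heads and what projection shapes follow for a model dimension of 1024. The contiguous grouping of query heads and the head dimension of model_dim / heads are common conventions assumed here, not details confirmed by the card.

```python
HEADS = 8         # from the GQA entry
KV_HEADS = 4      # from the GQA entry
MODEL_DIM = 1024  # from the depth entry


def kv_head_for(query_head, heads=HEADS, kv_heads=KV_HEADS):
    """Return the KV head index used by a query head (contiguous grouping assumed)."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads
    return query_head // group_size


def projection_shapes(model_dim=MODEL_DIM, heads=HEADS, kv_heads=KV_HEADS):
    """Return (in_features, out_features) for the Q, K and V projections."""
    head_dim = model_dim // heads  # assumption: head_dim = model_dim / heads
    return {
        "q": (model_dim, heads * head_dim),
        "k": (model_dim, kv_heads * head_dim),
        "v": (model_dim, kv_heads * head_dim),
    }
```

Under these assumptions each pair of query heads reads the same key/value head, and the K and V projections are half the width of the Q projection.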
Other
other
Custom BitLinear module that applies ternary quantization with a per-block offset over 128-element blocks.
parameters: {"block_size":128}
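A minimal sketch of blockwise ternary-plus-offset quantization, under stated assumptions: each 128-element block is approximated as scale * t + offset with t in {-1, 0, 1}, the offset is taken as the block mean, and the scale as the mean absolute deviation from it. These calibration choices and all function names are hypothetical; the card does not describe how BitLinear computes its offsets or scales. The packing helper shows one way ternary codes fit the stated 2 bits per weight.

```python
def ternary_quantize_block(weights, eps=1e-8):
    """Quantize one block to codes in {-1, 0, 1} plus a scale and an offset."""
    offset = sum(weights) / len(weights)  # assumption: offset = block mean
    centered = [w - offset for w in weights]
    # Assumption: scale = mean absolute deviation (eps avoids division by zero).
    scale = sum(abs(c) for c in centered) / len(centered) + eps
    codes = [max(-1, min(1, round(c / scale))) for c in centered]
    return codes, scale, offset


def ternary_dequantize_block(codes, scale, offset):
    """Reconstruct approximate weights from ternary codes."""
    return [t * scale + offset for t in codes]


def quantize_row(row, block_size=128):
    """Quantize a weight row block by block (block_size from the card)."""
    return [
        ternary_quantize_block(row[i:i + block_size])
        for i in range(0, len(row), block_size)
    ]


def pack_ternary(codes):
    """Pack ternary codes four per byte, storing t + 1 in two bits each."""
    packed = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, t in enumerate(codes[i:i + 4]):
            byte |= (t + 1) << (2 * j)
        packed.append(byte)
    return bytes(packed)
```

Packed this way, a 128-element block occupies 32 bytes of codes, plus whatever storage the per-block scale and offset require.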
Novel Contributions
- Optimizes a GPT-like Transformer for in-DRAM inference on unmodified DDR4 using RowCopy and MAJ3 primitives.
- Applies ternary-plus-offset blockwise quantization to Q/K/V/MLP weights.
- Quantizes key and value activations to int4 with per-block offsets and scales.
- Uses STE-based quantization-aware training.
- Scales the model to 10 layers and a hidden dimension of 1024 while keeping attention and FFN in 4-bit arithmetic.
- Targets lower energy and latency for in-DRAM matrix-vector operations.
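The MAJ3 primitive named in the list above is generally understood as a bitwise three-input majority across DRAM rows. The sketch below models rows as Python integers to show how majority yields AND and OR when the third row is held at all zeros or all ones. It is a software illustration only: the function names and the row width are assumptions, and how the submission maps matrix-vector products onto RowCopy and MAJ3 is not described in this card.

```python
def maj3(a, b, c):
    """Bitwise majority of three rows: each output bit is 1 if at least two inputs are 1."""
    return (a & b) | (a & c) | (b & c)


def row_and(a, b):
    """AND via majority with a third row of all zeros."""
    return maj3(a, b, 0)


def row_or(a, b, width):
    """OR via majority with a third row of all ones (width in bits is an assumption)."""
    all_ones = (1 << width) - 1
    return maj3(a, b, all_ones)


def row_copy(rows, src, dst):
    """Model RowCopy as copying one row's contents into another row."""
    rows[dst] = rows[src]
    return rows
```

For example, `row_and(0b1100, 0b1010)` gives `0b1000` and `row_or(0b1100, 0b1010, 4)` gives `0b1110`.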