PR #2112

open

Non record: In DRAM compute quantized ternary + int4

by fingoldin
val_bpb
1.2268
Architecture
Transformer
Optimizer
Artifact Size

Training Techniques

Quantization
STE QAT
bits: null
scope: Q/K/V/MLP weights and K/V activations
ternary + offset
bits: 2
scope: Q/K/V/MLP weights
int4
bits: 4
scope: Key and Value activations
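A minimal sketch of what int4 quantization with a per-block offset and scale could look like for one block of key or value activations. The PR summary does not give the exact scheme, so the min/max (asymmetric) mapping and the function names `quantize_int4_block` and `dequantize_int4_block` are assumptions made for illustration, and the activation block size is left to the caller because it is not stated.

```python
def quantize_int4_block(block):
    """Map one block of floats to 4-bit codes in [0, 15].

    Assumed scheme (not confirmed by the PR): offset = block minimum,
    scale = (max - min) / 15, code = round((x - offset) / scale).
    """
    lo, hi = min(block), max(block)
    scale = (hi - lo) / 15.0
    if scale == 0.0:
        # Constant block: every value equals the offset.
        return [0] * len(block), 0.0, lo
    codes = [max(0, min(15, round((x - lo) / scale))) for x in block]
    return codes, scale, lo


def dequantize_int4_block(codes, scale, offset):
    """Reconstruct approximate floats from 4-bit codes."""
    return [offset + scale * q for q in codes]


# Illustrative values only.
codes, scale, offset = quantize_int4_block([0.0, 15.0, 7.0, 3.0])
restored = dequantize_int4_block(codes, scale, offset)
```

Under this assumed mapping, the reconstruction error of any element is at most half of the block's scale, which is why the scale and offset are chosen per block rather than per tensor.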
Architecture
weight tying
Uses tied embeddings.
parameters: null
GQA
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
depth
Widens and deepens the model for ternary packing efficiency.
parameters: {"layers":10,"model_dim":1024,"mlp_mult":2}
Other
other
Custom BitLinear module for blockwise ternary quantization with offsets in 128-element blocks.
parameters: {"block_size":128}
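The blockwise ternary-plus-offset idea can be sketched as follows. Only the 128-element block size comes from the summary above; the choice of the block mean as the offset, the mean absolute deviation as the scale, and the names `quantize_block_ternary`, `dequantize_block`, and `quantize_ternary` are assumptions for illustration, not the PR's BitLinear code.

```python
def quantize_block_ternary(block):
    """Approximate one block as offset + scale * t with t in {-1, 0, 1}.

    Assumed scheme (not confirmed by the PR): offset = block mean,
    scale = mean absolute deviation from the offset.
    """
    n = len(block)
    offset = sum(block) / n
    centered = [w - offset for w in block]
    scale = sum(abs(c) for c in centered) / n
    if scale == 0.0:
        # Constant block: the offset alone reproduces it exactly.
        return [0] * n, 0.0, offset
    codes = [max(-1, min(1, round(c / scale))) for c in centered]
    return codes, scale, offset


def dequantize_block(codes, scale, offset):
    """Reconstruct approximate weights from ternary codes."""
    return [offset + scale * t for t in codes]


def quantize_ternary(weights, block_size=128):
    """Quantize a flat weight list block by block (last block may be short)."""
    return [
        quantize_block_ternary(weights[start:start + block_size])
        for start in range(0, len(weights), block_size)
    ]
```

Each block then stores one ternary code per weight plus a single scale and offset, so the per-block metadata is amortized over 128 weights.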

Novel Contributions

  • Optimizes a GPT-like Transformer for in-DRAM inference on unmodified DDR4 using RowCopy and MAJ3 primitives.
  • Applies ternary-plus-offset blockwise quantization to Q/K/V/MLP weights.
  • Quantizes key and value activations to int4 with per-block offsets and scales.
  • Uses STE-based quantization-aware training.
  • Scales the model to 10 layers and a hidden dimension of 1024 while keeping attention and FFN in 4-bit arithmetic.
  • Targets lower energy and latency for in-DRAM matrix-vector operations.
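For readers unfamiliar with the MAJ3 primitive named in the first bullet, here is a small software model of it. MAJ3 is a bitwise three-input majority over whole rows; fixing the third row to all zeros or all ones turns it into AND or OR. Rows are modeled as Python integers, and the helper names `maj3`, `row_and`, and `row_or` are hypothetical. How the PR composes MAJ3 and RowCopy into matrix-vector operations is not described in this summary, so this shows only the primitive itself.

```python
def maj3(a, b, c):
    """Bitwise majority of three equal-width bit rows stored as ints."""
    return (a & b) | (b & c) | (a & c)


def row_and(a, b):
    """AND of two rows: majority with an all-zeros control row."""
    return maj3(a, b, 0)


def row_or(a, b, width):
    """OR of two rows: majority with an all-ones control row."""
    ones = (1 << width) - 1
    return maj3(a, b, ones)


# Illustrative 4-bit rows.
a, b, c = 0b1100, 0b1010, 0b0110
majority = maj3(a, b, c)       # 0b1110
both = row_and(a, b)           # 0b1000
either = row_or(a, b, 4)       # 0b1110
```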