PR #1852

open

Record: Pre-Quant TTT + Void Compass — val_bpb 1.0282 (3-seed mean)

by G3sparky · View on GitHub
val_bpb: 1.0282
Architecture: Transformer
Optimizer: AdamW
Artifact Size: < 16 MB

Training Techniques

Test-Time Training
full TTT
parameters: {"epochs":21,"learning_rate":"5e-4 to 5e-5 cosine","timing":"pre-quantization"}
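
A minimal sketch of what the pre-quantization TTT pass could look like, using the 21 epochs and the 5e-4 to 5e-5 cosine schedule recorded above; `model`, `ttt_loader`, and the device handling are hypothetical placeholders, and the AdamW weight decay is borrowed from the Optimizer entry below.

```python
import math
import torch
import torch.nn.functional as F

def pre_quant_ttt(model, ttt_loader, epochs=21, start_lr=5e-4, min_lr=5e-5, device="cuda"):
    """Full test-time training, run *before* GPTQ quantization."""
    opt = torch.optim.AdamW(model.parameters(), lr=start_lr, weight_decay=0.095)
    total_steps = max(epochs * len(ttt_loader), 1)
    step = 0
    model.train()
    for _ in range(epochs):
        for tokens, targets in ttt_loader:
            # Cosine decay from start_lr down to min_lr over all TTT steps.
            frac = step / max(total_steps - 1, 1)
            lr = min_lr + 0.5 * (start_lr - min_lr) * (1 + math.cos(math.pi * frac))
            for group in opt.param_groups:
                group["lr"] = lr
            logits = model(tokens.to(device))
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                   targets.to(device).view(-1))
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
            step += 1
    return model  # GPTQ quantization happens only after this loop finishes
```
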
Architecture
depth recurrence
Layers 3-5 are looped during the forward pass once recurrence is activated (see the sketch after this section).
parameters: {"layers":[3,5],"num_loops":2,"activated_at_frac":0.35}
weight tying
Tied input and output embeddings.
parameters: null
Partial RoPE
Uses rotary position embeddings on a subset of dimensions.
parameters: {"dimensions":"16/64"}
LeakyReLU
Leaky ReLU activation used in the MLP.
parameters: {"slope":0.5}
XSA
XSA applied across all layers.
parameters: null
KV head count
Grouped-query attention with 4 KV heads.
parameters: {"kv_heads":4}
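
Taken together, a minimal sketch of how these architecture entries could combine in a decoder stack: grouped-query attention with 4 KV heads, RoPE on 16 of the 64 head dimensions, a LeakyReLU(0.5) MLP, tied input/output embeddings, and depth recurrence over layers 3-5 (reading num_loops = 2 as two total passes and activated_at_frac = 0.35 as switching on at 35% of training). Hidden width, head count, layer count, and 0-based layer indexing are assumptions, and XSA is left out because the record does not define it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Grouped-query attention: n_heads query heads share kv_heads KV heads.
    RoPE is applied to only the first rope_dims of each 64-dim head."""
    def __init__(self, d_model=512, n_heads=8, kv_heads=4, head_dim=64, rope_dims=16):
        super().__init__()
        self.n_heads, self.kv_heads = n_heads, kv_heads
        self.head_dim, self.rope_dims = head_dim, rope_dims
        self.q_proj = nn.Linear(d_model, n_heads * head_dim, bias=False)
        self.kv_proj = nn.Linear(d_model, 2 * kv_heads * head_dim, bias=False)
        self.out_proj = nn.Linear(n_heads * head_dim, d_model, bias=False)

    def _rope(self, x):
        # Rotary embedding on the first rope_dims dims of each head only (partial RoPE).
        t, r = x.shape[-2], self.rope_dims
        pos = torch.arange(t, device=x.device, dtype=x.dtype)
        inv_freq = 1.0 / (10000 ** (torch.arange(0, r, 2, device=x.device, dtype=x.dtype) / r))
        ang = pos[:, None] * inv_freq[None, :]                      # (t, r/2)
        cos, sin = ang.cos(), ang.sin()
        x_rot, x_pass = x[..., :r], x[..., r:]
        x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
        rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)
        return torch.cat((rotated, x_pass), dim=-1)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        kv = self.kv_proj(x).view(b, t, 2, self.kv_heads, self.head_dim)
        k, v = kv[:, :, 0].transpose(1, 2), kv[:, :, 1].transpose(1, 2)
        q, k = self._rope(q), self._rope(k)
        # Repeat the 4 KV heads to match the query head count (GQA).
        rep = self.n_heads // self.kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(y.transpose(1, 2).reshape(b, t, -1))

class Block(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model, bias=False),
            nn.LeakyReLU(negative_slope=0.5),                 # LeakyReLU MLP, slope 0.5
            nn.Linear(4 * d_model, d_model, bias=False),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        return x + self.mlp(self.ln2(x))

class RecurrentTransformer(nn.Module):
    """Depth recurrence: blocks 3-5 are run num_loops times per forward pass
    once recurrence is switched on (per the record, at 35% of training)."""
    def __init__(self, vocab_size, d_model=512, n_layers=8, loop_span=(3, 5), num_loops=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList([Block(d_model) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.head.weight = self.embed.weight                  # weight tying
        self.loop_span, self.num_loops = loop_span, num_loops
        self.recurrence_on = False                            # set True at 0.35 * total_steps

    def forward(self, tokens):
        x = self.embed(tokens)
        lo, hi = self.loop_span
        for i, block in enumerate(self.blocks):
            reps = self.num_loops if (self.recurrence_on and lo <= i <= hi) else 1
            for _ in range(reps):
                x = block(x)
        return self.head(self.ln_f(x))                        # softcap applied downstream
```
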
Regularization
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30}
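
The logit softcap entry records only the cap value; the sketch below assumes the common tanh formulation, applied to the final logits.

```python
import torch

def softcap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Smoothly bounds logits to (-cap, cap) while staying near-linear around zero.
    return cap * torch.tanh(logits / cap)
```
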
Optimizer
AdamW
weight_decay: 0.095
momentum: null
other_params: {"ema_decay":0.9965}
Weight Averaging
EMA
parameters: {"decay":0.9965}
LR Schedule
cosine decay
parameters: {"start_lr":0.0005,"min_lr":0.00005}
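
A sketch tying together the AdamW settings, the EMA weight average, and the cosine schedule listed above; the hyperparameters come from the record, everything else is a placeholder.

```python
import torch

def make_optimizer_and_schedule(model, total_steps, start_lr=5e-4, min_lr=5e-5):
    opt = torch.optim.AdamW(model.parameters(), lr=start_lr, weight_decay=0.095)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps, eta_min=min_lr)
    return opt, sched

class EMA:
    """Exponential moving average of the model weights, decay 0.9965."""
    def __init__(self, model, decay=0.9965):
        self.decay = decay
        self.shadow = {k: v.detach().clone() for k, v in model.state_dict().items()
                       if v.dtype.is_floating_point}

    @torch.no_grad()
    def update(self, model):
        for k, v in model.state_dict().items():
            if k in self.shadow:
                self.shadow[k].mul_(self.decay).add_(v.detach(), alpha=1.0 - self.decay)

# Per training step: opt.step(); sched.step(); ema.update(model)
```
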
Compression
lzma
level: null
brotli
level: 11
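
Both compressors are standard; a sketch of the packing step, with the Brotli-11 vs LZMA split taken from the Novel Contributions list below (Brotli for the model, LZMA for the code wrapper) and all names illustrative.

```python
import lzma
import brotli  # pip install Brotli

def compress_artifact(model_bytes: bytes, code_bytes: bytes):
    # Brotli at its maximum quality (11) for the quantized weights...
    packed_model = brotli.compress(model_bytes, quality=11)
    # ...and LZMA (level not recorded; highest preset shown here) for the code wrapper.
    packed_code = lzma.compress(code_bytes, preset=9 | lzma.PRESET_EXTREME)
    return packed_model, packed_code
```
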
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: token embeddings
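
The record specifies GPTQ with a 6-bit/8-bit split by tensor type. The sketch below is not GPTQ itself (GPTQ quantizes weight columns one at a time and compensates the rounding error using second-order statistics of the layer inputs); it is a plain round-to-nearest stand-in that only illustrates the per-tensor bit assignment, and the name-matching rule is hypothetical.

```python
import torch

def rtn_quantize(weight: torch.Tensor, bits: int):
    """Per-row symmetric round-to-nearest quantization of a 2-D weight matrix,
    used here only as a stand-in for the actual GPTQ procedure."""
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale  # dequantize with q.float() * scale

def bits_for(param_name: str) -> int:
    # Bit split from the record: 8 bits for token embeddings,
    # 6 bits for attention / MLP matrices (the name rule is an assumption).
    return 8 if "embed" in param_name else 6
```
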
Other
other
Void fraction monitoring used as a training diagnostic during pre-quantization TTT to detect memorization.
parameters: {"stable_void_fraction":0.58}
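
The record does not spell out how the void fraction is computed, only that it holds near 0.58 during pre-quantization TTT. The sketch below uses an assumed definition, the share of vocabulary entries receiving near-zero probability averaged over positions, purely to show where such a diagnostic could sit in the TTT loop.

```python
import torch

def void_fraction(logits: torch.Tensor, eps: float = 1e-4) -> float:
    """ASSUMED definition: fraction of vocabulary entries with probability < eps,
    averaged over every position in the batch."""
    probs = torch.softmax(logits.float(), dim=-1)
    return (probs < eps).float().mean().item()

# Diagnostic use per the record: the value stays near 0.58 during TTT; drift away
# from that plateau is (presumably) what gets read as a memorization signal.
```
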

Novel Contributions

  • Pre-quantization test-time training before GPTQ
  • Void fraction compass diagnostic for memorization detection
  • LZMA-compressed self-extracting code wrapper to fit the size budget (sketched after this list)
  • Brotli-11 model compression
  • 8-GPU synchronous gradient averaging during TTT (sketched after this list)
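
A minimal sketch of the LZMA self-extracting code wrapper mentioned above, using only the standard library; file names and layout are illustrative, not the submission's actual packaging.

```python
# build_wrapper.py
import base64
import lzma
from pathlib import Path

STUB = '''import base64, lzma
exec(lzma.decompress(base64.b85decode({payload!r})).decode())
'''

def build(source_path: str, out_path: str = "run_selfextract.py") -> None:
    # Embed the LZMA-compressed source in a tiny stub that decompresses and
    # exec()s it at run time; the stub plus the compressed payload is what ships,
    # which is smaller than the raw source and helps stay under the 16 MB budget.
    raw = Path(source_path).read_bytes()
    payload = base64.b85encode(lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)).decode()
    Path(out_path).write_text(STUB.format(payload=payload))

if __name__ == "__main__":
    build("train_and_eval.py")  # hypothetical entry-point name
```

And a sketch of the 8-GPU synchronous gradient averaging used during TTT, assuming a standard torch.distributed setup initialized with world_size = 8.

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """All-reduce every gradient across ranks, then divide by the world size,
    so all 8 workers take the same synchronous TTT step."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)

# Called between loss.backward() and optimizer.step() on every rank.
```
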