PR #1399

open

Record: Pre-Quant TTT + ETLB (Eval-Time Logit Bias) for Neural Language Model Compression, 1.0898 BPB on PR #1285 base

by AnubhavBharadwaaj
val_bpb
1.0898
Architecture
Transformer
Optimizer
Muon
Artifact Size
16,084,685 bytes

Training Techniques

Architecture
depth recurrence
Repeats layers 4 and 5 to create virtual layers beyond the 11 physical layers.
parameters: {"layers":4}
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
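The GQA configuration above (8 query heads sharing 4 KV heads) can be sketched in plain NumPy. Shapes, weight layout, and the absence of a causal mask are illustrative assumptions, not the record's code:

```python
import numpy as np

def gqa(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    """Grouped-query attention sketch: 8 query heads share 4 KV heads.
    Causal masking omitted for brevity; layout is an assumption."""
    T, d = x.shape
    hd = d // n_heads                        # per-head dimension
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    group = n_heads // n_kv_heads            # 2 query heads per KV head
    k = np.repeat(k, group, axis=1)          # broadcast KV heads to query heads
    v = np.repeat(v, group, axis=1)
    att = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(hd)
    att = np.exp(att - att.max(-1, keepdims=True))
    att /= att.sum(-1, keepdims=True)        # softmax over keys
    return np.einsum('hqk,khd->qhd', att, v).reshape(T, d)
```

The memory saving comes from the K/V projections being half-width (4 heads instead of 8), shrinking the KV cache without changing the query-side capacity.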
Partial RoPE
Applies rotary position embeddings to a subset of dimensions.
parameters: {"dimensions":16}
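Partial RoPE as listed (rotation on 16 dimensions per head) can be sketched as follows; the split-halves layout within the rotated slice and the frequency base are conventional assumptions:

```python
import numpy as np

def partial_rope(q, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rot_dims` dims of each head;
    the remaining dims pass through unrotated. rot_dims=16 matches the
    record's parameters; everything else here is an assumption."""
    T, H, D = q.shape
    half = rot_dims // 2
    inv = base ** (-np.arange(half) / half)          # per-pair frequencies
    ang = np.arange(T)[:, None] * inv[None, :]       # (T, half) angles
    cos, sin = np.cos(ang)[:, None], np.sin(ang)[:, None]
    x1, x2 = q[..., :half], q[..., half:rot_dims]
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, q[..., rot_dims:]], axis=-1)
```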
XSA
Uses XSA attention across all layers.
parameters: {"layers":11}
LeakyReLU
Uses a squared LeakyReLU activation (LeakyReLU(x)²) in the MLP.
parameters: {"slope":0.5}
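A minimal sketch of the activation, assuming the "squared" form is a plain square of the LeakyReLU output (which discards the sign of the negative branch):

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """Squared LeakyReLU with slope 0.5 as listed above. The exact form
    (plain square vs. sign-preserving square) is an assumption."""
    y = np.where(x >= 0, x, slope * x)
    return y * y
```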
VE128
Uses a 128-dimensional value embedding in layers 9 and 10.
parameters: {"dimensions":128,"layers":[9,10]}
Optimizer
Muon
weight_decay: 0.09
momentum: null
other_params: {"variant":"MuonEq-R"}
Weight Averaging
EMA
parameters: {"decay":0.997}
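EMA weight averaging with the listed decay reduces to a one-line update; this is a generic sketch, as the record's actual averaging code is not shown:

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step: avg <- decay * avg + (1 - decay) * params.
    decay=0.997 comes from the record's parameters."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```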
Quantization
GPTQ
bits: 6
scope: all
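For intuition about what 6-bit quantization costs, here is a plain round-to-nearest per-row baseline. GPTQ itself goes further, compensating each column's rounding error using second-order (Hessian) information; that part is deliberately omitted, so this is a simpler stand-in, not GPTQ:

```python
import numpy as np

def quantize_rtn(w, bits=6):
    """Per-row min-max round-to-nearest quantization to 2**bits levels.
    A baseline for comparison only; GPTQ adds error compensation."""
    levels = 2 ** bits - 1                              # 63 codes for 6 bits
    lo = w.min(axis=1, keepdims=True)
    scale = (w.max(axis=1, keepdims=True) - lo) / levels
    scale = np.where(scale == 0, 1.0, scale)            # guard constant rows
    q = np.round((w - lo) / scale)                      # integer codes in [0, 63]
    return q * scale + lo                               # dequantized weights
```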
Compression
Brotli
level: 11
Evaluation
sliding window eval
parameters: {"stride":null}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.0005,"epochs":1,"freeze_blocks":9,"chunk_size":32768,"scope":"pre-quantization"}
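The "score-first" ordering matters: each chunk is scored with the current weights before the model trains on it, so no chunk is ever evaluated by a model that has already fit it. A sketch of the assumed loop structure (the scoring and training callables are hypothetical; only the hyperparameters above come from the record):

```python
def score_first_ttt(chunks, score_fn, train_fn):
    """Score-first test-time training loop (assumed structure).
    Per the record this runs pre-quantization with lr=5e-4, 1 epoch,
    9 frozen blocks, and 32768-token chunks."""
    total, n = 0.0, 0
    for chunk in chunks:
        total += score_fn(chunk) * len(chunk)   # 1) score with current weights
        n += len(chunk)
        train_fn(chunk)                         # 2) then adapt on the same chunk
    return total / n                            # token-weighted average loss
```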
Other
other
Eval-time optimization of a per-vocabulary logit bias on already-scored context tokens, with the bias warm-started (carried over) across sliding windows.
parameters: {"learning_rate":0.05,"steps":5,"clip":3,"vocab_size":4096}
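The ETLB mechanism can be sketched as follows. The gradient form (mean cross-entropy over context tokens with respect to a shared bias vector), the window data layout, and all function names are assumptions; only the hyperparameters (lr=0.05, steps=5, clip=3, vocab_size=4096) come from the record:

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def etlb_step(bias, logits, targets, lr=0.05, clip=3.0):
    """One ETLB update: gradient descent on mean cross-entropy of
    already-scored context tokens. d(mean CE)/d(bias) = mean probs - mean onehot."""
    p = softmax(logits + bias)                          # (T, V)
    onehot = np.zeros_like(p)
    onehot[np.arange(len(targets)), targets] = 1.0
    bias = bias - lr * (p - onehot).mean(axis=0)
    return np.clip(bias, -clip, clip)                   # clip=3 per the record

def etlb_eval(windows, vocab_size=4096, steps=5, lr=0.05, clip=3.0):
    """Warm-started bias carried across sliding windows: fit on the
    window's context, then score that window's new tokens with it."""
    bias = np.zeros(vocab_size)          # persists across windows (warm start)
    scores = []
    for ctx_logits, ctx_targets, new_logits, new_targets in windows:
        for _ in range(steps):           # steps=5 per the record
            bias = etlb_step(bias, ctx_logits, ctx_targets, lr, clip)
        p = softmax(new_logits + bias)
        nll = -np.log(p[np.arange(len(new_targets)), new_targets])
        scores.append(nll.mean())
    return float(np.mean(scores))
```

The design point is that the context tokens have already been charged for in the BPB total, so fitting a bias to them leaks no information about the tokens still to be scored.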

Novel Contributions

  • Test-time training applied before GPTQ quantization
  • Eval-Time Logit Bias (ETLB) for sliding window evaluation
  • Warm-started vocabulary bias optimization across windows
  • New best pure neural BPB on the 10-minute 16MB track