PR #1399
Pre-Quant TTT + ETLB: Eval-Time Logit Bias for Neural Language Model Compression (1.0898 BPB on PR #1285 base)
by AnubhavBharadwaaj
val_bpb
1.0898
Architecture
Transformer
Optimizer
Muon
Artifact Size
16,084,685 bytes
Training Techniques
Architecture
depth recurrence
Repeats layers 4 and 5 to create virtual layers beyond the 11 physical layers.
parameters: {"layers":4}
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
Partial RoPE
Applies rotary position embeddings to a subset of dimensions.
parameters: {"dimensions":16}
XSA
Uses XSA attention across all layers.
parameters: {"layers":11}
LeakyReLU
Uses LeakyReLU squared in the MLP.
parameters: {"slope":0.5}
VE128
Uses a 128-dimensional value embedding in layers 9 and 10.
parameters: {"dimensions":128,"layers":[9,10]}
Optimizer
Muon
weight_decay: 0.09
momentum: null
other_params: {"variant":"MuonEq-R"}
Weight Averaging
EMA
parameters: {"decay":0.997}
Quantization
GPTQ
bits: 6
scope: all
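A minimal single-layer GPTQ sketch at 6 bits (the record quantizes all layers); it uses per-row symmetric scales and omits group sizes and activation reordering.

```python
import torch

def gptq_quantize_layer(W, X, bits=6, damp=0.01):
    """W: (out, in) weight; X: (n_samples, in) calibration activations.
    Columns are quantized one at a time and the quantization error is folded
    into the not-yet-quantized columns via the Cholesky factor of the inverse
    Hessian. Per-row symmetric scales; group size and activation reordering
    are omitted for brevity."""
    out_dim, in_dim = W.shape
    H = 2.0 * (X.T @ X) / X.shape[0]                         # proxy Hessian of the layer loss
    H += damp * torch.diag(H).mean() * torch.eye(in_dim, dtype=W.dtype)
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    U = torch.linalg.cholesky(Hinv, upper=True)              # upper Cholesky factor of H^-1

    qmax = 2 ** (bits - 1) - 1                               # 31 for 6-bit symmetric
    scale = (W.abs().amax(dim=1) / qmax).clamp(min=1e-8)     # one scale per output row
    W = W.clone()
    Q = torch.zeros_like(W)
    for i in range(in_dim):
        w = W[:, i]
        q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
        Q[:, i] = q
        err = (w - q * scale) / U[i, i]
        W[:, i + 1:] -= err[:, None] * U[i, i + 1:][None, :] # compensate remaining columns
    return Q.to(torch.int8), scale
```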
Compression
Brotli
level: 11
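The serialized artifact is Brotli-compressed at the maximum quality level; a minimal sketch using the Python brotli bindings.

```python
import brotli

def pack_artifact(raw: bytes) -> bytes:
    """Compress the serialized (quantized) weights with Brotli at quality 11."""
    return brotli.compress(raw, quality=11)

def unpack_artifact(blob: bytes) -> bytes:
    return brotli.decompress(blob)
```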
Evaluation
sliding window eval
parameters: {"stride":null}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.0005,"epochs":1,"freeze_blocks":9,"chunk_size":32768,"scope":"pre-quantization"}
Other
ETLB (Eval-Time Logit Bias)
Optimizes a per-vocabulary logit bias at evaluation time on already-scored context tokens, with the warm-started bias carried across sliding windows.
parameters: {"learning_rate":0.05,"steps":5,"clip":3,"vocab_size":4096}
Novel Contributions
- Pre-quantization test-time training before GPTQ quantization
- Eval-Time Logit Bias (ETLB) for sliding window evaluation
- Warm-started vocabulary bias optimization across windows
- New best pure neural BPB on the 10-minute 16MB track