PR #1423

open

Record: SP8192 + Pre-Quant TTT + QK-Gain 5.0 + Depth Recurrence + MuonEq-R — val_bpb 1.0791 (3-seed mean)

by aryanbhosale
val_bpb
1.0791
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~15.12 MB

Training Techniques

Architecture
depth recurrence
Recurrent depth loop in the model stack.
parameters: {"loop":[4,5]}
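A minimal sketch of how a depth-recurrence loop could change the order in which blocks are applied. It assumes that "loop":[4,5] means blocks 4 and 5 are run more than once; the total block count and the repeat count are hypothetical, since the card states neither.

```python
def layer_schedule(num_layers, loop, repeats=2):
    """Return the order in which block indices are applied.

    Assumption: `loop` lists a contiguous range of block indices that
    is executed `repeats` times in a row, reusing the same weights.
    """
    start, end = loop[0], loop[-1]
    order = list(range(start))                # blocks before the loop
    for _ in range(repeats):
        order.extend(range(start, end + 1))   # looped blocks
    order.extend(range(end + 1, num_layers))  # blocks after the loop
    return order


# Hypothetical 8-block stack with blocks 4 and 5 run twice.
schedule = layer_schedule(num_layers=8, loop=[4, 5], repeats=2)
```

Under these assumptions the effective depth grows by two blocks while the parameter count stays the same, which is the usual appeal of depth recurrence when artifact size is constrained.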
U-Net skip connections
Sigmoid-gated U-Net style skip connections.
parameters: null
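One way to read "sigmoid-gated" is that the skip path is scaled by a learned gate squashed into (0, 1) before being added to the later activation. The sketch below uses plain Python lists and a single scalar gate; the function names and the scalar-gate form are assumptions, not details taken from the PR.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_skip(decoder_act, encoder_act, gate_logit):
    """Add an earlier (encoder-side) activation to a later one.

    Assumption: one scalar gate logit per skip connection; a real
    model might learn a per-channel gate instead.
    """
    g = sigmoid(gate_logit)  # always between 0 and 1
    return [d + g * e for d, e in zip(decoder_act, encoder_act)]


# A gate logit of 0 passes half of the skip signal through.
mixed = gated_skip([1.0, 2.0], [2.0, 4.0], gate_logit=0.0)
```

A strongly negative logit drives the gate toward 0 and effectively closes the skip path, so the model can learn how much of the earlier activation to reuse.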
MLP4x
Expanded MLP width to 4x.
parameters: {"multiplier":4}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R"}
Quantization
GPTQ
bits: null
scope: embeddings and model weights
Test-Time Training
full TTT
parameters: {"pre_quant":true,"epochs":6,"learning_rate":0.0005,"freeze_first_blocks":2}
Evaluation
sliding window eval
parameters: {"stride":64}
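A sketch of how sliding-window evaluation can be laid out so that every token is scored exactly once while still seeing a long left context. Only the stride comes from the card; the context length and the helper name are hypothetical.

```python
def sliding_windows(n_tokens, ctx_len, stride):
    """Return (begin, end, score_start) triples.

    Each window feeds tokens[begin:end] to the model, but only the
    tokens in [score_start, end) contribute to the loss, so no token
    is counted twice.
    """
    windows = []
    scored_upto = 0
    begin = 0
    while scored_upto < n_tokens:
        end = min(begin + ctx_len, n_tokens)
        windows.append((begin, end, scored_upto))
        scored_upto = end
        begin += stride
    return windows


# Tiny example; the card's setting would use stride=64 with a
# context length that is not stated there.
plan = sliding_windows(n_tokens=10, ctx_len=4, stride=2)
```

After the first window, each step scores only `stride` new tokens, so a smaller stride gives each scored token more context at the cost of more forward passes.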
LR Schedule
cosine decay
parameters: null
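For reference, the standard cosine decay formula is sketched below. The card lists no parameters for the schedule, so the base learning rate, the floor, the step count, and the absence of warmup are all assumptions.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical values for illustration only.
lr_start = cosine_lr(0, 1000, base_lr=1e-3)
lr_mid = cosine_lr(500, 1000, base_lr=1e-3)
lr_end = cosine_lr(1000, 1000, base_lr=1e-3)
```

The rate falls slowly at first, fastest around the midpoint, and flattens out again near the end of training.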
Compression
brotli
level: null

Novel Contributions

  • QK-Gain 5.0 applied to the SP8192 + pre-quant TTT stack
  • Pre-quantization test-time training baked into the artifact before GPTQ
  • Depth recurrence with loop 4,5
  • MuonEq-R optimizer variant
  • Sigmoid-gated U-Net skip connections
  • Record 3-seed mean val_bpb of 1.0791