PR #1423
open
Record: SP8192 + Pre-Quant TTT + QK-Gain 5.0 + Depth Recurrence + MuonEq-R — val_bpb 1.0791 (3-seed mean)
by aryanbhosale
val_bpb
1.0791
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~15.12 MB
Training Techniques
Architecture
depth recurrence
Recurrent depth loop in the model stack.
parameters: {"loop":[4,5]}
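The loop parameter above can be read as a block execution order. The following is a minimal sketch under one assumed interpretation: blocks 4 and 5 form a contiguous segment that is run more than once with shared weights. The function name `layer_schedule`, the block count of 8, and the repeat count of 2 are hypothetical and do not come from the PR.

```python
def layer_schedule(num_blocks, loop, repeats=2):
    """Return the order in which block indices are executed.

    Assumed reading of {"loop": [4, 5]}: the listed blocks are one
    contiguous segment executed `repeats` times in a row, reusing the
    same weights on every pass. `repeats` is a hypothetical knob.
    """
    loop = list(loop)
    if not loop:
        # No recurrence: every block runs exactly once.
        return list(range(num_blocks))
    if loop != list(range(loop[0], loop[-1] + 1)):
        raise ValueError("loop must be a contiguous run of block indices")
    if loop[0] < 0 or loop[-1] >= num_blocks:
        raise ValueError("loop indices must lie inside the block stack")
    before = list(range(0, loop[0]))
    after = list(range(loop[-1] + 1, num_blocks))
    return before + loop * repeats + after


# Hypothetical 8-block stack with two passes through blocks 4 and 5.
print(layer_schedule(8, [4, 5], repeats=2))
```

Under these assumptions the stack gains effective depth (10 block applications) without adding parameters, since only 8 distinct blocks exist.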
U-Net skip connections
Sigmoid-gated U-Net style skip connections.
parameters: null
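A sigmoid-gated skip connection can be sketched in a few lines. The PR summary gives no parameters for it, so the exact form here is an assumption: a single scalar gate logit per skip path, with the gated encoder activation added to the decoder activation. The names `gated_skip` and `gate_logit` are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_skip(decoder_x, encoder_skip, gate_logit):
    """Add an encoder activation to a decoder activation through a gate.

    Assumed form: out = decoder_x + sigmoid(gate_logit) * encoder_skip,
    with one learnable scalar logit per skip path.
    """
    if len(decoder_x) != len(encoder_skip):
        raise ValueError("activations must have the same length")
    g = sigmoid(gate_logit)
    return [d + g * s for d, s in zip(decoder_x, encoder_skip)]


# A logit of 0.0 gives a gate of 0.5, so half of the skip signal passes.
print(gated_skip([1.0, 2.0], [4.0, -2.0], gate_logit=0.0))
```

The gate lets training scale each skip path continuously between nearly closed (very negative logit) and nearly fully open (very positive logit) instead of fixing the mix in advance.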
MLP4x
Expanded MLP width to 4x.
parameters: {"multiplier":4}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R"}
Quantization
GPTQ
bits: null
scope: embeddings and model weights
Test-Time Training
full TTT
parameters: {"pre_quant":true,"epochs":6,"learning_rate":0.0005,"freeze_first_blocks":2}
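To illustrate the `freeze_first_blocks` setting, the sketch below splits parameter names into frozen and trainable groups before the test-time training epochs. The `blocks.<i>.` naming scheme, the example names, and the choice to leave non-block parameters trainable are all assumptions made for illustration; the PR summary does not describe them.

```python
import re

# Values copied from the parameters listed above.
TTT_CONFIG = {
    "pre_quant": True,
    "epochs": 6,
    "learning_rate": 0.0005,
    "freeze_first_blocks": 2,
}


def split_params(param_names, freeze_first_blocks):
    """Split parameter names into (frozen, trainable) lists.

    Assumes names look like "blocks.<index>.<rest>" (hypothetical).
    Names that do not match are treated as trainable, which is also
    an assumption.
    """
    pattern = re.compile(r"^blocks\.(\d+)\.")
    frozen, trainable = [], []
    for name in param_names:
        match = pattern.match(name)
        if match and int(match.group(1)) < freeze_first_blocks:
            frozen.append(name)
        else:
            trainable.append(name)
    return frozen, trainable


# Hypothetical parameter names for a small stack.
names = ["embed.weight"] + [f"blocks.{i}.mlp.weight" for i in range(4)]
frozen, trainable = split_params(names, TTT_CONFIG["freeze_first_blocks"])
print(frozen)
print(trainable)
```

Since `pre_quant` is true, updates of this kind would run on the full-precision weights, with GPTQ applied afterwards.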
Evaluation
sliding window eval
parameters: {"stride":64}
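Sliding window evaluation with a stride can be sketched as a window generator. In this assumed scheme, the first window scores all of its tokens and each later window scores only the tokens not yet covered, so every token is scored exactly once with as much left context as fits. Only the stride of 64 comes from the parameters above; the context length of 128 and the function name `sliding_windows` are hypothetical.

```python
def sliding_windows(n_tokens, context, stride):
    """Return (start, end, score_from) triples covering n_tokens.

    Each window feeds tokens[start:end] to the model, and only
    positions in [score_from, end) contribute to the loss.
    """
    if not 0 < stride <= context:
        raise ValueError("need 0 < stride <= context")
    if n_tokens <= 0:
        return []
    end = min(context, n_tokens)
    windows = [(0, end, 0)]  # first window scores everything it sees
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        start = max(0, new_end - context)
        windows.append((start, new_end, end))  # score only the new tokens
        end = new_end
    return windows


# Hypothetical 300-token sequence, 128-token context, stride 64.
for window in sliding_windows(300, context=128, stride=64):
    print(window)
```

A smaller stride gives each scored token more context on average at the cost of more forward passes.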
LR Schedule
cosine decay
parameters: null
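For reference, a standard cosine decay schedule looks like the sketch below. No schedule parameters are listed above, so the step count, base learning rate, and minimum learning rate used here are hypothetical, and the PR may combine the decay with warmup or other phases not shown.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so steps outside [0, total_steps] stay at the endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 1000 steps, base learning rate 1e-3.
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, total_steps=1000, base_lr=1e-3))
```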
Compression
brotli
level: null
Novel Contributions
- QK-Gain 5.0 applied to the SP8192 + pre-quant TTT stack
- Pre-quantization test-time training baked into the artifact before GPTQ
- Depth recurrence with loop 4,5
- MuonEq-R optimizer variant
- Sigmoid-gated U-Net skip connections
- Record 3-seed mean val_bpb of 1.0791