PR #1807
open
Record attempt: SP8192 + 3-Epoch Parallel Pre-Quant TTT + Huber WD Muon (SDPA-friendly) — val_bpb 1.07037 (3-seed mean)
by davie2009kh
val_bpb
1.0704
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,859,703 bytes
Training Techniques
Test-Time Training
full TTT
parameters: {"epochs":3,"pre_quant":true,"parallel":true}
LR Schedule
cosine decay
parameters: {"t_max":3,"eta_min":0.0001}
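The schedule parameters above can be read as a standard closed-form cosine decay over the three TTT epochs. The sketch below is an illustration only: it assumes one learning-rate value per epoch, and `base_lr` is a hypothetical starting rate that the summary does not state.

```python
import math


def cosine_lr(epoch, base_lr, t_max=3, eta_min=1e-4):
    """Closed-form cosine decay from base_lr (epoch 0) to eta_min (epoch t_max).

    base_lr is a hypothetical value; only t_max and eta_min come from the summary.
    """
    # Clamp so that epochs past t_max stay at the floor instead of rising again.
    t = min(max(epoch, 0), t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * t / t_max))


if __name__ == "__main__":
    for epoch in range(4):
        print(epoch, cosine_lr(epoch, base_lr=1e-3))
```

With `t_max` equal to the epoch count, the rate starts at `base_lr` and reaches `eta_min` exactly at the end of the third epoch.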
Evaluation
sliding window eval
parameters: {"stride":64}
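As a rough illustration of sliding window evaluation with a stride of 64, the sketch below enumerates the windows so that every token is scored exactly once while later windows reuse earlier tokens as context. The context length `ctx_len` is a hypothetical parameter not given in the summary, and the PR's actual windowing may differ.

```python
def sliding_windows(n_tokens, ctx_len, stride=64):
    """Yield (begin, end, score_from) triples.

    The model would see tokens[begin:end] and only the losses for
    tokens[score_from:end] would be counted. ctx_len is a hypothetical
    context length; only stride=64 comes from the summary.
    """
    scored_upto = 0
    while scored_upto < n_tokens:
        if scored_upto == 0:
            # First window: score everything it contains.
            end = min(ctx_len, n_tokens)
        else:
            # Later windows: advance by stride and score only the new tokens.
            end = min(scored_upto + stride, n_tokens)
        begin = max(0, end - ctx_len)
        yield begin, end, scored_upto
        scored_upto = end


if __name__ == "__main__":
    for window in sliding_windows(n_tokens=200, ctx_len=128, stride=64):
        print(window)
```

A smaller stride gives each scored token more left context at the cost of more forward passes, which matters under a fixed evaluation time budget.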
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"huber_weight_decay":true}
Regularization
weight decay
parameters: {"type":"Huber","delta_rule":"3/sqrt(fan_in)"}
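One plausible reading of Huber-style weight decay is a decoupled decay step whose pull on each weight is the gradient of a Huber penalty: proportional to the weight inside `[-delta, delta]` and constant in magnitude outside it, with `delta = 3/sqrt(fan_in)` as in the parameters above. The sketch below follows that assumption; `lr` and `wd` are hypothetical values, and the PR's actual update may be defined differently.

```python
import math


def huber_decay_step(weights, fan_in, lr, wd, delta_scale=3.0):
    """Apply one decoupled Huber-style decay step to a flat list of weights.

    Assumed form (not confirmed by the summary): the decay gradient is
    clip(w, -delta, delta) with delta = delta_scale / sqrt(fan_in).
    lr and wd are hypothetical hyperparameters.
    """
    delta = delta_scale / math.sqrt(fan_in)
    decayed = []
    for w in weights:
        # Huber penalty gradient: w when |w| <= delta, else delta * sign(w).
        grad = max(-delta, min(delta, w))
        decayed.append(w - lr * wd * grad)
    return decayed


if __name__ == "__main__":
    # fan_in = 9 gives delta = 1.0, so 2.0 and -4.0 fall in the linear regime.
    print(huber_decay_step([0.5, 2.0, -4.0], fan_in=9, lr=0.1, wd=1.0))
```

Under this reading, small weights decay as with ordinary L2 weight decay, while large weights are shrunk by a bounded amount per step rather than in proportion to their size.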
Quantization
GPTQ
bits: 6
scope: model weights
Weight Averaging
EMA
parameters: null
Architecture
depth recurrence
3-layer recurrence with parallel residuals and SP8192 stack
parameters: {"layers":3}
Other
other
CaseOps byte sidecar for honest BPB accounting
parameters: null
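For context, bits per byte normalizes the summed token loss by the number of bytes of original text rather than by the token count. The sketch below assumes the byte sidecar supplies, for each scored token, the number of original bytes it accounts for; that interpretation and the function name are assumptions, not details taken from the PR.

```python
import math


def bits_per_byte(token_nll_nats, token_byte_counts):
    """Convert per-token negative log-likelihoods (in nats) to bits per byte.

    token_byte_counts is assumed to come from a byte sidecar that records how
    many bytes of original text each token covers (hypothetical format).
    """
    if len(token_nll_nats) != len(token_byte_counts):
        raise ValueError("need one byte count per scored token")
    total_bytes = sum(token_byte_counts)
    if total_bytes <= 0:
        raise ValueError("total byte count must be positive")
    # Divide by ln(2) to convert nats to bits, then normalize by bytes.
    return sum(token_nll_nats) / (math.log(2) * total_bytes)


if __name__ == "__main__":
    nll = [math.log(2)] * 4  # exactly one bit of loss per token
    print(bits_per_byte(nll, [2, 2, 2, 2]))
```

Counting bytes from the original text keeps the metric comparable across tokenizers, since a transform that changes the token stream does not change the denominator.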
Novel Contributions
- 3-epoch pre-quant TTT schedule adapted for SDPA-only environments
- Odd-epoch-only diagnostic evaluation with runtime budget guard
- Huber-style weight decay variant for Muon
- FA3-less rebalancing of the pre-quant TTT stack to fit the 600s budget