val_bpb
1.0809
Architecture
Transformer
Optimizer
SGD
Artifact Size
~16.0 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: matrices
GPTQ
bits: 8
scope: embeddings
Architecture
depth recurrence
3-layer recurrence loop over layers 3-5 (num_loops = 2), activated 35% of the way through training (activate_frac = 0.35).
parameters: {"layers":[3,4,5],"num_loops":2,"activate_frac":0.35}
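One way to read these parameters is as a layer execution schedule. The sketch below is a minimal illustration, not the submission's code. It assumes that `num_loops` counts extra passes through the looped layers and that the loop switches on once training progress reaches `activate_frac`. The function name, the `progress` argument, and the 8-layer total are hypothetical.

```python
def layer_schedule(num_layers, loop_layers, num_loops, progress, activate_frac):
    """Return the order in which layer indices are executed.

    Assumption: `num_loops` is the number of extra passes through
    `loop_layers`, and the loop is enabled once `progress` (fraction of
    training completed, 0..1) reaches `activate_frac`.
    """
    order = []
    last = loop_layers[-1]
    active = progress >= activate_frac
    for i in range(num_layers):
        order.append(i)
        if active and i == last:
            # Re-run the recurrent block before moving on to later layers.
            order.extend(loop_layers * num_loops)
    return order


# Hypothetical 8-layer model; only the looped layers, loop count, and
# activation fraction come from the card.
early = layer_schedule(8, [3, 4, 5], 2, progress=0.10, activate_frac=0.35)
late = layer_schedule(8, [3, 4, 5], 2, progress=0.50, activate_frac=0.35)
```

Under these assumptions `early` is the plain 0..7 order, while `late` revisits layers 3-5 twice more before continuing to layer 6. The looped layers reuse their weights, so effective depth grows without adding parameters.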
U-Net skip connections
Gated U-Net-style skip connections added to the network.
parameters: null
weight tying
Tied input and output embeddings.
parameters: null
Partial RoPE
Rotary position embeddings applied to 16 of 64 dimensions; the remaining 48 are left unrotated.
parameters: {"dimensions":16,"total_dimensions":64}
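A small sketch of what rotating only 16 of 64 dimensions can look like for a single vector. The adjacent-pair layout, the base of 10000, and the function name are assumptions for illustration; the card does not say which layout or base the submission uses.

```python
import math


def partial_rope(x, position, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` entries of `x`; pass the rest through.

    Assumptions: adjacent-pair layout (2i, 2i+1) and base 10000.
    Real implementations differ in both choices.
    """
    out = list(x)
    for i in range(rot_dims // 2):
        theta = position * base ** (-2.0 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


vec = [float(i + 1) for i in range(64)]  # stand-in for one 64-dim query/key
rotated = partial_rope(vec, position=7)
```

Only the first 16 entries carry positional information here; the other 48 are returned unchanged.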
LeakyReLU
LeakyReLU activation used in the MLP.
parameters: {"slope":0.5}
MLP4x
MLP hidden width expanded to 4x.
parameters: {"multiplier":4}
Regularization
logit softcap
parameters: {"value":30}
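For reference, a common way to implement a logit softcap is a scaled tanh. The sketch below assumes that form with the cap of 30 listed above; the card does not confirm the exact function used.

```python
import math


def softcap(logits, cap=30.0):
    """Squash logits smoothly into (-cap, cap): cap * tanh(z / cap).

    Assumption: tanh-based softcap; the submission's exact form is not stated.
    """
    return [cap * math.tanh(z / cap) for z in logits]


capped = softcap([-1000.0, -1.0, 0.0, 1.0, 1000.0])
```

Small logits pass through almost unchanged, while extreme values saturate near the cap, which bounds the logits' magnitude.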
layerwise LN scale
parameters: null
Optimizer
SGD
weight_decay: 0.095
momentum: 0.9
other_params: {"learning_rate":0.005,"epochs_per_chunk":3}
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R","row_normalized":true,"newton_schulz_steps":5}
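The `newton_schulz_steps` setting refers to the iteration Muon uses to approximately orthogonalize an update matrix. The following pure-Python sketch uses the quintic coefficients from the public Muon reference implementation and assumes a matrix with no more rows than columns. It does not attempt to reproduce the MuonEq-R row normalization, whose details are not given here; helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=5, coeffs=(3.4445, -4.7750, 2.0315)):
    """Push the singular values of `g` toward 1 with a quintic iteration.

    Assumption: rows <= columns. Coefficients follow the public Muon
    reference implementation, not necessarily this submission.
    """
    a, b, c = coeffs
    norm = math.sqrt(sum(v * v for row in g for v in row)) + 1e-7
    x = [[v / norm for v in row] for row in g]  # Frobenius-normalize first
    for _ in range(steps):
        xxt = matmul(x, transpose(x))   # A = X X^T
        xxt2 = matmul(xxt, xxt)         # A^2
        poly = [[b * p + c * q for p, q in zip(r1, r2)] for r1, r2 in zip(xxt, xxt2)]
        px = matmul(poly, x)            # (b A + c A^2) X
        x = [[a * u + v for u, v in zip(r1, r2)] for r1, r2 in zip(x, px)]
    return x


# Singular values 3 and 1 before the iteration; both end up roughly near 1.
ortho = newton_schulz([[3.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
```

The iteration does not converge exactly to singular values of 1; with these coefficients it trades precision for speed and leaves them in a band around 1.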
Weight Averaging
EMA
parameters: {"decay":0.9965}
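As a reminder of what the decay value controls, here is a minimal EMA update over a flat list of weights, assuming the standard form `ema = decay * ema + (1 - decay) * w`. The function name and the toy weights are illustrative only.

```python
def ema_update(ema, weights, decay=0.9965):
    """One EMA step: keep `decay` of the average, blend in the new weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


# Toy run: the average starts at 0 and tracks weights held constant at 1.
ema = [0.0]
for _ in range(1000):
    ema = ema_update(ema, [1.0])
```

With a decay of 0.9965 the average has an effective horizon of roughly 1 / (1 - 0.9965), about 286 steps.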
Evaluation
sliding window eval
parameters: null
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3}
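The "score-first" ordering can be sketched as a loop in which each chunk is evaluated with the current weights before the model trains on it, so no chunk is scored by a model that has already seen it. The toy scalar "model", the chunk data, and all function names below are hypothetical stand-ins; only the learning rate and epoch count come from the card.

```python
def score_first_ttt(chunks, score_fn, train_fn, epochs=3):
    """Score each chunk with the current weights, then adapt on it."""
    losses = []
    for chunk in chunks:
        losses.append(score_fn(chunk))  # scored before any update on this chunk
        for _ in range(epochs):
            train_fn(chunk)             # adaptation only benefits later chunks
    return losses


# Toy stand-in for a model: a single scalar that predicts each value.
state = {"w": 0.0}
LR = 0.005  # learning rate listed on the card


def score_fn(chunk):
    return sum((x - state["w"]) ** 2 for x in chunk) / len(chunk)


def train_fn(chunk):
    grad = sum(2.0 * (state["w"] - x) for x in chunk) / len(chunk)
    state["w"] -= LR * grad  # plain SGD step, no momentum or weight decay


losses = score_first_ttt([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]], score_fn, train_fn)
```

The first chunk is scored by the unadapted model, and later chunks benefit only from updates made on earlier ones.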
LR Schedule
warmdown
parameters: {"frac":0.72}
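A warmdown fraction of 0.72 is commonly read as: hold the learning rate flat, then decay it over the final 72% of training. The sketch below assumes a linear decay to zero and no warmup; both the shape and the function name are assumptions rather than details from the card.

```python
def lr_multiplier(step, total_steps, warmdown_frac=0.72):
    """Assumed shape: flat at 1.0, then linear decay to 0 over the
    final `warmdown_frac` of training."""
    progress = step / total_steps
    start = 1.0 - warmdown_frac
    if progress <= start:
        return 1.0
    return max(0.0, (1.0 - progress) / warmdown_frac)


# Multipliers at a few points of a hypothetical 1000-step run.
schedule = [lr_multiplier(s, 1000) for s in (0, 200, 640, 1000)]
```

Under this reading the multiplier stays at 1.0 for the first 28% of steps, passes 0.5 at 64% of training, and reaches 0 at the end.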
Novel Contributions
- QK_GAIN_INIT increased to 5.5, extending the monotonic improvement trend beyond 5.25
- 3-seed record result with mean val_bpb 1.0809
- Combination of SP8192, 3-layer depth recurrence, parallel residuals, and legal TTT
- Legal score-first test-time training under Track B constraints