PR #1557
openRecord: SP8192 + Improved Parallel Residuals + Muon 0.97 + TTT 5ep + N-gram Tilt + Hessian SDClip — val_bpb 1.07730
by ndokutovich
val_bpb: 1.0773
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.97 MB
Training Techniques
Architecture
depth recurrence
3-layer depth recurrence applied in layers 3-5.
parameters: {"layers":[3,4,5]}
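The depth recurrence above (a shared block spanning layers 3-5) can be sketched as follows. This is a minimal illustration, not the submission's code: the function name, and the assumption that the 3-layer block is looped a fixed number of times, are mine.

```python
def run_stack(x, layers, recur_layers=(3, 4, 5), steps=3):
    """Apply an 11-layer stack in which the block at `recur_layers`
    shares weights and is applied `steps` times (depth recurrence).
    The step count of 3 is an assumption, not stated in the entry."""
    i = 0
    while i < len(layers):
        if i == recur_layers[0]:
            block = [layers[j] for j in recur_layers]
            for _ in range(steps):      # reuse the same 3 layers' weights
                for layer in block:
                    x = layer(x)
            i = recur_layers[-1] + 1
        else:
            x = layers[i](x)
            i += 1
    return x
```

With 11 identity-plus-one layers and the defaults, layers 3-5 run three times, so parameter reuse adds depth without adding weights.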
MLP3x
11-layer Transformer with 512-dim hidden size, 8 attention heads, 4 KV heads, and a 4x MLP; uses improved parallel residuals.
parameters: {"layers":11,"dimensions":512,"heads":8,"kv_heads":4,"mlp_multiplier":4}
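For context, a plain parallel-residual block (GPT-J style) computes attention and MLP from the same normalized input and sums both into the residual stream; what exactly the "improved" variant changes is not specified in this entry. A minimal sketch:

```python
def parallel_residual_block(x, attn, mlp, norm):
    """Parallel residuals: attention and MLP both read the same
    normalized input; their outputs are added to the residual.
    The submission's "improved" variant is not public, so this
    shows only the baseline formulation."""
    h = norm(x)
    return x + attn(h) + mlp(h)
```

Compared with sequential residuals (x -> attn -> mlp), the two sub-blocks here can be computed concurrently, which is the usual motivation.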
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"row_normalized":true,"matrix_lr":0.03}
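A Muon step with the momentum and matrix learning rate listed above can be sketched as below. The quintic Newton-Schulz coefficients come from the public Muon reference implementation; the entry's `row_normalized` option is omitted, and all function names are illustrative.

```python
import numpy as np

def newton_schulz_orth(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a matrix with a quintic
    Newton-Schulz iteration (coefficients from the public Muon
    reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # bound the spectral norm by 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.03, momentum=0.97):
    """One Muon update on a weight matrix: accumulate momentum
    (0.97 per this entry), orthogonalize, step with matrix_lr=0.03."""
    buf = momentum * buf + grad
    return W - lr * newton_schulz_orth(buf), buf
```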
Weight Averaging
EMA
parameters: {"decay":0.997}
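EMA weight averaging with the listed decay is a one-liner per parameter; a minimal sketch:

```python
def ema_update(avg, params, decay=0.997):
    """Exponential moving average of weights (decay 0.997 per this
    entry): the averaged copy moves 0.3% toward the live weights
    after each step, and is typically used for evaluation."""
    return [decay * a + (1 - decay) * p for a, p in zip(avg, params)]
```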
Quantization
GPTQ
bits: 6
scope: attn+MLP
int8
bits: 8
scope: embeddings
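The bit widths above (6-bit attn+MLP, 8-bit embeddings) can be illustrated with plain symmetric per-row quantization. Note this sketch is not GPTQ itself: GPTQ additionally compensates rounding error column-by-column using second-order (Hessian) information, which is omitted here.

```python
def quantize_symmetric(row, bits):
    """Symmetric per-row quantization to signed `bits`-bit integers:
    6-bit for attn+MLP weights, 8-bit for embeddings in this entry.
    (GPTQ's Hessian-based error compensation is not shown.)"""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in row) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in row]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]
```

At 6 bits each weight takes values in [-32, 31], which is where most of the ~15.97 MB artifact-size saving comes from.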
Regularization
Hessian-Aware SDClip
parameters: {"lambda":0.175}
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","epochs":5,"learning_rate":0.005}
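One plausible reading of "score-first" TTT, sketched below: each evaluation chunk is scored with the current weights before the model trains on it, so every score is computed on data the model has not yet fit. The loop shape is an assumption; optimizer=SGD, epochs=5, and lr=0.005 are from this entry.

```python
def score_first_ttt(model_loss, model_step, chunks, epochs=5, lr=0.005):
    """Score-first test-time training (assumed reading): score each
    chunk, then adapt on it with SGD for `epochs` passes before
    moving to the next chunk. Returns the mean pre-adaptation loss."""
    total = 0.0
    for chunk in chunks:
        total += model_loss(chunk)      # score before adapting
        for _ in range(epochs):         # then 5 SGD epochs, lr=0.005
            model_step(chunk, lr)
    return total / len(chunks)
```

Scoring before adapting keeps the reported loss honest: later chunks still benefit from adaptation on earlier ones.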
Other
Causal n-gram tilt
Causal n-gram tilt applied during evaluation with prefix-only normalization and agreement weighting.
parameters: {"beta":2,"agree":0.1}
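The exact tilt formula is not public; the sketch below shows one way such a tilt could work, with counts taken only from the causal prefix and the `agree` weight damping the n-gram contribution. The smoothing and the way `beta` and `agree` combine are my assumptions.

```python
import math

def ngram_tilt(logits, ngram_counts, beta=2.0, agree=0.1):
    """Hypothetical causal n-gram tilt: shift each token's logit by a
    log-probability from an add-1-smoothed n-gram model built over the
    causal prefix only, scaled by beta (2) and agreement weight (0.1).
    The submission's actual formula is not specified."""
    total = sum(ngram_counts) or 1
    vocab = len(ngram_counts)
    tilted = []
    for logit, count in zip(logits, ngram_counts):
        ngram_logp = math.log((count + 1) / (total + vocab))
        tilted.append(logit + agree * beta * ngram_logp)
    return tilted
```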
Compression
Brotli
level: 11
Novel Contributions
- Improved parallel residuals
- Score-first TTT
- Causal n-gram tilt
- Hessian-Aware SDClip
- Muon 0.97 with matrix learning rate tuning
- GPTQ int6 with int8 embedding quantization