PR #302
Status: open
Non-record: 11L int5/int6 + XSA + online TTT w/ decay prior (single-run val_bpb=1.1520)
by JackYoung27
val_bpb: 1.1520
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.1 MB
Training Techniques
Quantization
mixed int5/int6
bits: null
scope: MLP int5, attention int6
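A minimal sketch of the mixed scheme, assuming symmetric per-tensor fake quantization (the PR does not specify the quantizer); only the int5/int6 split between MLP and attention comes from the entry:

```python
def fake_quant(weights, bits):
    # Symmetric per-tensor fake quantization to signed `bits`-bit integers:
    # int5 -> levels in [-15, 15], int6 -> levels in [-31, 31].
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]

def quantize_params(params):
    # Scope from the PR entry: MLP weights get int5, attention weights int6.
    return {name: fake_quant(w, 5 if "mlp" in name else 6)
            for name, w in params.items()}
```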
Architecture
XSA
Uses XSA in the last 3 layers.
parameters: {"layers":3}
BigramHash
BigramHash feature with vocabulary size 10240.
parameters: {"dimensions":10240}
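One way to realize a hashed bigram feature of this size; the hash constant and layout are assumptions, only the 10240-bucket vocabulary comes from the entry:

```python
N_BUCKETS = 10240  # "dimensions" from the PR entry

def bigram_bucket(prev_id, cur_id, n_buckets=N_BUCKETS):
    # Multiplicative hash of the (previous, current) token pair;
    # the prime constant is illustrative, not from the PR.
    return (prev_id * 1000003 + cur_id) % n_buckets

def bigram_features(token_ids):
    # One bucket id per position (from the second token on); each would
    # index an embedding table added to the model's input features.
    return [bigram_bucket(p, c) for p, c in zip(token_ids, token_ids[1:])]
```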
MLP3x
Three-layer MLP blocks.
parameters: {"layers":3}
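A sketch of a three-layer MLP block (versus the usual two-layer transformer MLP); widths and the ReLU nonlinearity are illustrative:

```python
def linear(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def mlp3x(x, W1, W2, W3):
    # Three stacked linear layers with nonlinearities between them;
    # the final projection is left linear, as in a standard MLP block.
    relu = lambda h: [max(0.0, v) for v in h]
    return linear(W3, relu(linear(W2, relu(linear(W1, x)))))
```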
weight tying
Tied fp16 embeddings.
parameters: null
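Weight tying reuses one embedding matrix for both token lookup and the output head (the PR stores it in fp16); a minimal sketch:

```python
def embed(E, token_id):
    # Input side: row lookup in the shared embedding matrix E.
    return E[token_id]

def output_logits(E, hidden):
    # Output side: the same E serves as the unembedding, logits = E @ h,
    # so no separate lm_head matrix is stored (smaller artifact).
    return [sum(e * h for e, h in zip(row, hidden)) for row in E]
```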
Initialization
OrthoInit
Orthogonal initialization with muP.
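A sketch of orthogonal initialization via Gram–Schmidt on a Gaussian matrix; how the PR combines this with muP is not specified, so the 1/sqrt(fan_in) gain below is an assumption:

```python
import math, random

def ortho_init(n, fan_in=None, seed=0):
    # Orthonormalize Gaussian rows with Gram-Schmidt (the Q factor of a QR
    # decomposition), then apply a muP-style 1/sqrt(fan_in) gain (assumed).
    rng = random.Random(seed)
    basis = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for b in basis:
            d = sum(x * y for x, y in zip(v, b))
            v = [x - d * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        basis.append([x / norm for x in v])
    gain = 1.0 / math.sqrt(fan_in) if fan_in else 1.0
    return [[gain * x for x in row] for row in basis]
```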
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
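Muon applies momentum SGD with the update matrix orthogonalized by a Newton–Schulz iteration. The sketch below uses the simpler cubic iteration (Muon's reference implementation uses a tuned quintic); only weight_decay=0.04 comes from the PR, the other hyperparameters are illustrative:

```python
import math

def matmul(A, B):
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def orthogonalize(G, steps=12):
    # Cubic Newton-Schulz: X <- 1.5 X - 0.5 (X X^T) X converges to the
    # nearest orthogonal factor after Frobenius normalization.
    norm = math.sqrt(sum(x * x for row in G for x in row))
    X = [[x / norm for x in row] for row in G]
    for _ in range(steps):
        Y = matmul(matmul(X, [list(c) for c in zip(*X)]), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    return X

def muon_step(W, G, M, lr=0.02, momentum=0.95, weight_decay=0.04):
    # Momentum buffer, orthogonalized update, decoupled weight decay.
    M = [[momentum * m + g for m, g in zip(rm, rg)] for rm, rg in zip(M, G)]
    U = orthogonalize(M)
    W = [[(1.0 - lr * weight_decay) * w - lr * u for w, u in zip(rw, ru)]
         for rw, ru in zip(W, U)]
    return W, M
```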
Weight Averaging
SWA
parameters: {"start":200}
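SWA keeps a running equal-weight average of checkpoints once training passes the start step (200 here, from the PR's parameters); a minimal sketch:

```python
def swa_average(checkpoints, start=200):
    # Equal-weight running average of all parameter snapshots taken at or
    # after `start` (the PR's parameter); earlier checkpoints are ignored.
    avg, n = None, 0
    for step, w in checkpoints:
        if step < start:
            continue
        if avg is None:
            avg, n = list(w), 1
        else:
            n += 1
            avg = [a + (wi - a) / n for a, wi in zip(avg, w)]
    return avg
```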
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
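Sliding-window evaluation scores each stride of new tokens conditioned on up to a full window of left context; the window size below is an assumption (only stride=64 is given in the entry):

```python
def sliding_window_spans(n_tokens, window=512, stride=64):
    # Each tuple is (ctx_start, ctx_end, n_scored): the model sees
    # tokens[ctx_start:ctx_end] but only the last n_scored positions
    # contribute to val_bpb, so every scored token has long left context.
    spans, scored_to = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - scored_to))
        scored_to = end
        if end == n_tokens:
            break
    return spans
```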
Test-Time Training
full TTT
parameters: {"scope":"MLP weights in last 3 blocks","learning_rate":null,"decay_prior":true}
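The online TTT update can be sketched as a gradient step on the just-seen context plus a decay pull back toward the pretrained weights, which bounds drift. In the PR this applies only to MLP weights in the last 3 blocks; the learning rate and decay strength below are illustrative (the entry leaves learning_rate null):

```python
def ttt_step(w, grad, w_prior, lr=1e-3, decay=0.01):
    # Gradient step on the current context, then shrink the deviation
    # from the pretrained prior by (1 - decay) -- the "decay prior".
    stepped = [wi - lr * g for wi, g in zip(w, grad)]
    return [pi + (1.0 - decay) * (si - pi) for si, pi in zip(stepped, w_prior)]
```

With zero gradient the weights relax geometrically back to the prior, so adaptation on one document cannot accumulate into unbounded drift across a long evaluation stream.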
Other
other
Pre-Q/K RMSNorm applied to attention input before Q and K projections only.
parameters: null
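A sketch of the Pre-Q/K RMSNorm described above: the attention input is RMS-normalized before the Q and K projections only, while V projects the raw input (matrix shapes illustrative):

```python
import math

def rms_norm(x, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def qkv_proj(x, Wq, Wk, Wv):
    # Normalize only the Q/K path (stabilizing the RoPE-facing attention
    # logits under int5/int6, per the PR); V sees the raw input.
    proj = lambda W, h: [sum(w * v for w, v in zip(row, h)) for row in W]
    xn = rms_norm(x)
    return proj(Wq, xn), proj(Wk, xn), proj(Wv, x)
```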
other
Reptile meta-learning with K=1 inner SGD step and interpolation during the last 10% of training.
parameters: {"k":1,"train_fraction":0.1}
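Reptile with K=1 reduces to: take one inner SGD step on a sampled context, then interpolate the outer weights toward the adapted weights; the interpolation coefficient and inner learning rate below are illustrative:

```python
def reptile_outer_step(w, inner_grad, inner_lr=0.1, epsilon=0.5):
    # K=1 inner loop: a single SGD step on a sampled context, then the
    # outer weights move a fraction epsilon toward the adapted weights,
    # pre-conditioning the model for eval-time TTT adaptation.
    adapted = [wi - inner_lr * g for wi, g in zip(w, inner_grad)]
    return [wi + epsilon * (ai - wi) for wi, ai in zip(w, adapted)]
```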
Novel Contributions
- Pre-Q/K RMSNorm to stabilize the RoPE-facing path under int5/int6
- Online causal TTT with Krause-style decay prior to prevent drift
- Reptile meta-learning in the last 10% of training to improve eval-time TTT adaptation
- Evaluation-time adaptation of MLP weights in the last 3 blocks only