PR #338 (open)
Record: 11L XSA+EMA+TTT, sliding val_bpb=1.1254 (3-seed mean 1.1256)
by alertcat
val_bpb: 1.1254
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.55 MB
Training Techniques
Architecture
XSA
Exclusive Self Attention applied to the last 4 layers.
parameters: {"layers":4}
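The PR does not define XSA here, so the sketch below encodes one plausible reading: a strictly causal attention mask that excludes each token's own position (attention is "exclusive" of self). The function names and the interpretation itself are assumptions, not the submission's implementation.

```python
import math

def xsa_mask(n):
    """One plausible reading of Exclusive Self Attention (assumption):
    position i may attend only to positions j < i, never to itself."""
    return [[j < i for j in range(n)] for i in range(n)]

def masked_attention(scores, values, mask):
    """Row-wise masked softmax over scores, then a weighted sum of values."""
    out = []
    for i, row in enumerate(scores):
        exps = [math.exp(s) if m else 0.0 for s, m in zip(row, mask[i])]
        z = sum(exps)
        if z == 0.0:                   # position 0 has nothing to attend to
            out.append([0.0] * len(values[0]))
            continue
        out.append([sum(e / z * v[k] for e, v in zip(exps, values))
                    for k in range(len(values[0]))])
    return out
```

With uniform scores, each position simply averages the values of all earlier positions.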
EMA
Exponential moving average component with decay 0.997.
parameters: {"decay":0.997}
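The EMA update itself is standard; a minimal sketch with the decay 0.997 from this entry, written over a dict of scalars for clarity (real models apply it elementwise to tensors):

```python
def ema_update(ema, params, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * param."""
    for k, p in params.items():
        ema[k] = decay * ema[k] + (1.0 - decay) * p
    return ema
```

Evaluation then uses the EMA copy of the weights rather than the raw training weights.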
MLP3x
Transformer MLP expanded to 3x hidden size.
parameters: {"expansion":3}
SmearGate
Learned token blending gate.
parameters: null
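No parameters are listed for SmearGate, so the sketch below is a guess at the mechanism: each token's embedding is blended ("smeared") with its predecessor's, weighted by a sigmoid of a learned gate logit. The single scalar gate is an illustrative simplification.

```python
import math

def smear(embeddings, gate_logit):
    """Blend each token embedding with the previous token's embedding,
    weighted by sigmoid(gate_logit); gate_logit would be learned."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    out = [list(embeddings[0])]          # first token has no predecessor
    for prev, cur in zip(embeddings, embeddings[1:]):
        out.append([c + g * p for p, c in zip(prev, cur)])
    return out
```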
BigramHash
Bigram hashing module with 2048 buckets.
parameters: {"buckets":2048}
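A bucketed bigram feature can be computed by hashing each adjacent token-id pair into one of the 2048 buckets listed here; the bucket index would then look up an auxiliary embedding. The specific mixing constants below are illustrative, not the submission's hash.

```python
def bigram_bucket(prev_id, cur_id, n_buckets=2048):
    """Hash a (previous, current) token-id pair into one of n_buckets.
    The multiplicative mix is illustrative; any stable pair hash works."""
    h = (prev_id * 1000003 + cur_id) * 2654435761 % (1 << 32)
    return h % n_buckets

def bigram_buckets(ids, n_buckets=2048):
    """Bucket index for every adjacent token pair in a sequence."""
    return [bigram_bucket(p, c, n_buckets) for p, c in zip(ids, ids[1:])]
```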
OrthoInit
Orthogonal initialization strategy.
parameters: null
Quantization
int6 QAT
bits: 6
scope: block weights
mixed int5/int6
bits: 5 (MLP), 6 (attention)
scope: MLP and attention weights
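The usual QAT building block is symmetric fake quantization: weights are snapped to a signed b-bit grid in the forward pass (with a straight-through gradient in training). A minimal per-tensor sketch, where bits would be 5 for MLP weights and 6 for attention weights as this entry describes:

```python
def fake_quant(weights, bits):
    """Snap weights to a signed b-bit symmetric grid and return the
    dequantized values (qmax = 15 for int5, 31 for int6)."""
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale
            for w in weights]
```

Per-channel scales and a learned clipping range are common refinements; this keeps only the core rounding step.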
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: null
SGD
weight_decay: null
momentum: 0.9
other_params: {"used_for":"TTT fine-tuning"}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"interval":200,"checkpoint_avg":7}
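SWA here is a uniform average over periodically saved checkpoints (7 checkpoints taken every 200 steps, per the parameters). A minimal sketch over checkpoints stored as dicts of weight lists:

```python
def swa_average(checkpoints):
    """Uniform elementwise average of a list of checkpoints
    (each a dict mapping parameter name -> list of weights)."""
    n = len(checkpoints)
    return {name: [sum(c[name][i] for c in checkpoints) / n
                   for i in range(len(weights))]
            for name, weights in checkpoints[0].items()}
```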
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
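With stride 64, each evaluation step scores only the next 64 tokens while reusing up to a full window of left context, so every token is scored exactly once but with much more context than disjoint chunks would give. A sketch of the loop; the window size of 512 and the `nll` callback signature are assumptions (the entry only fixes stride=64):

```python
import math

def sliding_window_bits(nll, ids, window=512, stride=64):
    """Strided sliding-window evaluation. `nll(ctx, n)` must return the
    summed negative log-likelihood (nats) of the last n tokens of ctx
    under the model. Returns total bits for the sequence."""
    total, pos = 0.0, 0
    while pos < len(ids):
        n = min(stride, len(ids) - pos)
        ctx = ids[max(0, pos + n - window):pos + n]
        total += nll(ctx, n)
        pos += n
    return total / math.log(2)   # nats -> bits; divide by byte count for bpb
```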
Test-Time Training
full TTT
parameters: {"epochs":3,"learning_rate":0.002,"momentum":0.9,"frozen_blocks":2}
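A generic loop matching these settings: SGD with momentum 0.9 and learning rate 0.002 over the eval stream for 3 epochs, with the frozen blocks excluded from updates. The dict-of-scalars model and `grad_fn` callback are illustrative stand-ins for real tensors and a backward pass:

```python
def ttt_finetune(params, grad_fn, data, epochs=3, lr=0.002,
                 momentum=0.9, frozen=()):
    """Test-time training: SGD + momentum on the eval stream, skipping
    parameters whose names are in `frozen` (the frozen blocks).
    `grad_fn(params, batch)` returns a dict name -> gradient."""
    buf = {k: 0.0 for k in params}
    for _ in range(epochs):
        for batch in data:
            for k, g in grad_fn(params, batch).items():
                if k in frozen:
                    continue
                buf[k] = momentum * buf[k] + g
                params[k] -= lr * buf[k]
    return params
```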
Initialization
OrthoInit
Orthogonal initialization.
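Orthogonal initialization draws a random Gaussian matrix and orthonormalizes it; library implementations typically use a QR decomposition, but Gram-Schmidt shows the idea in a few lines:

```python
import random

def orthogonal_init(n, seed=0):
    """n x n orthogonal matrix via Gram-Schmidt on random Gaussian rows."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for q in rows:                       # remove components along earlier rows
            dot = sum(a * b for a, b in zip(q, v))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-8:                      # redraw on (rare) near-dependence
            rows.append([a / norm for a in v])
    return rows
```

Non-square weight matrices are handled the same way, orthonormalizing along the smaller dimension.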
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
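A warmdown schedule holds the learning rate constant and then decays it over the final steps of training. The entry only gives warmdown_steps=3000; the linear decay shape below is an assumption (it is the common choice in speedrun-style training):

```python
def warmdown_scale(step, total_steps, warmdown_steps=3000):
    """LR multiplier: 1.0 until the last warmdown_steps, then linear to 0."""
    if step < total_steps - warmdown_steps:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```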
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Novel Contributions
- First submission combining XSA (Exclusive Self Attention), EMA, and Test-Time Training.
- TTT adaptation on the validation token stream with 3 epochs of SGD fine-tuning.
- Mixed-precision quantization: int5 for MLP weights, int6 for attention weights.
- An 11-layer model enabled by the compression savings from int5 MLP quantization.
- Sliding-window evaluation with stride 64 to report val_bpb.