PR #549
Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1194 (3-seed mean)
by abaybektursun
val_bpb: 1.1194
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.95 MB
Training Techniques
Architecture
MLP3x
Three-layer MLP stack using the LeakyReLU(0.5)² activation (leaky ReLU with negative slope 0.5, then squared).
parameters: {"layers":3}
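A minimal PyTorch sketch of the activation, under one direct reading of the name: LeakyReLU with negative slope 0.5 followed by squaring. Whether the sign is restored after squaring is not stated in the record, so treat this as an assumption.

```python
import torch
import torch.nn.functional as F


class LeakyReLUSquared(torch.nn.Module):
    """LeakyReLU(negative_slope=0.5) followed by squaring.

    A drop-in replacement for the usual relu(x)**2 activation; the
    plain-square (sign-discarding) form is an assumption.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.leaky_relu(x, negative_slope=0.5).square()
```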
BigramHash
Hashed bigram token-feature embedding: each (previous, current) token pair is hashed into a fixed-size embedding table.
parameters: {"size":1536}
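A sketch of one plausible BigramHash implementation: hash the (previous, current) token pair into a 1536-row table and look up an embedding. The multiplier constant and the position-0 handling are illustrative assumptions, not details from the record.

```python
import torch
import torch.nn as nn


class BigramHashEmbedding(nn.Module):
    """Hashed bigram feature embedding (sketch).

    The (previous, current) token pair is hashed into a table of `size`
    rows; the 1000003 mixing constant and zero-fill at position 0 are
    assumptions for illustration.
    """

    def __init__(self, size: int = 1536, dim: int = 64):
        super().__init__()
        self.size = size
        self.emb = nn.Embedding(size, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # (B, T) int64
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no real predecessor at position 0
        idx = (prev * 1000003 + tokens) % self.size
        return self.emb(idx)
```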
XSA
XSA applied to the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Rotary positional embeddings applied partially to a subset of dimensions.
parameters: {"dimensions":16}
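A sketch of partial RoPE: rotate only the first 16 head dimensions and pass the remainder through unchanged. The exact dimension layout and frequency base are assumptions; the record specifies only that 16 dimensions are rotated.

```python
import torch


def partial_rope(x: torch.Tensor, rope_dims: int = 16, base: float = 10000.0):
    """Apply rotary embeddings to the first `rope_dims` head dims only.

    x: (B, H, T, D). The remaining D - rope_dims dims pass through
    unrotated. Frequency base and half-split layout are assumptions.
    """
    B, H, T, D = x.shape
    xr, xp = x[..., :rope_dims], x[..., rope_dims:]
    half = rope_dims // 2
    inv_freq = base ** (-torch.arange(half, dtype=x.dtype) / half)
    angles = torch.arange(T, dtype=x.dtype)[:, None] * inv_freq[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = xr[..., :half], xr[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, xp], dim=-1)
```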
KV head count
Uses 8 attention heads and 4 KV heads (grouped-query attention: each KV head is shared by 2 query heads).
parameters: {"heads":8,"kv_heads":4}
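A minimal sketch of the 8-query/4-KV grouped-query attention shape: expand each KV head to serve two query heads, then run standard scaled-dot-product attention. Causal masking is an assumption.

```python
import torch
import torch.nn.functional as F


def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention sketch: q has n_heads, k/v have n_kv_heads.

    Each KV head is shared by n_heads // n_kv_heads query heads via
    repeat_interleave; causal masking is assumed.
    """
    rep = n_heads // n_kv_heads
    k = k.repeat_interleave(rep, dim=1)  # (B, n_kv, T, Dh) -> (B, n_heads, T, Dh)
    v = v.repeat_interleave(rep, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```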
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
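The momentum warmup listed above (0.92 to the final 0.99 over 1500 steps) can be sketched as a simple schedule; linear interpolation is an assumption, since the record gives only the endpoints and step count.

```python
def muon_momentum(step: int, start: float = 0.92, end: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Warm Muon momentum from `start` to `end` over `warmup_steps`.

    Linear interpolation is an assumption; the record lists only the
    endpoints (0.92 -> 0.99) and the step count (1500).
    """
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```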
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"every":50,"tight":true}
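The EMA half of the weight-averaging stack is a standard exponential moving average over parameter tensors; a minimal sketch with the listed decay of 0.997 (the SWA side, averaging snapshots every 50 steps, is not reproduced here):

```python
import torch


@torch.no_grad()
def ema_update(ema_params, model_params, decay: float = 0.997):
    """One EMA step over weight tensors: ema <- decay*ema + (1-decay)*w."""
    for e, p in zip(ema_params, model_params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)
```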
Quantization
GPTQ-lite
bits: 6
scope: all
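"GPTQ-lite" is not specified further in the record; as a baseline, here is plain symmetric per-row round-to-nearest int6 quantization. A GPTQ-style method would presumably add error compensation on top of this, which is not reproduced here.

```python
import torch


def quantize_int6(w: torch.Tensor):
    """Symmetric per-row round-to-nearest int6 quantization (RTN baseline).

    This is only the baseline; the record's 'GPTQ-lite' presumably adds
    GPTQ-style error compensation, which is not shown.
    """
    qmax = 31  # signed 6-bit range is [-32, 31]
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -32, 31).to(torch.int8)
    return q, scale


def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover approximate float weights from int6 codes and per-row scales."""
    return q.to(scale.dtype) * scale
```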
Compression
lzma
level: null
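Since the record leaves the lzma level unset, a minimal round-trip sketch of compressing the quantized codes with Python's stdlib `lzma` (the preset and the one-code-per-byte packing are assumptions; tighter 6-bit packing is possible):

```python
import lzma

import numpy as np

# Int6 codes stored one per int8 byte for simplicity; compress losslessly.
codes = np.random.randint(-32, 32, size=4096, dtype=np.int8)
blob = lzma.compress(codes.tobytes(), preset=9)  # preset is an assumption
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8)
assert (restored == codes).all()  # LZMA round-trip is exact
```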
Evaluation
sliding window eval
parameters: {"stride":64}
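Sliding-window evaluation with stride 64 means each window scores only its final 64 tokens, reusing earlier tokens as context. A sketch of the span generator (the context length here is an assumed example; the record specifies only the stride):

```python
def sliding_eval_spans(n_tokens: int, context: int = 1024, stride: int = 64):
    """Yield (window_start, score_start, score_end) spans.

    Each window scores only its final `stride` tokens, with up to
    `context` tokens of left context. `context=1024` is an assumption.
    """
    score_start = 0
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        window_start = max(0, score_end - context)
        yield window_start, score_start, score_end
        score_start = score_end
```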
Test-Time Training
score-first TTT
parameters: {"chunk_size":32768,"epochs":3,"learning_rate":0.002,"momentum":0.9,"freeze_blocks":0,"gradient_clip":1,"legal":true}
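A sketch of what "legal score-first TTT" plausibly means: each chunk is scored under `torch.inference_mode()` before the model adapts on it, so the reported loss never comes from weights that have already seen that chunk. The `loss_fn(model, chunk)` interface is an assumption; the listed hyperparameters (3 epochs, clip 1, no frozen blocks) map onto the loop below.

```python
import torch


def score_first_ttt(model, loss_fn, chunks, optimizer, epochs=3, clip=1.0):
    """Score each chunk BEFORE adapting on it ('legal' score-first TTT sketch).

    `loss_fn(model, chunk)` returning a mean per-token loss is an assumed
    interface. Scoring happens under inference_mode; adaptation follows.
    """
    total, n = 0.0, 0
    for chunk in chunks:
        with torch.inference_mode():  # score with pre-adaptation weights
            total += loss_fn(model, chunk).item() * chunk.numel()
        n += chunk.numel()
        for _ in range(epochs):  # then adapt on the already-scored chunk
            optimizer.zero_grad()
            loss_fn(model, chunk).backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()
    return total / n
```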
LR Schedule
cosine decay
parameters: {"warmdown_steps":3500}
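A sketch of the schedule: hold the base LR, then cosine-decay to zero over the final 3500 steps. Holding constant before the warmdown is an assumption; the record lists only the schedule type and warmdown length.

```python
import math


def lr_at(step: int, total_steps: int, base_lr: float,
          warmdown_steps: int = 3500) -> float:
    """Constant LR, then cosine decay to zero over the last `warmdown_steps`.

    The constant phase before the warmdown is an assumption.
    """
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    frac = (step - start) / warmdown_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```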
Regularization
layerwise LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
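A sketch of the layerwise scale: multiply each layer's normalized output by 1/sqrt(layer + 1), damping deeper layers' contributions. Applying the factor to the LayerNorm output (rather than, say, its gain) is an assumption; the record gives only the formula.

```python
import torch
import torch.nn as nn


class DepthScaledLayerNorm(nn.Module):
    """LayerNorm whose output is multiplied by 1/sqrt(layer + 1).

    Applying the scale to the LN output is an assumption; the record
    specifies only the formula 1/sqrt(layer+1).
    """

    def __init__(self, dim: int, layer: int):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.scale = (layer + 1) ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ln(x) * self.scale
```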
Other
Parameter Banking with batched Newton-Schulz orthogonalization and async reduce-scatter/all-gather to speed up training.
parameters: {"step_time_ms":83.4}
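The Newton-Schulz orthogonalization at the heart of this step can be sketched with the quintic iteration from the public Muon implementation; the batching and async reduce-scatter/all-gather parts of Parameter Banking are distributed-systems plumbing and are not reproduced here.

```python
import torch


def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Quintic Newton-Schulz iteration approximately orthogonalizing G.

    Coefficients follow the public Muon implementation; Parameter
    Banking's batching/async communication is not shown.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)  # normalize so singular values are < 1
    transposed = X.size(0) > X.size(1)
    if transposed:  # iterate on the smaller Gram matrix side
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```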
Novel Contributions
- LeakyReLU(0.5)^2 activation replacing standard relu^2
- Legal score-first test-time training under torch.inference_mode()
- Parallel Muon / Parameter Banking optimizer stack
- All-block-unfrozen TTT adaptation (freeze=0) with 3 epochs
- GPTQ-lite int6 quantization with lzma compression