PR #754
open
Non-Record: 11L Parallel Muon + LeakyReLU² MLP3x + Legal TTT (val_bpb 1.1253)
by aryanbhosale
val_bpb
1.1253
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15 MB
Training Techniques
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.92
other_params: {"momentum_schedule":"0.92→0.99 over 1500 steps","newton_schulz_steps":5,"parameter_banking":true,"async_reduce_scatter_all_gather":true}
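The `momentum_schedule` entry reads as a ramp from 0.92 to 0.99 over the first 1500 steps. A minimal sketch follows, assuming the ramp is linear (the summary does not state its shape); `muon_momentum` is a hypothetical name, not a function from the PR.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum at a given optimizer step.

    Assumption: linear interpolation from `start` to `end`, then constant.
    """
    frac = min(max(step / warmup_steps, 0.0), 1.0)
    return (1.0 - frac) * start + frac * end
```

Under this assumption the momentum is 0.955 at step 750 and stays at 0.99 from step 1500 onward.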
Architecture
MLP3x
3x expansion MLP with LeakyReLU(0.5)^2 activation
parameters: {"hidden_dim":1536}
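As a scalar sketch of the activation named above: LeakyReLU with negative slope 0.5, followed by squaring. Whether the PR preserves the sign after squaring is not stated; this version uses a plain square, so negative inputs map to small positive outputs. The function names are illustrative.

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """LeakyReLU(negative_slope) followed by a plain square (sign is not preserved)."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y


def mlp_hidden_dim(model_dim, expansion=3):
    """Hidden width of an MLP with the given expansion factor."""
    return expansion * model_dim
```

If the expansion is exactly 3x, a `hidden_dim` of 1536 corresponds to a model width of 512.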
SmearGate
Additional gating mechanism in the architecture
parameters: null
BigramHash
Bigram hash feature module
parameters: {"size":1536,"dim":128}
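One way to picture a bigram hash module: each (previous token, current token) pair is hashed into one of 1536 buckets, and each bucket owns a 128-dimensional embedding. The sketch below only computes bucket indices. The mixing function and the handling of the first position are assumptions, since the summary does not specify them.

```python
BUCKETS = 1536   # "size" from the parameters above
EMBED_DIM = 128  # "dim" from the parameters above (width of each bucket's embedding)


def bigram_bucket(prev_token, cur_token, buckets=BUCKETS):
    """Map a token pair to a bucket index. Hypothetical hash, not taken from the PR."""
    return (prev_token * 1_000_003 + cur_token) % buckets


def bigram_buckets(tokens, buckets=BUCKETS):
    """Bucket index for every position in a token sequence."""
    if not tokens:
        return []
    ids = [0]  # assumption: the first position has no predecessor and uses bucket 0
    for prev, cur in zip(tokens, tokens[1:]):
        ids.append(bigram_bucket(prev, cur, buckets))
    return ids
```

The resulting indices would then select rows of a `BUCKETS x EMBED_DIM` embedding table.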
Value Residual
Caches V from layer 0 and blends via learned lambda
parameters: null
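The blend described above can be sketched as a per-element mix of the current layer's V with the V cached from layer 0. The convex-combination form below is an assumption; the summary only says the blend uses a learned lambda.

```python
def blend_values(v_layer, v_first, lam):
    """Mix the current layer's value vector with the one cached from layer 0.

    Assumption: convex combination (1 - lam) * v_layer + lam * v_first,
    where `lam` stands in for the learned scalar.
    """
    return [(1.0 - lam) * a + lam * b for a, b in zip(v_layer, v_first)]
```

With `lam = 0` the layer uses only its own values; with `lam = 1` it uses only the cached ones.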
Gated Attention
Per-head sigmoid gating for attention outputs
parameters: null
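A small sketch of per-head sigmoid gating: each head's output vector is scaled by a gate in (0, 1). Here the gate comes from one logit per head; whether the PR computes the gate from the input or from a free parameter is not stated, so treat that as an assumption.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_heads(head_outputs, gate_logits):
    """Scale each head's output vector by sigmoid(gate logit) for that head.

    head_outputs: list of per-head vectors (lists of floats)
    gate_logits:  one float per head (hypothetical parameterization)
    """
    return [[sigmoid(g) * x for x in head]
            for head, g in zip(head_outputs, gate_logits)]
```

A logit of 0 halves a head's contribution, while a large positive logit passes it through almost unchanged.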
XSA
Exclusive self-attention applied to the last 4 layers
parameters: {"layers":4}
Partial RoPE
Rotary positional embeddings applied to a subset of head dimensions
parameters: {"dimensions":"16/64"}
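To illustrate the 16/64 split: only the first 16 dimensions of a 64-dimensional head vector are rotated by position, and the remaining 48 pass through unchanged. The pairing convention (dimension `i` with `i + 8`) and the base of 10000 are assumptions, not details from the PR.

```python
import math


def partial_rope(head_vec, position, rotary_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rotary_dims` entries of one head vector."""
    out = list(head_vec)
    half = rotary_dims // 2
    for i in range(half):
        # One rotation frequency per pair of dimensions.
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = head_vec[i], head_vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```

Because each pair is rotated, the norm of the rotated slice is preserved, and position 0 leaves the vector unchanged.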
tied embeddings
Input and output embeddings are tied
parameters: null
Initialization
OrthoInit
Orthogonal initialization
Weight Averaging
EMA
parameters: {"decay":0.997}
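The EMA entry corresponds to the usual exponential moving average of the weights. A minimal sketch over flat lists of floats; the function name is illustrative.

```python
def ema_update(ema_weights, weights, decay=0.997):
    """One EMA step: keep `decay` of the running average, add (1 - decay) of the new weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```

With a decay of 0.997 the average has an effective horizon of roughly 1 / (1 - 0.997), or about 333 steps.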
SWA
parameters: {"interval":"every 50 steps when scale < 0.2"}
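The SWA rule above can be sketched as a running mean that only takes a snapshot when two conditions hold: the step is a multiple of 50 and the scale is below 0.2. This sketch assumes "scale" means the learning-rate schedule multiplier; the class and method names are hypothetical.

```python
class SWA:
    """Running average of weight snapshots, taken only late in the schedule."""

    def __init__(self, interval=50, scale_threshold=0.2):
        self.interval = interval
        self.scale_threshold = scale_threshold
        self.avg = None
        self.count = 0

    def maybe_update(self, step, lr_scale, weights):
        """Fold `weights` into the average if the snapshot conditions hold."""
        if lr_scale >= self.scale_threshold or step % self.interval != 0:
            return False
        self.count += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            # Incremental mean: avg += (w - avg) / n
            self.avg = [a + (w - a) / self.count for a, w in zip(self.avg, weights)]
        return True
```

How the PR combines the SWA average with the EMA weights is not stated in the summary and is not modeled here.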
Quantization
GPTQ-lite
bits: 6
scope: per-row weights
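A rough sketch of per-row 6-bit quantization with a clip search: for each weight row, several candidate clip values are tried and the one with the lowest squared reconstruction error is kept. The symmetric range [-31, 31], the five candidate percentiles, and the function names are all assumptions; the summary does not spell them out.

```python
def quantize_row(row, clip, levels=31):
    """Symmetric quantization of one row to integers in [-levels, levels]."""
    scale = clip / levels if clip > 0 else 1.0
    q = [max(-levels, min(levels, round(w / scale))) for w in row]
    return q, scale


def search_clip(row, percentiles=(0.90, 0.95, 0.99, 0.995, 1.0), levels=31):
    """Pick the clip (a percentile of |w|) that minimizes squared error for this row."""
    mags = sorted(abs(w) for w in row)
    best = None
    for p in percentiles:
        clip = mags[int(p * (len(mags) - 1))]
        q, scale = quantize_row(row, clip, levels)
        err = sum((w - qi * scale) ** 2 for w, qi in zip(row, q))
        if best is None or err < best[0]:
            best = (err, q, scale)
    return best[1], best[2]
```

Because the last candidate is the row maximum, the search can never do worse than plain max-abs scaling. It tends to help when a row contains a few large outliers.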
STE QAT
bits: 6
scope: all weights
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"chunk_size":32000}
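Sliding-window evaluation with stride 64 can be sketched as follows: every window after the first advances by 64 tokens and scores only the tokens not yet scored, using the earlier tokens as context. The context length `seq_len` is a hypothetical parameter (the summary does not give it), and the role of `chunk_size` is not modeled.

```python
def sliding_windows(num_tokens, seq_len, stride=64):
    """Return (window_start, window_end, score_start) triples.

    Tokens in [score_start, window_end) are scored; tokens in
    [window_start, score_start) serve only as context.
    """
    windows = []
    scored_upto = 0
    while scored_upto < num_tokens:
        # The first window scores a full context; later ones add `stride` new tokens.
        step = seq_len if scored_upto == 0 else stride
        end = min(scored_upto + step, num_tokens)
        start = max(0, end - seq_len)
        windows.append((start, end, scored_upto))
        scored_upto = end
    return windows
```

Each token is scored exactly once, and every token after the first window sees at least `seq_len - stride` tokens of context.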
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"momentum":0.9,"epochs":3,"chunk_size":32000}
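The ordering that makes this "score-first" can be sketched as a loop over chunks: each chunk is scored with weights that have not yet been trained on it, and only then used for adaptation. `score_fn` and `update_fn` are hypothetical callbacks standing in for the model's evaluation and optimizer step; the learning rate and momentum would live inside `update_fn`.

```python
def score_first_ttt(tokens, score_fn, update_fn, chunk_size=32000, epochs=3):
    """Evaluate a token stream chunk by chunk, adapting only after each chunk is scored."""
    total_loss = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # 1) Score first: the reported loss never depends on training on this chunk.
        total_loss += score_fn(chunk)
        # 2) Then adapt on the chunk that was just scored.
        for _ in range(epochs):
            update_fn(chunk)
    return total_loss
```

Swapping steps 1 and 2 would leak the evaluated tokens into the weights before scoring. The score-before-update enforcement is meant to rule that out.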
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
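A warmdown of 3500 steps can be sketched as a learning-rate multiplier that stays at 1.0 and then decays over the final 3500 steps. The linear shape, the decay to zero, and the `total_steps` argument are assumptions; the summary gives only the warmdown length.

```python
def lr_scale(step, total_steps, warmdown_steps=3500):
    """Learning-rate multiplier: 1.0 until the warmdown begins, then linear decay to 0."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```

Under this assumption, a scale below 0.2 (the SWA condition listed under Weight Averaging) would cover the last 700 steps of training.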
Regularization
weight decay
parameters: {"value":0.04}
Novel Contributions
- Parallel Muon with parameter banking and batched Newton-Schulz updates
- LeakyReLU(0.5)^2 MLP 3x expansion
- Legal score-first test-time training (TTT) with score-before-update enforcement
- EMA plus SWA model averaging
- GPTQ-lite int6 quantization with per-row 5-percentile clip search
- Flash Attention 3 and torch.compile(fullgraph=True) without DDP
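The Newton-Schulz update named in the first bullet can be illustrated with a small pure-Python version of the quintic iteration used in the public Muon reference implementation. Whether this PR uses the same coefficients is an assumption; the batching and parameter banking are not modeled, and the reference implementation's transpose for tall matrices is omitted.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize matrix `g` with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the public Muon reference code
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / (norm + eps) for v in row] for row in g]  # Frobenius-normalize first
    for _ in range(steps):
        xxt = matmul(x, transpose(x))                    # A = X X^T
        xxt2 = matmul(xxt, xxt)                          # A^2
        poly = [[b * p + c * q for p, q in zip(r1, r2)]  # B = b*A + c*A^2
                for r1, r2 in zip(xxt, xxt2)]
        bx = matmul(poly, x)
        x = [[a * u + v for u, v in zip(r1, r2)]         # X <- a*X + B X
             for r1, r2 in zip(x, bx)]
    return x
```

The iteration does not converge to an exactly orthogonal matrix. It pushes every singular value into a band around 1, which is the property Muon relies on.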