PR #1276
openRecord: MuonEq-R + Context-Only SLOT + XSA-all + QK-Gain 5.0
by BiggerDABOSSView on GitHub
val_bpb: 1.1100
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~16 MB
Training Techniques
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"variant":"MuonEq-R","parallel":true}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"batch_seqs":32,"grad_clip":1}
Architecture
XSA
Cross-sequence attention extended to all layers
parameters: {"layers":11}
BigramHash
Bigram hash embedding component
parameters: {"vocab_size":1536}
Partial RoPE
Rotary positional embeddings applied to a subset of dimensions
parameters: {"dimensions":16}
VE128
Value residual enhancement module
parameters: {"layers":[9,10],"dimension":128}
MLP3x
Three-times widened MLP with LeakyReLU activation
parameters: {"activation":"LeakyReLU","activation_slope":0.5}
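A minimal numpy sketch of the MLP3x block as described above: hidden width three times the model width, with LeakyReLU at the record's 0.5 negative slope. The weight initialization and dimensions here are illustrative, not taken from the PR.

```python
import numpy as np

def leaky_relu(x, slope=0.5):
    """LeakyReLU with the record's negative slope of 0.5."""
    return np.where(x >= 0, x, slope * x)

def mlp3x(x, w_in, w_out):
    """3x-widened MLP block: d_model -> 3*d_model -> d_model."""
    return leaky_relu(x @ w_in) @ w_out

# Toy dimensions for illustration only.
d_model = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((4, d_model))
w_in = rng.standard_normal((d_model, 3 * d_model)) / np.sqrt(d_model)
w_out = rng.standard_normal((3 * d_model, d_model)) / np.sqrt(3 * d_model)
y = mlp3x(x, w_in, w_out)
```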
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997,"swa_every":50}
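The weight-averaging pair above can be sketched as follows: an EMA of the weights with decay 0.997 plus an SWA running mean over checkpoints taken every 50 steps. The flat numpy parameter vector and the constant-step "optimizer" are stand-ins for the real training loop.

```python
import numpy as np

def ema_update(ema, params, decay=0.997):
    """Exponential moving average of weights (decay from the record)."""
    return decay * ema + (1.0 - decay) * params

class TightSWA:
    """Running mean over snapshots taken every `every` optimizer steps."""
    def __init__(self, every=50):
        self.every = every
        self.avg = None
        self.count = 0

    def maybe_update(self, step, params):
        if step % self.every != 0:
            return
        self.count += 1
        if self.avg is None:
            self.avg = params.copy()
        else:
            self.avg += (params - self.avg) / self.count

params = np.zeros(3)
ema = params.copy()
swa = TightSWA(every=50)
for step in range(1, 201):
    params = params + 0.01  # stand-in for a real optimizer step
    ema = ema_update(ema, params)
    swa.maybe_update(step, params)
```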
Quantization
GPTQ-lite
bits: 6
scope: all
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
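A sketch of the stride-64 sliding-window schedule implied by the parameters above: the first window scores all of its seq_len tokens, and each later window scores only its final `stride` new tokens, reusing the preceding seq_len − stride tokens as context. This is the conventional scheme; the PR's exact loop may differ.

```python
def scored_spans(n_tokens, seq_len=2048, stride=64):
    """Return (ctx_start, score_start, end) spans for sliding-window eval.
    Every token is scored exactly once; later windows get up to
    seq_len - stride tokens of reused context."""
    spans = []
    score_start = 0
    while score_start < n_tokens:
        step = seq_len if score_start == 0 else stride
        end = min(score_start + step, n_tokens)
        ctx_start = max(0, end - seq_len)
        spans.append((ctx_start, score_start, end))
        score_start = end
    return spans

spans = scored_spans(4200, seq_len=2048, stride=64)
```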
Sequence Length
sequence_length
train_length: 32768
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
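The warmdown schedule above can be sketched as constant LR followed by a linear decay to zero over the final 3500 steps; the total step count below is illustrative.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then linear decay to 0 over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```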
Other
other
Context-only SLOT test-time delta optimization on past tokens only, reinitialized each sliding window
parameters: {"delta_shape":[1,1,512],"optimizer":"AdamW","learning_rate":0.005,"steps":8}
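A minimal sketch of the context-only SLOT step described above: a fresh additive delta of shape (1, 1, 512) is fit with AdamW (lr 0.005, 8 steps) on past tokens only, then discarded when the window slides. The quadratic loss and hand-rolled AdamW here are stand-ins for the model's LM loss and the real optimizer.

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=0.005, b1=0.9, b2=0.999, eps=1e-8, wd=0.0):
    """One AdamW update with decoupled weight decay."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
    return p, m, v

def slot_delta(past_hidden, target, steps=8, lr=0.005):
    """Fit an additive delta of shape (1, 1, d) on past tokens only.
    A fresh (zero) delta is used for each sliding window; the quadratic
    loss below is a toy stand-in for the LM loss on past tokens."""
    d = past_hidden.shape[-1]
    delta = np.zeros((1, 1, d))
    m, v = np.zeros_like(delta), np.zeros_like(delta)
    for t in range(1, steps + 1):
        resid = (past_hidden + delta) - target
        g = 2.0 * resid.mean(axis=(0, 1), keepdims=True)
        delta, m, v = adamw_step(delta, g, m, v, t, lr=lr)
    return delta

rng = np.random.default_rng(0)
h = rng.standard_normal((1, 32, 512))
tgt = h + 0.1  # toy target: hidden states shifted by a constant
delta = slot_delta(h, tgt)
```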
Novel Contributions
- MuonEq-R row-normalization before Newton-Schulz orthogonalization
- Context-only SLOT optimized on past tokens during sliding-window evaluation
- XSA extended from last 4 layers to all 11 layers
- QK gain increased to 5.0
- Combined stack targeting about 1.110 val_bpb
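The first contribution can be sketched as follows, with the caveat that the exact MuonEq-R formulation lives in the PR: each row of the update matrix is normalized to unit norm before Newton-Schulz orthogonalization. The cubic Newton-Schulz iteration shown here is a simplification; Muon proper uses a tuned quintic polynomial.

```python
import numpy as np

def newton_schulz_orth(G, steps=10):
    """Approximately orthogonalize G via cubic Newton-Schulz iteration.
    Frobenius normalization keeps all singular values in (0, 1], where
    the map X -> 1.5*X - 0.5*(X X^T) X converges toward 1."""
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = 1.5 * X - 0.5 * A @ X
    return X

def muon_eq_r_update(G, steps=10):
    """MuonEq-R sketch (assumption): normalize each row of the update
    matrix to unit norm before Newton-Schulz orthogonalization."""
    rows = np.linalg.norm(G, axis=1, keepdims=True)
    return newton_schulz_orth(G / (rows + 1e-7), steps)

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 8))
U = muon_eq_r_update(G)
```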