PR #1016
Status: open
11L VRL + Parallel Muon + Legal TTT v2 (val_bpb=1.1269, non-record)
by ADIITJ
val_bpb
1.1269
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.8 MB
Training Techniques
Architecture
Value Residual
The value (V) output of layer 0 is blended into each subsequent layer's V via learned sigmoid gates.
parameters: {"layers":10}
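A minimal sketch of the gated blend, assuming the common VRL formulation: a later layer's value vectors are a sigmoid-gated mix of its own V and layer 0's V. The scalar gate parameter `gate_logit` and the shapes are illustrative, not taken from the PR.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def blend_values(v_layer, v0, gate_logit):
    """Blend this layer's V with layer 0's V using a learned scalar gate."""
    g = sigmoid(gate_logit)            # gate in (0, 1), learned per layer
    return g * v_layer + (1.0 - g) * v0

v0 = np.ones((4, 8))                   # layer-0 value output (seq=4, dim=8)
v5 = np.zeros((4, 8))                  # a later layer's value output
out = blend_values(v5, v0, gate_logit=0.0)   # sigmoid(0) = 0.5
# out is the elementwise average of v5 and v0
```

At gate logit 0 the blend starts as an even mix; training can push each layer's gate toward using mostly its own V or mostly layer 0's.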
BigramHash
Bigram hash embedding size doubled to 3072 to improve bpb.
parameters: {"dimensions":3072}
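A sketch of a bigram hash embedding lookup, reading the listed 3072 as the embedding dimension. The bucket count, the multiplicative hash, and the pad bucket for the first position are all assumptions for illustration.

```python
import numpy as np

N_BUCKETS = 256                         # hypothetical table size
DIM = 3072                              # embedding dimension from the card
table = np.random.default_rng(0).normal(size=(N_BUCKETS, DIM))

def bigram_bucket(tok_a, tok_b):
    # Simple multiplicative hash of the (prev, curr) token-id pair.
    return (tok_a * 1000003 + tok_b) % N_BUCKETS

def bigram_embed(tokens):
    # The first position has no predecessor; use bucket 0 as a pad.
    buckets = [0] + [bigram_bucket(a, b) for a, b in zip(tokens, tokens[1:])]
    return table[buckets]

emb = bigram_embed([5, 17, 42])         # one 3072-dim row per position
```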
weight tying
Tied embeddings are used.
parameters: null
LeakyReLU
LeakyReLU squared activation is used in the MLP.
parameters: {"slope":0.5}
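A sketch of the activation, assuming the straightforward reading f(x) = leaky_relu(x, slope)^2 with the listed slope of 0.5 (the actual MLP may use a sign-preserving variant).

```python
def leaky_relu_sq(x, slope=0.5):
    """Squared LeakyReLU: leak negative inputs, then square."""
    y = x if x >= 0 else slope * x
    return y * y

leaky_relu_sq(2.0)    # -> 4.0
leaky_relu_sq(-2.0)   # -> 1.0  (0.5 * -2 = -1, squared = 1)
```

Note that squaring makes the output non-negative on both sides, unlike plain LeakyReLU.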
GQA
Grouped query attention with 4 KV heads.
parameters: {"kv_heads":4}
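A sketch of the GQA head grouping with the listed 4 KV heads; the query-head count (8) and head dimension are illustrative assumptions. Each contiguous group of query heads shares one KV head.

```python
import numpy as np

def gqa_scores(q, k, n_kv_heads=4):
    """q: (n_q_heads, d), k: (n_kv_heads, d).
    Each group of query heads dots against the same shared key head."""
    n_q = q.shape[0]
    group = n_q // n_kv_heads              # query heads per KV head
    kv_index = np.arange(n_q) // group     # map each q head -> its kv head
    return np.einsum('hd,hd->h', q, k[kv_index])

q = np.ones((8, 4))                               # 8 query heads, dim 4
k = np.arange(16, dtype=float).reshape(4, 4)      # 4 KV heads
scores = gqa_scores(q, k)    # heads 0-1 share KV head 0, heads 2-3 share KV head 1, ...
```

Sharing KV heads shrinks the KV cache 2x here (8 Q heads over 4 KV heads) at little quality cost.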
Weight Averaging
Tight SWA: late-training snapshots are averaged in place of an EMA.
parameters: null
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"full_length_windows_only":true,"fixed_scoring_offset":true}
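One plausible reading of the two flags, sketched below: only full-length windows are formed, and each window after the first scores a fixed-offset suffix, so every token is scored exactly once with maximal context. Window and stride sizes are toy values, not the PR's.

```python
def windows(n_tokens, window=8, stride=4):
    """Yield (start, end, score_from) spans for full-length windows only."""
    spans = []
    for start in range(0, n_tokens - window + 1, stride):
        # Score only the last `stride` tokens of each window, except the
        # first window, which also scores its prefix.
        score_from = start if start == 0 else start + window - stride
        spans.append((start, start + window, score_from))
    return spans

spans = windows(16)   # every token in [0, 16) is scored exactly once
```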
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"momentum":0.9,"epochs":3,"chunk_size":32000}
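A sketch of the score-first loop, assuming "score-first" means each chunk is scored with the current weights before the model adapts on it, so no chunk is scored after training on itself. The toy model and loss below are purely illustrative; the listed learning rate, momentum, and chunk size are not exercised.

```python
def ttt_stream(chunks, score, update, lr=2e-3, epochs=3):
    """Score each chunk BEFORE adapting on it, then take update steps."""
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        total_bits += score(chunk)        # score first: no self-leakage
        total_tokens += len(chunk)
        for _ in range(epochs):           # then adapt on the scored chunk
            update(chunk, lr)
    return total_bits / total_tokens      # bits per token

# Toy "model": loss is the count of unseen symbols; update memorizes them.
seen = set()
score = lambda c: sum(1.0 for t in c if t not in seen)
update = lambda c, lr: seen.update(c)
bpb = ttt_stream([[1, 2], [1, 2], [3, 1]], score, update)   # -> 0.5
```

The second chunk costs nothing because the first identical chunk was already learned, while the first chunk pays full price despite being identical, which is the point of scoring first.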
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
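A sketch of a warmdown schedule, assuming a constant learning rate followed by a linear decay to zero over the final 3500 steps. The base LR and total step count are illustrative.

```python
def lr_at(step, base_lr, total_steps, warmdown_steps=3500):
    """Constant LR, then linear warmdown to 0 over the last steps."""
    if step < total_steps - warmdown_steps:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps

lr_at(0, 0.01, 10000)      # -> 0.01  (constant phase)
lr_at(8250, 0.01, 10000)   # -> 0.005 (halfway through warmdown)
```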
Regularization
LN scale
parameters: {"scale":"1/sqrt(L+1)"}
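A sketch of the listed scale rule, assuming each block's LayerNorm output is multiplied by 1/sqrt(L+1) with L the zero-based layer index, so deeper layers contribute progressively smaller residual updates.

```python
import math

def ln_scale(layer_index):
    """Depth-dependent LayerNorm output scale: 1 / sqrt(L + 1)."""
    return 1.0 / math.sqrt(layer_index + 1)

ln_scale(0)   # -> 1.0
ln_scale(3)   # -> 0.5
```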
Novel Contributions
- Value Residual Learning (VRL) with learned sigmoid gates
- BigramHash size doubled to 3072
- Tight SWA used instead of EMA when snapshots are available
- zstd-22 artifact compression
- Sliding window evaluation bug fix
- TTT enabled by default with all blocks unfrozen
- Dropped full GPTQ in favor of GPTQ-lite