PR #1016

open

11L VRL + Parallel Muon + Legal TTT v2 (val_bpb=1.1269, non-record)

val_bpb
1.1269
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.8 MB

Training Techniques

Architecture
Value Residual
The value (V) output of layer 0 is blended into the attention of subsequent layers via learned sigmoid gates.
parameters: {"layers":10}
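A minimal sketch of the gating, assuming a learned per-layer scalar gate logit (the PR does not spell out the gate's exact shape):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def blend_value(v_layer, v0, gate_logit):
    # alpha in (0, 1): how much of the current layer's V to keep,
    # with the remainder taken from the layer-0 V output
    alpha = sigmoid(gate_logit)
    return alpha * v_layer + (1.0 - alpha) * v0

# gate_logit = 0.0 gives alpha = 0.5: an equal blend of the two V outputs
blended = blend_value(np.ones(4), np.zeros(4), 0.0)
```

Because the gate is a sigmoid, the blend is always a convex combination of the layer-0 and current-layer values, so it can smoothly interpolate between the two during training.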
BigramHash
The bigram hash embedding dimension is increased to 3072 to improve val_bpb.
parameters: {"dimensions":3072}
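A sketch of how a hashed bigram embedding lookup works; the bucket count and the multiplicative hash are stand-ins (only the 3072 embedding width comes from the PR):

```python
import numpy as np

TABLE_ROWS = 65536   # number of hash buckets (hypothetical; not stated in the PR)
EMBED_DIM = 3072     # embedding width, per the PR parameters

table = np.zeros((TABLE_ROWS, EMBED_DIM))

def bigram_row(tok_prev, tok_cur):
    # simple multiplicative hash as a stand-in for the real mixing function
    return (tok_prev * 1000003 + tok_cur) % TABLE_ROWS

vec = table[bigram_row(17, 42)]   # hashed embedding for the bigram (17, 42)
```

Hashing keeps the table size fixed regardless of vocabulary size; distinct bigrams may collide into the same row, which the model learns around.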
Weight Tying
Input and output embedding weights are tied.
parameters: null
LeakyReLU
LeakyReLU squared activation is used in the MLP.
parameters: {"slope":0.5}
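A sketch of the activation, assuming the LeakyReLU output is squared directly (the PR does not say whether the sign is preserved; slope 0.5 is from its parameters):

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    # LeakyReLU with the PR's slope of 0.5, then squared
    y = np.where(x >= 0, x, slope * x)
    return y * y

out = leaky_relu_squared(np.array([2.0, -2.0]))
```

With slope 0.5, a negative input of -2.0 first becomes -1.0 and then 1.0 after squaring, so negative inputs still contribute (attenuated) signal rather than being zeroed as in plain ReLU-squared.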
GQA
Grouped query attention with 4 KV heads.
parameters: {"kv_heads":4}
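A sketch of the KV-head sharing in GQA; only kv_heads=4 comes from the PR, the query head count and dimensions here are hypothetical:

```python
import numpy as np

N_Q_HEADS = 16    # query head count (hypothetical; the PR only fixes kv_heads)
N_KV_HEADS = 4    # per the PR parameters

def expand_kv(kv):
    # repeat each KV head so a contiguous group of query heads shares it
    group = N_Q_HEADS // N_KV_HEADS
    return np.repeat(kv, group, axis=0)

k = np.random.randn(N_KV_HEADS, 8, 16)   # (kv_heads, seq, head_dim)
k_full = expand_kv(k)                    # (16, 8, 16): 4 query heads per KV head
```

Shrinking the KV heads from 16 to 4 cuts the KV cache by 4x while queries keep their full head count, which is the usual GQA trade-off.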
Weight Averaging
Tight stochastic weight averaging (SWA) over late-training snapshots, used instead of EMA when snapshots are available.
parameters: null
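A minimal sketch of the averaging step, assuming parameters are held as name-to-tensor dicts ("tight" is read here as a uniform average over a short window of snapshots near the end of training; the PR does not give the window size):

```python
def tight_swa(snapshots):
    # uniform average of the same-named parameter across all snapshots
    n = len(snapshots)
    return {name: sum(s[name] for s in snapshots) / n for name in snapshots[0]}

avg = tight_swa([{"w": 1.0}, {"w": 3.0}])
```

Unlike EMA, which weights recent steps exponentially and must run online, this averages stored snapshots with equal weight after the fact.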
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"full_length_windows_only":true,"fixed_scoring_offset":true}
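A sketch of window selection under the two eval flags; taking the fixed scoring offset to be window - stride (so only new tokens are scored after the first window) is an assumption, as the PR only names the flags:

```python
def eval_spans(n_tokens, window, stride):
    # returns (window_start, score_from_offset_within_window) pairs;
    # only full-length windows are emitted (full_length_windows_only), and
    # after the first window scoring starts at one fixed in-window offset
    # (fixed_scoring_offset), here assumed to be window - stride
    offset = window - stride
    spans, start, first = [], 0, True
    while start + window <= n_tokens:
        spans.append((start, 0 if first else offset))
        first = False
        start += stride
    return spans

spans = eval_spans(10, window=4, stride=2)
```

Dropping partial windows and pinning the scoring offset makes every scored token see the same amount of left context, which removes a source of eval noise.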
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"momentum":0.9,"epochs":3,"chunk_size":32000}
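A sketch of the score-first loop with the PR's hyperparameters (lr 0.002, momentum 0.9, 3 epochs, 32000-token chunks); the `model.score` / `model.train_step` interface and the DummyModel are hypothetical stand-ins:

```python
def score_first_ttt(model, chunks, epochs=3, lr=0.002, momentum=0.9):
    # each chunk is scored with the current weights BEFORE the model
    # trains on it, so no chunk is ever scored after being fit
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        loss, n_tokens = model.score(chunk)   # score first
        total_loss += loss * n_tokens
        total_tokens += n_tokens
        for _ in range(epochs):               # then adapt on the same chunk
            model.train_step(chunk, lr=lr, momentum=momentum)
    return total_loss / total_tokens

class DummyModel:
    # stand-in so the loop runs end to end; a real model would update weights
    def __init__(self):
        self.updates = 0
    def score(self, chunk):
        return 1.0, len(chunk)
    def train_step(self, chunk, lr, momentum):
        self.updates += 1

model = DummyModel()
mean_loss = score_first_ttt(model, [[0] * 32000, [0] * 32000])
```

The score-then-train order is what keeps the reported val_bpb honest: the model still benefits from adapting to earlier chunks when scoring later ones, but never from fitting the exact tokens being scored.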
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
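A sketch of the schedule, assuming a constant LR followed by a linear ramp to zero over the final 3500 steps (the PR gives only warmdown_steps, not the decay shape):

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3500):
    # constant LR, then a linear decay to zero over the last warmdown_steps
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps

lrs = [lr_at(s, 10000, 0.01) for s in (0, 8250, 10000)]
```

At step 8250 of 10000, 1750 steps remain of the 3500-step warmdown, so the LR is at half its base value.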
Regularization
LN scale
LayerNorm output is scaled by 1/sqrt(L+1), where L is the layer index.
parameters: {"scale":"1/sqrt(L+1)"}
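A one-liner for the scale factor, assuming a 0-based layer index L (so layer 0 gets scale 1):

```python
import math

def ln_scale(layer_index):
    # fixed LayerNorm output scale of 1/sqrt(L+1) for layer index L (0-based)
    return 1.0 / math.sqrt(layer_index + 1)

scales = [ln_scale(i) for i in range(4)]
```

The scale shrinks with depth, damping the contribution of later layers to the residual stream, which is the regularizing effect the entry refers to.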

Novel Contributions

  • Value Residual Learning (VRL) with learned sigmoid gates
  • BigramHash size doubled to 3072
  • Tight SWA used instead of EMA when snapshots are available
  • zstd-22 artifact compression
  • Sliding window evaluation bug fix
  • TTT enabled by default with all blocks unfrozen
  • Dropped full GPTQ in favor of GPTQ-lite