PR #1977

open

SP8192 + PolarExpressNS + MIN_LR + LQER Asym Rank-4 | val_bpb=1.07302 (3-seed mean)

by sahiee-dev
val_bpb: 1.0730
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,953,488 bytes

Training Techniques

Architecture
SP8192
Uses the SP8192 tokenizer base.
parameters: null
SmearGate
Adds SmearGate and AttnOutGate with width 24.
parameters: {"width":24}
depth recurrence
Implements a 3-layer depth recurrence mechanism.
parameters: {"layers":3}
weight tying
Uses tied input/output embeddings, if the base stack implies them.
parameters: null
Gated Attention
Includes AttnOutGate as part of the attention/output gating stack.
parameters: {"width":24}
PolarExpressNS
Uses Polar Express Newton-Schulz coefficients.
parameters: null
LQER
Uses asymmetric rank-4 LQER with top-K=3.
parameters: {"rank":4,"top_k":3,"asymmetric":true}
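The LQER entry can be illustrated with a minimal sketch of low-rank quantization error reconstruction: quantize a weight matrix, then approximate the leftover error with a rank-4 product. Several things here are assumptions, not details from the PR: "asymmetric" is read as min/max (zero-point) quantization, "top-K=3" is read as correcting only the three matrices with the largest quantization error, and the 4-bit width and all function names are hypothetical.

```python
import numpy as np

def quantize_asymmetric(w, bits=4):
    """Min/max (zero-point) uniform quantization, returned in dequantized form.
    Assumption: this is what "asymmetric" means; the PR does not say."""
    lo, hi = float(w.min()), float(w.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((w - lo) / scale)
    return codes * scale + lo

def low_rank_correction(w, w_q, rank=4):
    """Factors (a, b) whose product is the best rank-`rank` fit to w - w_q."""
    u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # fold singular values into the left factor
    b = vt[:rank]
    return a, b

def lqer_top_k(weights, rank=4, top_k=3, bits=4):
    """Quantize every matrix, then correct only the top_k with the largest error.
    Assumption: this selection rule is one possible reading of "top-K=3"."""
    quantized = {n: quantize_asymmetric(w, bits) for n, w in weights.items()}
    errors = {n: np.linalg.norm(weights[n] - quantized[n]) for n in weights}
    chosen = sorted(errors, key=errors.get, reverse=True)[:top_k]
    corrections = {n: low_rank_correction(weights[n], quantized[n], rank)
                   for n in chosen}
    return quantized, corrections

# Toy usage on random matrices (hypothetical layer names).
rng = np.random.default_rng(0)
weights = {f"layer{i}": rng.normal(size=(16, 12)) * (i + 1) for i in range(5)}
quantized, corrections = lqer_top_k(weights)
```

At inference time a corrected matrix would be reconstructed as `quantized[name] + a @ b`, so only the quantized codes and the two thin factors need to be stored.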
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"symmetric_row_col_normalization":true}
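The Newton-Schulz coefficients listed under PolarExpressNS feed the orthogonalization step that Muon applies to its updates. The sketch below shows that iteration with a per-step coefficient schedule. The PR does not list the Polar Express values, so the example falls back to the classic cubic Newton-Schulz coefficients; the function name and the `(a, b, c)` schedule format are assumptions made for illustration.

```python
import numpy as np

# Classic cubic Newton-Schulz step, written as (a, b, c) with c = 0.
# Assumption: a Polar Express schedule would supply a different tuple per step;
# those values are not given in the source and are not reproduced here.
CUBIC_STEP = (1.5, -0.5, 0.0)

def newton_schulz_orthogonalize(g, schedule):
    """Approximate the orthogonal polar factor of g, one (a, b, c) per step."""
    # Dividing by the Frobenius norm bounds every singular value by 1.
    x = g / (np.linalg.norm(g) + 1e-7)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation so the Gram matrix is small
    for a, b, c in schedule:
        gram = x @ x.T
        # Odd polynomial in x: a*x + b*(x x^T) x + c*(x x^T)^2 x
        x = a * x + (b * gram + c * gram @ gram) @ x
    return x.T if transposed else x

# Toy usage: orthogonalize a random 8x6 matrix with 30 cubic steps.
rng = np.random.default_rng(0)
update = newton_schulz_orthogonalize(rng.normal(size=(8, 6)), [CUBIC_STEP] * 30)
```

Tuned schedules aim to reach a usable approximation in far fewer steps than the fixed cubic rule needs; swapping one in only requires passing a different list of tuples.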
LR Schedule
warmdown
parameters: {"min_lr":0.1}
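A warmdown schedule with a floor can be sketched as follows. This assumes that `min_lr` is a fraction of the base learning rate and that the decay is linear; the PR states neither, and the function name and the `warmdown_steps` parameter are hypothetical.

```python
def lr_multiplier(step, total_steps, warmdown_steps, min_lr=0.1):
    """Multiplier on the base LR: flat at 1.0, then linear decay to min_lr.
    Assumption: linear shape and fractional floor, not confirmed by the PR."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    # Fraction of the warmdown completed, clamped to [0, 1].
    progress = min(1.0, (step - warmdown_start) / max(1, warmdown_steps))
    return 1.0 - (1.0 - min_lr) * progress
```

With a floor of 0.1 the final steps still train at a tenth of the base rate instead of decaying all the way to zero.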
Test-Time Training
score-first TTT
parameters: null
LoRA TTT
parameters: null
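The ordering implied by "score-first" can be sketched as a loop that records the loss on each chunk before adapting on it. This is a toy illustration under assumptions: the PR gives no parameters, the callback-based interface is hypothetical, and a LoRA variant would restrict `adapt` to updating low-rank adapter weights.

```python
def score_first_ttt(chunks, predict, loss, adapt, state):
    """Score each chunk with the current state, then adapt on that same chunk."""
    total, count = 0.0, 0
    for chunk in chunks:
        # 1) Score first: the loss is recorded before the state has seen the chunk.
        total += sum(loss(predict(state, x), y) for x, y in chunk)
        count += len(chunk)
        # 2) Adapt afterwards, so the update can only help later chunks.
        state = adapt(state, chunk)
    return total / count, state

# Toy usage: a one-weight model y = w * x fitted to data generated with w = 2.
def predict(w, x):
    return w * x

def squared_error(pred, y):
    return (pred - y) ** 2

def sgd_step(w, chunk, lr=0.1):
    grad = sum(2 * (w * x - y) * x for x, y in chunk) / len(chunk)
    return w - lr * grad

chunks = [[(1.0, 2.0), (2.0, 4.0)]] * 5
mean_loss, final_w = score_first_ttt(chunks, predict, squared_error, sgd_step, 0.0)
```

Because scoring always precedes the update, the reported loss never benefits from having trained on the tokens being scored.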

Novel Contributions

  • SP8192 tokenizer base
  • SmearGate + AttnOutGate width 24
  • LoRA TTT improvements
  • Phased TTT
  • Polar Express Newton-Schulz coefficients
  • MIN_LR=0.10 warmdown floor
  • LQER asymmetric rank-4 with top-K=3
  • 3-layer depth recurrence
  • Muon optimizer with symmetric row/column normalization
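The 3-layer depth recurrence listed above can be pictured as running a span of three layers more than once with the same weights. The sketch below is an assumption-laden illustration: the PR does not say which layers are looped or how many passes are made, so `loop_start` and `repeats` are hypothetical parameters.

```python
def run_with_depth_recurrence(x, layers, loop_start, loop_len=3, repeats=2):
    """Apply `layers` in order, running the loop_len-layer span `repeats` times."""
    loop_end = loop_start + loop_len
    for layer in layers[:loop_start]:
        x = layer(x)
    for _ in range(repeats):
        for layer in layers[loop_start:loop_end]:  # same weights on every pass
            x = layer(x)
    for layer in layers[loop_end:]:
        x = layer(x)
    return x

# Toy usage: each "layer" appends its index, so the result is the execution order.
layers = [lambda trace, i=i: trace + [i] for i in range(5)]
order = run_with_depth_recurrence([], layers, loop_start=1)
```

Reusing layers this way adds effective depth without adding parameters, which matters when the artifact size is capped.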