PR #1561 (open)

Record: SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97 — val_bpb 1.0783 (3-seed mean)

by EthanYangTW
val_bpb: 1.0783
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.99 MB

Training Techniques

Architecture
depth recurrence
Triple-depth recurrence with 11 physical and 17 virtual layers.
parameters: {"physical_layers":11,"virtual_layers":17}
weight banking
Parameter sharing via weight banking.
parameters: null
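A minimal sketch of the recurrence + banking combination, assuming a round-robin reuse schedule (the record does not say which physical blocks are revisited) and a hypothetical `make_block` factory:

```python
import torch.nn as nn

class BankedRecurrentStack(nn.Module):
    """Runs 17 virtual layers through a bank of 11 physical blocks by
    revisiting some of them. The round-robin schedule below is an
    illustrative assumption, not the PR's actual schedule."""
    def __init__(self, make_block, physical_layers=11, virtual_layers=17):
        super().__init__()
        self.bank = nn.ModuleList(make_block() for _ in range(physical_layers))
        self.schedule = [i % physical_layers for i in range(virtual_layers)]

    def forward(self, x):
        for idx in self.schedule:
            x = self.bank[idx](x)  # shared weights: blocks 0-5 run twice
        return x
```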
MLP3x
MLP with 4x expansion (2048 hidden), fused via Triton/CUTLASS kernels.
parameters: {"hidden_dim":2048}
GQA
Grouped query attention with 8 query heads and 4 KV heads.
parameters: {"q_heads":8,"kv_heads":4,"head_dim":64}
XSA
XSA enabled across all layers.
parameters: {"layers":11}
LeakyReLU
LeakyReLU squared activation.
parameters: {"slope":0.5}
U-Net skip connections
Skip-gated parallel residual connections in the later layers.
parameters: {"start_layer":7}
hash embedding
Eval-time hash embedding trained during TTT.
parameters: {"dimensions":[16384,512]}
Quantization
GPTQ
bits: 6
scope: weights
int8
bits: 8
scope: embeddings
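The 6-bit GPTQ pass involves Hessian-aware rounding and is not reproduced here; below is a sketch of the simpler int8 embedding side, assuming per-row symmetric scales:

```python
import torch

def quantize_int8(emb):
    """Per-row symmetric int8 quantization of an embedding matrix, of the
    kind applied to the embeddings in this record."""
    scale = emb.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp((emb / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.to(torch.float32) * scale
```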
Weight Averaging
EMA
parameters: {"decay":0.997}
Optimizer
Parallel Muon
weight_decay: 0.095
momentum: 0.97
other_params: {"lr":0.022}
SGD
weight_decay: null
momentum: 0.9
other_params: {"lr":0.01}
Compression
brotli
level: null
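With the level unset, the Python brotli binding's default quality (11) applies. Wrapping a torch.save'd state dict is an assumed artifact layout:

```python
import io
import brotli  # pip install brotli
import torch

def save_compressed(model, path):
    """Serialize the state dict, then brotli-compress the bytes."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    with open(path, "wb") as f:
        f.write(brotli.compress(buf.getvalue()))

def load_compressed(path):
    with open(path, "rb") as f:
        raw = brotli.decompress(f.read())
    return torch.load(io.BytesIO(raw))
```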
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.01,"epochs":3,"chunk_size":32000}
Regularization
logit softcap
parameters: {"value":30}
LR Schedule
warmdown
parameters: {"warmdown":0.667}
Sequence Length
sequence_length
train_length: 32000
eval_length: 8192

Novel Contributions

  • SP8192 tokenizer
  • Triple-depth recurrence with parameter banking
  • Parallel Muon optimization
  • Fused MLP Triton/CUTLASS kernel fusion
  • Score-first TTT: each chunk scored under no_grad() before updates, for compliance
  • Eval-time hash embedding trained during TTT
  • GPTQ int6 with int8 embeddings and brotli compression
  • Fixed LZMA decompression wrapper from prior submission