PR #1826

open

Record: SP8192 + PE + SmearGate + AttnOutGate + 4ep TTT — val_bpb 1.0770 (3-seed mean)

by EthanYangTW
val_bpb: 1.0770
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.98 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: block weights
int8
bits: 8
scope: embeddings
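
To make the two bit widths concrete, here is a minimal sketch of symmetric per-row round-to-nearest quantization at 6 and 8 bits. This is an illustration only: it is not GPTQ (which additionally compensates rounding error using second-order statistics), the per-row scaling is an assumption, and the function names `quantize_rows` and `dequantize_rows` are hypothetical.

```python
import numpy as np

def quantize_rows(w, bits):
    """Symmetric per-row round-to-nearest quantization (illustrative, not GPTQ)."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits, 127 for 8 bits
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_rows(q, scale):
    """Map integer codes back to floating point using the stored row scales."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)).astype(np.float32)
q6, s6 = quantize_rows(w, bits=6)   # stand-in for the block-weight setting
q8, s8 = quantize_rows(w, bits=8)   # stand-in for the embedding setting
err6 = np.abs(dequantize_rows(q6, s6) - w).max()
err8 = np.abs(dequantize_rows(q8, s8) - w).max()
```

With round-to-nearest, the worst-case error per element is half a quantization step, so the 6-bit grid is roughly four times coarser than the 8-bit one for the same row.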
Architecture
SmearGate
Causal content-gated residual with zero-init transparency
parameters: null
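
The one-line description above can be read in more than one way. A minimal sketch of one reading follows, assuming that each position mixes in the previous position's activations, that the gate is a sigmoid of the current token's content, and that a zero-initialized scalar `lam` provides the "transparency" at initialization. The exact form used in the PR is not given here, and `smear_gate`, `w_gate`, and `lam` are hypothetical names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, w_gate, lam):
    """x: (T, D) activations, w_gate: (D,) gate weights, lam: scalar mix strength.

    Position t receives a gated copy of position t-1 only, so the mix is causal.
    With lam = 0 the function returns x unchanged (transparent at init).
    """
    gate = sigmoid(x[1:] @ w_gate)[:, None]  # (T-1, 1), computed from current token
    out = x.copy()
    out[1:] += lam * gate * x[:-1]           # position 0 has no predecessor
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
w_gate = rng.standard_normal(8)
y_init = smear_gate(x, w_gate, lam=0.0)      # identity at zero init
y_mixed = smear_gate(x, w_gate, lam=0.5)
```

Perturbing position 3 of `x` leaves outputs at positions 0 through 2 unchanged, which is the causal property the description refers to.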
Gated Attention
Per-head sigmoid gate on attention output
parameters: {"width":12}
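
A per-head sigmoid gate on the attention output could look like the sketch below. It assumes that `width` is the number of leading input channels that feed the gate projection, which is a guess at the meaning of the parameter; the function and argument names are hypothetical.

```python
import numpy as np

def gated_attention_output(attn_out, x, w_gate, width=12):
    """attn_out: (T, H, Dh) per-head attention output.
    x: (T, D) layer input, w_gate: (width, H) gate projection.
    Returns attn_out scaled by one sigmoid gate per token and head."""
    logits = x[:, :width] @ w_gate               # (T, H)
    gate = 1.0 / (1.0 + np.exp(-logits))         # sigmoid, values in (0, 1)
    return attn_out * gate[:, :, None]           # broadcast over head dim

rng = np.random.default_rng(0)
T, H, Dh, D = 4, 2, 3, 16
attn_out = rng.standard_normal((T, H, Dh))
x = rng.standard_normal((T, D))
w_zero = np.zeros((12, H))                       # zero weights give gate = 0.5
y = gated_attention_output(attn_out, x, w_zero)
```

With zero gate weights every head is scaled by exactly 0.5, so the gate starts as a uniform rescaling and only becomes content-dependent as `w_gate` is trained.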
Optimizer
Muon
weight_decay: 0.095
momentum: 0.97
other_params: {"lr":0.022}
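
For readers unfamiliar with Muon, the sketch below shows one update step with the hyperparameters listed above. It makes several assumptions: it uses the widely circulated fixed quintic Newton-Schulz coefficients rather than the per-iteration Polar Express tuples this PR describes, it applies Nesterov-style momentum and decoupled weight decay, and it omits the row normalization. `newton_schulz` and `muon_step` are illustrative names, not the PR's code.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # fixed tuple; NOT the PR's Polar Express values
    X = G / (np.linalg.norm(G) + eps)  # Frobenius norm bounds singular values by 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                        # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.022, momentum=0.97, weight_decay=0.095):
    """One Muon-style update; returns the new weights and momentum buffer."""
    buf = momentum * buf + grad
    update = newton_schulz(grad + momentum * buf)    # Nesterov-style lookahead
    W = W * (1.0 - lr * weight_decay) - lr * update  # decoupled weight decay
    return W, buf

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
G = rng.standard_normal((8, 4))
W_new, buf = muon_step(W, G, np.zeros_like(W))
sing = np.linalg.svd(newton_schulz(G), compute_uv=False)
```

After the iteration the singular values in `sing` sit in a band around 1 rather than exactly at 1; a scheme that changes the coefficient tuple per iteration, as the PR describes, is one way to tighten that band.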
Weight Averaging
EMA
parameters: {"decay":0.997}
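
The EMA rule with the decay above can be sketched as follows, using a plain dictionary of floats as a stand-in for model parameters (an assumption made to keep the example self-contained; `ema_update` is a hypothetical name).

```python
def ema_update(ema, params, decay=0.997):
    """Blend the current parameters into the running average, in place."""
    for name, value in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value
    return ema

params = {"w": 1.0}
ema = dict(params)        # start the average at the initial weights
params["w"] = 2.0         # pretend an optimizer step moved the weight
ema_update(ema, params)   # ema["w"] is now 0.997 * 1.0 + 0.003 * 2.0
```

A decay of 0.997 corresponds to an averaging horizon of roughly 1 / (1 - 0.997), or about 333 steps.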
LR Schedule
warmdown
parameters: {"min_lr":0.1}
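
One way to read the `min_lr` setting is as a floor on the learning-rate multiplier during warmdown, so the rate decays to 10% of its peak instead of to zero. The sketch below assumes a linear decay and a hypothetical `warmdown_frac` parameter, neither of which is stated in the source.

```python
def lr_multiplier(step, total_steps, warmdown_frac=0.5, min_lr=0.1):
    """Constant at 1.0, then linear decay to min_lr over the final warmdown_frac."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return 1.0
    progress = (step - warmdown_start) / max(1, total_steps - warmdown_start)
    return 1.0 - (1.0 - min_lr) * min(1.0, progress)

schedule = [lr_multiplier(s, 1000) for s in (0, 500, 750, 1000)]
```

Under these assumptions the multiplier stays at 1.0 until step 500, reaches 0.55 halfway through the warmdown, and ends at the 0.1 floor.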
Regularization
logit softcap
parameters: {"value":30}
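
Logit softcapping is commonly implemented as a scaled tanh. The source states only the cap value, so the tanh form below is an assumption about which variant is used.

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly squash a logit into the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)

small = softcap(1.0)      # close to 1.0: near-identity for small logits
large = softcap(1000.0)   # saturates at the cap
```

Small logits pass through almost unchanged while large ones are bounded by the cap, which limits how confident any single prediction can become.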
Test-Time Training
score-first TTT
parameters: {"epochs":4,"learning_rate":0.01}
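
"Score-first" is taken here to mean that each chunk of evaluation data is scored with the current weights before the model trains on it, so no token is predicted by a model that has already seen it. The sketch below illustrates that ordering with a toy unigram model and plain gradient descent. The model, the chunk size, and all function names are assumptions for illustration; only the epoch count and learning rate come from the parameters above.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def chunk_nll(logits, chunk):
    """Total negative log-likelihood (nats) of a chunk under a unigram model."""
    probs = softmax(logits)
    return -sum(math.log(probs[t]) for t in chunk)

def sgd_epoch(logits, chunk, lr):
    """One gradient step on the chunk's mean NLL."""
    probs = softmax(logits)
    counts = [0] * len(logits)
    for t in chunk:
        counts[t] += 1
    n = len(chunk)
    # Gradient of mean NLL w.r.t. logits is (model prob - empirical frequency).
    return [l - lr * (p - c / n) for l, p, c in zip(logits, probs, counts)]

def score_first_ttt(stream, vocab, chunk_size=32, epochs=4, lr=0.01):
    logits = [0.0] * vocab
    total = 0.0
    for i in range(0, len(stream), chunk_size):
        chunk = stream[i:i + chunk_size]
        total += chunk_nll(logits, chunk)   # 1) score with current weights
        for _ in range(epochs):             # 2) only then adapt on the chunk
            logits = sgd_epoch(logits, chunk, lr)
    return total

stream = [0, 0, 0, 1] * 64
adapted = score_first_ttt(stream, vocab=4)            # 4 epochs per chunk
static = score_first_ttt(stream, vocab=4, epochs=0)   # no adaptation baseline
```

On this repetitive toy stream the adapted total is lower than the static baseline, because every chunk after the first is scored by weights that have already moved toward the stream's statistics.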
Sequence Length
sequence_length
train_length: 8192
eval_length: null

Novel Contributions

  • Polar Express Newton-Schulz (NS) coefficients with per-iteration minimax-optimal tuples and row normalization
  • MIN_LR=0.10 warmdown floor
  • QK-Gain 5.25
  • SmearGate causal content-gated residual
  • Attention Output Gate with per-head sigmoid gating
  • 4-epoch score-first TTT
  • Achieves a 3-seed mean val_bpb of 1.0770 under the 16 MB artifact limit