PR #1825

closed

Record: SP8192 + PE + SmearGate + AttnOutGate + 4ep TTT — val_bpb 1.0770 (3-seed mean)

by EthanYangTW
val_bpb: 1.0770
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.98 MB

Training Techniques

Architecture
SmearGate
Causal content-gated residual with zero initialization.
parameters: null
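
The listing gives no formula for SmearGate, so the following is only a minimal sketch of one way a causal, content-gated residual with a zero-initialized scale could look. The gate weights `w`, the scalar `lam`, and the choice to mix in only the immediately preceding token are all assumptions, not details taken from the PR.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def smear_gate(x, w, lam):
    """x: list of T token vectors (each a list of D floats).
    w: D gate weights (hypothetical); lam: scalar residual scale (hypothetical).
    Computes y[t] = x[t] + lam * sigmoid(w . x[t]) * x[t-1]."""
    out = []
    for t, cur in enumerate(x):
        if t == 0:
            out.append(list(cur))  # first token has no predecessor to mix in
            continue
        # Gate depends only on the current token's content.
        gate = sigmoid(sum(wi * ci for wi, ci in zip(w, cur)))
        prev = x[t - 1]  # only past positions are read, so the op is causal
        out.append([c + lam * gate * p for c, p in zip(cur, prev)])
    return out
```

With `lam = 0.0` the function returns its input unchanged, which is one reading of "zero initialization": the layer starts as an identity and learns how much to smear.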
Gated Attention
Per-head sigmoid gate on attention outputs with zero initialization.
parameters: {"width":12}
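
As a rough illustration of per-head output gating, the sketch below scales each head's output for one token by a sigmoid gate. Reading `"width":12` as "the gate is computed from the first 12 channels of the token's input vector" is an assumption, as is the weight layout `W`.

```python
import math

def gate_attention_output(x_t, head_outputs, W, width=12):
    """x_t: input vector for one token.
    head_outputs: list of H per-head output vectors for that token.
    W: H rows of `width` gate weights (hypothetical layout)."""
    feats = x_t[:width]  # assumed: gate reads only the first `width` channels
    gated = []
    for w_h, out_h in zip(W, head_outputs):
        logit = sum(a * b for a, b in zip(w_h, feats))
        g = 1.0 / (1.0 + math.exp(-logit))  # one scalar gate per head
        gated.append([g * v for v in out_h])
    return gated
```

Under this form, zero-initialized gate weights give every head a gate of exactly 0.5 at the start of training.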
QK-Gain
Scaled query-key gain used in attention.
parameters: {"gain":5.25}
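
The entry does not say where the gain is applied. A common arrangement, assumed here, is to RMS-normalize queries and keys and then multiply the query by a fixed gain before the dot product, so the gain directly controls how sharp the attention softmax can become. The helper names are hypothetical.

```python
import math

def rms_norm(v, eps=1e-6):
    rms = math.sqrt(sum(a * a for a in v) / len(v) + eps)
    return [a / rms for a in v]

def attention_weights(q, keys, gain=5.25):
    """Softmax attention weights of one query over a list of keys."""
    d = len(q)
    qn = [gain * a for a in rms_norm(q)]  # assumed: gain scales the normalized query
    logits = [sum(a * b for a, b in zip(qn, rms_norm(k))) / math.sqrt(d)
              for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

Raising the gain from 1.0 to 5.25 on the same query and keys concentrates more weight on the best-matching key.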
PE
Likely the Polar Express Newton-Schulz (NS) coefficients listed under Novel Contributions, rather than a positional encoding or embedding.
parameters: null
SP8192
SP8192 tokenizer / representation variant.
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
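
For reference, a standard exponential moving average of weights with the listed decay looks like the sketch below. Parameters are shown as a flat list of floats for simplicity; how and when the PR applies the update is not stated.

```python
def ema_update(ema, params, decay=0.997):
    """Return the new EMA weights after one training step."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]
```

A decay of 0.997 corresponds to an averaging horizon of roughly 1 / (1 - 0.997), or about 333 steps.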
Optimizer
Muon
weight_decay: 0.095
momentum: 0.97
other_params: {"lr":0.022}
AdamW
weight_decay: null
momentum: null
other_params: null
Regularization
logit softcap
parameters: {"value":30}
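
Logit softcapping is commonly implemented as a scaled tanh, which is what this sketch assumes. The PR's exact form is not given in the listing.

```python
import math

def softcap(logits, cap=30.0):
    """Smoothly bound each logit to the interval [-cap, cap]."""
    return [cap * math.tanh(z / cap) for z in logits]
```

Small logits pass through almost unchanged, while very large ones saturate near the cap of 30.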
weight decay
parameters: {"value":0.095}
LR Schedule
warmdown
parameters: {"floor":0.1}
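
One plausible reading of a warmdown with a floor is sketched below: the learning rate stays constant, then decays linearly to `floor * base_lr` instead of to zero. The linear shape and the `warmdown_frac` parameter are assumptions. `base_lr=0.022` reuses the Muon learning rate from the listing purely for illustration.

```python
def lr_at(step, total_steps, base_lr=0.022, warmdown_frac=0.5, floor=0.1):
    """Constant LR, then linear warmdown to floor * base_lr."""
    start = int(total_steps * (1.0 - warmdown_frac))  # step where warmdown begins
    if step < start:
        return base_lr
    progress = (step - start) / max(1, total_steps - start)
    scale = 1.0 - (1.0 - floor) * min(1.0, progress)  # goes from 1.0 down to floor
    return base_lr * scale
```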
Test-Time Training
score-first TTT
parameters: {"epochs":4,"learning_rate":0.01,"momentum":0.9}
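
The "score-first" ordering can be illustrated with a toy model: each chunk is scored with weights that have not yet seen it, and only afterwards is the model adapted on that chunk. The sketch below uses a single scalar parameter with squared-error loss and SGD with momentum; the toy model and chunking are assumptions, and only the hyperparameter defaults come from the listing.

```python
def score_first_ttt(chunks, theta=0.0, epochs=4, lr=0.01, momentum=0.9):
    """Toy model: predicts every value as theta; loss is squared error.
    Returns (mean scored loss, final theta)."""
    velocity = 0.0
    total_loss, count = 0.0, 0
    for chunk in chunks:
        # 1) Score the chunk before any update that uses it.
        total_loss += sum((theta - y) ** 2 for y in chunk)
        count += len(chunk)
        # 2) Then adapt on the already-scored chunk for `epochs` passes.
        for _ in range(epochs):
            grad = sum(2.0 * (theta - y) for y in chunk) / len(chunk)
            velocity = momentum * velocity + grad
            theta -= lr * velocity
    return total_loss / count, theta
```

Because scoring always precedes the update, the reported loss for a chunk never benefits from training on that same chunk; gains come only from adapting on earlier ones.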
Quantization
GPTQ
bits: 6
scope: all
Compression
brotli
level: null
Evaluation
sliding window eval
parameters: null
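
Sliding window evaluation typically scores each token exactly once while giving it as much left context as the window allows. The sketch below only computes the index spans; the window and stride values are hypothetical, since the listing gives none.

```python
def sliding_windows(n_tokens, window, stride):
    """Return (ctx_start, score_start, end) spans, assuming stride <= window.
    The model sees tokens [ctx_start, end) but only the tokens in
    [score_start, end) contribute to the reported loss."""
    spans = []
    score_start = 0
    while score_start < n_tokens:
        end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - window)  # extend context back up to `window` tokens
        spans.append((ctx_start, score_start, end))
        score_start = end
    return spans
```

A smaller stride gives every scored token more context at the cost of more forward passes.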
Sequence Length
sequence_length
train_length: 8192
eval_length: null

Novel Contributions

  • SmearGate causal content-gated residual with zero initialization
  • Attention output gating with per-head sigmoid gates
  • 4-epoch score-first test-time training
  • Polar Express NS coefficients
  • MIN_LR warmdown floor
  • QK-Gain 5.25