PR #1825
closed
Record: SP8192 + PE + SmearGate + AttnOutGate + 4ep TTT — val_bpb 1.0770 (3-seed mean)
by EthanYangTW
val_bpb
1.0770
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.98 MB
Training Techniques
Architecture
SmearGate
Causal content-gated residual with zero initialization.
parameters: null
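One way to read "causal content-gated residual with zero initialization" is sketched below. It is an illustration under assumptions, not the PR's code: each position adds a gated copy of the previous position's vector, the gate depends on the current token's content, and a hypothetical learnable `scale` starts at zero so the layer begins as an identity. The names `smear_gate`, `gate_w`, and `scale` are invented for this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def smear_gate(xs, gate_w, scale):
    """Assumed form: y_t = x_t + scale * sigmoid(gate_w . x_t) * x_{t-1}.

    xs is a list of per-position feature vectors. Position 0 has no
    predecessor, so it passes through unchanged.
    """
    out = []
    for t, x in enumerate(xs):
        if t == 0:
            out.append(list(x))
            continue
        # Gate depends only on the current token (content-gated).
        g = scale * sigmoid(sum(w * xi for w, xi in zip(gate_w, x)))
        prev = xs[t - 1]  # only the past is read, so the mix is causal
        out.append([xi + g * pi for xi, pi in zip(x, prev)])
    return out

xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
identity_at_init = smear_gate(xs, gate_w=[0.0, 0.0], scale=0.0)
half_gate = smear_gate(xs, gate_w=[0.0, 0.0], scale=1.0)
```

With `scale=0.0` the output equals the input, which is the property zero initialization is usually meant to give.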
Gated Attention
Per-head sigmoid gate on attention outputs with zero initialization.
parameters: {"width":12}
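A minimal sketch of a per-head sigmoid gate on attention outputs follows. It assumes that `width` is the number of leading input features each head's gate reads, which the summary does not confirm; `gate_attention_output` and its arguments are hypothetical names.

```python
import math

def gate_attention_output(head_outputs, x, gate_weights):
    """Scale each head's output by sigmoid(w_h . x[:width]).

    head_outputs: one output vector per head.
    x: the token's input features.
    gate_weights: one weight vector of length `width` per head.
    """
    gated = []
    for head_out, w in zip(head_outputs, gate_weights):
        # zip stops at len(w), so only the first `width` features are used.
        logit = sum(wi * xi for wi, xi in zip(w, x))
        g = 1.0 / (1.0 + math.exp(-logit))
        gated.append([g * v for v in head_out])
    return gated

width = 12
heads = [[2.0, 4.0], [6.0, 8.0]]
zero_init = [[0.0] * width for _ in heads]
gated = gate_attention_output(heads, [1.0] * 16, zero_init)
```

In this form, zero-initialized weights give every head a gate of 0.5. Whether the PR rescales the gate so that zero initialization is an exact identity is not stated in this summary.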
QK-Gain
Scaled query-key gain used in attention.
parameters: {"gain":5.25}
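The effect of a query-key gain can be sketched as a multiplier on the attention logits. The sketch assumes unit-normalized queries and keys and a single scalar gain; the PR may apply the gain differently (for example per head or as a learned parameter), and `attention_weights` is a hypothetical helper.

```python
import math

def unit(v):
    # Assumes v is not the zero vector.
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def attention_weights(q, keys, gain=5.25):
    """Softmax over gain * cos(q, k) for each key."""
    qn = unit(q)
    logits = [gain * sum(a * b for a, b in zip(qn, unit(k))) for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

sharp = attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], gain=5.25)
soft = attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], gain=1.0)
```

Because normalized dot products lie in [-1, 1], a larger gain lets the softmax become more peaked on the best-matching key.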
PE
Most likely the Polar Express Newton-Schulz (NS) coefficients named under Novel Contributions; the PR title does not expand the abbreviation.
parameters: null
SP8192
Tokenizer variant, presumably SentencePiece with an 8192-token vocabulary.
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
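The EMA entry corresponds to the standard exponential moving average of weights, sketched here with the listed decay. The flat list of floats stands in for a real parameter set, and `ema_update` is a hypothetical name.

```python
def ema_update(ema, weights, decay=0.997):
    """Return decay * ema + (1 - decay) * weights, elementwise."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

# Track a constant weight of 1.0 starting from an average of 0.0.
ema = [0.0]
for _ in range(1000):
    ema = ema_update(ema, [1.0])
```

With decay 0.997 the average moves slowly: after n updates toward a constant target it has closed a fraction 1 - 0.997^n of the gap.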
Optimizer
Muon
weight_decay: 0.095
momentum: 0.97
other_params: {"lr":0.022}
AdamW
weight_decay: null
momentum: null
other_params: null
Regularization
logit softcap
parameters: {"value":30}
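A logit softcap is commonly written as cap * tanh(logit / cap), which is the form assumed in this sketch with the listed value of 30.

```python
import math

def softcap(logits, cap=30.0):
    """Squash logits smoothly into [-cap, cap]."""
    return [cap * math.tanh(z / cap) for z in logits]

capped = softcap([0.0, 1.0, 1000.0, -1000.0])
```

Small logits pass through almost unchanged, while very large ones saturate at the cap.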
weight decay
parameters: {"value":0.095}
LR Schedule
warmdown
parameters: {"floor":0.1}
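A warmdown with a floor can be sketched as a learning-rate multiplier that stays at 1.0 and then decays to the floor instead of to zero. The linear shape, the reading of 0.1 as a fraction of the peak rate, and the step counts in the usage lines are assumptions; `lr_multiplier` is a hypothetical name.

```python
def lr_multiplier(step, total_steps, warmdown_steps, floor=0.1):
    """Hold 1.0, then decay linearly to `floor` over the last warmdown_steps.

    Assumes warmdown_steps > 0.
    """
    start = total_steps - warmdown_steps
    if step < start:
        return 1.0
    frac = min(1.0, (step - start) / warmdown_steps)
    return 1.0 - (1.0 - floor) * frac

before = lr_multiplier(0, 1000, 200)
midway = lr_multiplier(900, 1000, 200)
final = lr_multiplier(1000, 1000, 200)
```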
Test-Time Training
score-first TTT
parameters: {"epochs":4,"learning_rate":0.01,"momentum":0.9}
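The ordering implied by "score-first" can be sketched with a toy model: each chunk is scored with the current parameters before any update that has seen it, and only then used for adaptation. The scalar model, squared-error loss, and the reading of "epochs" as passes over each chunk are assumptions for illustration; only the hyperparameter values come from the entry above.

```python
def score_first_ttt(chunks, theta=0.0, epochs=4, lr=0.01, momentum=0.9):
    """Toy model: predict every value in a chunk with one scalar theta."""
    velocity = 0.0
    scores = []
    for chunk in chunks:
        # 1) Score the chunk first, with parameters that have not seen it.
        scores.append(sum((theta - y) ** 2 for y in chunk) / len(chunk))
        # 2) Then adapt on the same chunk with SGD plus momentum.
        for _ in range(epochs):
            grad = sum(2.0 * (theta - y) for y in chunk) / len(chunk)
            velocity = momentum * velocity + grad
            theta -= lr * velocity
    return scores, theta

scores, theta = score_first_ttt([[1.0, 1.0], [1.0, 1.0]])
```

The first chunk is scored by the unadapted model, so the reported loss never benefits from training on the tokens being scored.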
Quantization
GPTQ
bits: 6
scope: all
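To show what a 6-bit weight grid looks like, the sketch below uses plain symmetric round-to-nearest quantization per row. This is not GPTQ: GPTQ additionally uses second-order statistics to compensate rounding error column by column, which is omitted here. Function names are hypothetical.

```python
def quantize_6bit(row):
    """Symmetric round-to-nearest onto signed 6-bit codes in [-32, 31]."""
    qmax = 2 ** (6 - 1) - 1  # 31
    scale = max(abs(w) for w in row) / qmax or 1.0  # avoid a zero scale
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in row]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

row = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_6bit(row)
restored = dequantize(codes, scale)
```

Each restored weight differs from the original by at most half a quantization step, which is the baseline error that GPTQ-style methods try to improve on.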
Compression
brotli
level: null
Evaluation
sliding window eval
parameters: null
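Sliding window evaluation can be sketched as a schedule of overlapping windows in which each token is scored exactly once, with as much left context as the window allows. The window and stride values below are hypothetical (the summary gives no eval length), and the sketch assumes 1 <= stride <= window.

```python
def sliding_windows(n_tokens, window, stride):
    """Return (start, end, score_from) spans.

    The model reads tokens [start, end) and only tokens
    [score_from, end) contribute to the reported loss.
    """
    spans = []
    scored_upto = 0
    while scored_upto < n_tokens:
        end = min(n_tokens, max(window, scored_upto + stride))
        start = max(0, end - window)
        spans.append((start, end, scored_upto))
        scored_upto = end
    return spans

spans = sliding_windows(n_tokens=10, window=4, stride=2)
```

A smaller stride gives each scored token more context at the cost of more forward passes.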
Sequence Length
sequence_length
train_length: 8192
eval_length: null
Novel Contributions
- SmearGate causal content-gated residual with zero initialization
- Attention output gating with per-head sigmoid gates
- 4-epoch score-first test-time training
- Polar Express NS coefficients
- MIN_LR warmdown floor
- QK-Gain 5.25