PR #1826 (open)
Record: SP8192 + PE + SmearGate + AttnOutGate + 4ep TTT — val_bpb 1.0770 (3-seed mean)
by EthanYangTW
val_bpb
1.0770
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.98 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: block weights
int8
bits: 8
scope: embeddings
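
To make the two bit widths concrete, here is a minimal sketch of symmetric per-row quantization. It uses plain round-to-nearest, which is an assumption made for brevity: GPTQ additionally compensates for rounding error using second-order statistics, and that step is not shown. The function names and the sample row are hypothetical.

```python
def quantize_row(row, bits):
    """Symmetric round-to-nearest quantization of one weight row.

    NOTE: plain rounding only, NOT GPTQ's error-compensated procedure.
    """
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits, 31 for 6 bits
    peak = max(abs(v) for v in row)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


row = [0.50, -0.25, 0.10, 0.00]           # hypothetical weights
codes8, scale8 = quantize_row(row, bits=8)
codes6, scale6 = quantize_row(row, bits=6)
```

The 6-bit codes use a coarser grid than the 8-bit ones, so the same row reconstructs with a larger worst-case error, at most half a quantization step in each case.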
Architecture
SmearGate
Causal content-gated residual with zero-init transparency
parameters: null
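
One possible reading of "causal content-gated residual with zero-init transparency" is sketched below. The exact formula is an assumption, not taken from the PR: each token adds a gated copy of the previous token's vector, the gate depends on the current token's content, and a scalar `lam` initialized to zero makes the layer an identity map at the start of training. All names here are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def smear_gate(xs, gate_w, lam):
    """Assumed form: y[t] = x[t] + lam * sigmoid(gate_w . x[t]) * x[t-1].

    xs: list of token vectors, oldest first.
    Only x[t-1] is read, so no future token leaks in (causal).
    lam = 0.0 (the assumed init) returns the input unchanged.
    """
    out = []
    for t, x in enumerate(xs):
        if t == 0:
            out.append(list(x))           # first token has no predecessor
            continue
        g = lam * sigmoid(sum(w * v for w, v in zip(gate_w, x)))
        out.append([v + g * p for v, p in zip(x, xs[t - 1])])
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # hypothetical activations
at_init = smear_gate(tokens, gate_w=[0.3, -0.2], lam=0.0)
trained = smear_gate(tokens, gate_w=[0.3, -0.2], lam=1.0)
```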
Gated Attention
Per-head sigmoid gate on attention output
parameters: {"width":12}
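
A minimal sketch of a per-head sigmoid gate on the attention output follows. It assumes that `width` is the number of leading input features each head's gate reads; the card does not say what `width` means, so treat that as a guess. The function name, weights, and toy sizes are hypothetical.

```python
import math


def gate_attention_output(head_outputs, x, gate_w, gate_b):
    """Scale each head's output vector by its own sigmoid gate.

    head_outputs: one output vector per head, for a single token
    x:            that token's input vector
    gate_w[h]:    gate weights for head h over the first `width` features of x
                  (assumed meaning of the `width` parameter)
    gate_b[h]:    gate bias for head h
    """
    gated = []
    for h, out in enumerate(head_outputs):
        width = len(gate_w[h])
        z = gate_b[h] + sum(w * v for w, v in zip(gate_w[h], x[:width]))
        g = 1.0 / (1.0 + math.exp(-z))    # one scalar gate per head
        gated.append([g * v for v in out])
    return gated


# Toy example: 2 heads, gate width 2 (the card lists width 12).
x = [1.0, -1.0, 0.5, 0.5]
heads = [[2.0, 4.0], [6.0, 8.0]]
zero_gate = gate_attention_output(heads, x, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
```

With all gate weights and biases at zero, every gate equals 0.5, so each head's output is halved.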
Optimizer
Muon
weight_decay: 0.095
momentum: 0.97
other_params: {"lr":0.022}
Weight Averaging
EMA
parameters: {"decay":0.997}
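
For reference, an exponential moving average of weights with the listed decay can be sketched as below. The update rule is the standard one; when it is applied and which weights it covers are not stated on the card. The function name is hypothetical.

```python
def ema_update(ema, weights, decay=0.997):
    """Blend the running average toward the current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


ema = [0.0, 0.0]
for _ in range(3):                        # three optimizer steps, constant weights
    ema = ema_update(ema, [1.0, -1.0])
```

With decay 0.997, each step moves the average only 0.3% of the way toward the current weights, so it smooths over several hundred recent steps.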
LR Schedule
warmdown
parameters: {"min_lr":0.1}
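
The sketch below shows one way a warmdown with a floor could work. It assumes `min_lr` is a fraction of the peak learning rate (0.1 is larger than the listed Muon lr of 0.022, so an absolute value seems unlikely) and that the decay is linear. The schedule shape, `warmdown_steps`, and the function name are assumptions.

```python
def lr_multiplier(step, total_steps, warmdown_steps, min_lr=0.1):
    """Fraction of the peak learning rate to use at `step`.

    Constant at 1.0, then linear decay to `min_lr` over the final
    `warmdown_steps` steps (assumed shape), never going below the floor.
    """
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    progress = min(1.0, (step - warmdown_start) / warmdown_steps)
    return 1.0 - progress * (1.0 - min_lr)


# Hypothetical run: 1000 steps, last 400 in warmdown.
schedule = [lr_multiplier(s, 1000, 400) for s in (0, 600, 800, 1000)]
```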
Regularization
logit softcap
parameters: {"value":30}
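
A common form of logit softcapping squashes each logit with a scaled tanh; the sketch below assumes that form with the listed value of 30. The PR may use a different variant.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly bound a logit to the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)


capped = [softcap(v) for v in (-500.0, 0.0, 1.0, 500.0)]
```

Small logits pass through almost unchanged, while very large ones saturate near ±30.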
Test-Time Training
score-first TTT
parameters: {"epochs":4,"learning_rate":0.01}
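
The "score-first" ordering can be illustrated with a toy model: each evaluation chunk is scored with the current weights before the model takes any gradient step on it, so the reported loss never benefits from having seen that chunk. The sketch assumes the 4 epochs are run per chunk, uses plain gradient descent, and swaps the transformer for a one-parameter Bernoulli model; all names are hypothetical.

```python
import math


def bits_per_token(theta, chunk):
    """Mean negative log2-likelihood of binary tokens under p = sigmoid(theta)."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return -sum(math.log2(p if t == 1 else 1.0 - p) for t in chunk) / len(chunk)


def grad(theta, chunk):
    """Gradient of the mean NLL (in nats) with respect to theta."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return sum(p - t for t in chunk) / len(chunk)


def score_first_ttt(chunks, theta=0.0, epochs=4, lr=0.01):
    scores = []
    for chunk in chunks:
        scores.append(bits_per_token(theta, chunk))   # 1) score BEFORE adapting
        for _ in range(epochs):                       # 2) then train on the chunk
            theta -= lr * grad(theta, chunk)
    return scores, theta


chunks = [[1, 1, 1, 0]] * 3               # hypothetical validation stream
scores, theta = score_first_ttt(chunks)
```

The first chunk is scored by the unadapted model; later chunks benefit only from updates made on chunks that were already scored.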
Sequence Length
sequence_length
train_length: 8192
eval_length: null
Novel Contributions
- Polar Express NS coefficients with per-iteration minimax-optimal tuples and row normalization
- MIN_LR=0.10 warmdown floor
- QK-Gain 5.25
- SmearGate causal content-gated residual
- Attention Output Gate with per-head sigmoid gating
- 4-epoch score-first TTT
- Achieved a 3-seed mean val_bpb of 1.0770 under the 16 MB artifact limit
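
As an illustration of the first item, the sketch below shows the general shape of a Newton-Schulz orthogonalization that takes a separate coefficient tuple for each iteration. The tuples used are the classic cubic coefficients (1.5, -0.5, 0), chosen as a stand-in: they are not the Polar Express minimax-optimal values, which the card does not list. Frobenius normalization is used here in place of the row normalization the PR mentions. All names are hypothetical.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(r) for r in zip(*a)]


def orthogonalize(g, coeffs):
    """Iterate X <- a*X + b*(X X^T) X + c*(X X^T)^2 X, one (a, b, c) per step."""
    norm = sum(v * v for row in g for v in row) ** 0.5 or 1.0
    x = [[v / norm for v in row] for row in g]        # singular values <= 1
    for a, b, c in coeffs:
        s = matmul(x, transpose(x))                   # X X^T
        sx = matmul(s, x)                             # (X X^T) X
        ssx = matmul(s, sx)                           # (X X^T)^2 X
        x = [[a * x[i][j] + b * sx[i][j] + c * ssx[i][j]
              for j in range(len(x[0]))] for i in range(len(x))]
    return x


# Placeholder schedule: the same cubic tuple every step (NOT Polar Express).
PLACEHOLDER_COEFFS = [(1.5, -0.5, 0.0)] * 12
q = orthogonalize([[3.0, 1.0], [1.0, 2.0]], PLACEHOLDER_COEFFS)
```

A tuned per-iteration schedule aims to reach a near-orthogonal result in far fewer steps than the 12 used by this placeholder.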