PR #1800
closed
Record: #1787 + Sparse Gate + Updated Frozen Carry — val_bpb 1.06287
by leon2k2k2k
val_bpb
1.0629
Architecture
Transformer
Optimizer
—
Artifact Size
15,909,401 bytes
Training Techniques
Test-Time Training
LoRA TTT
parameters: {"phased":true}
Weight Averaging
EMA
parameters: null
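For reference, a generic EMA weight-averaging update; the decay value is an assumption, since the record lists no parameters.

```python
# Generic EMA of model weights; decay is illustrative, the PR records no value.
import copy
import torch


@torch.no_grad()
def update_ema(ema_model, model, decay=0.999):
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)


# Usage: ema_model = copy.deepcopy(model); call update_ema after each optimizer
# step and evaluate with ema_model.
```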
Architecture
Gated Attention
Replaces the dense attention-output gate with a narrow-input sparse gate.
parameters: null
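A sketch of one way a narrow-input sparse gate on the attention output could be wired, replacing a dense d_model-to-d_model gate projection; the slice width and per-head gate granularity are assumptions.

```python
# Sketch of a narrow-input sparse gate on the attention output (slice width and
# per-head granularity are assumptions).
import torch
import torch.nn as nn


class SparseAttnGate(nn.Module):
    def __init__(self, d_model: int, n_heads: int, gate_in: int = 32):
        super().__init__()
        self.gate_in = gate_in                      # narrow slice of the block input
        self.head_dim = d_model // n_heads
        self.proj = nn.Linear(gate_in, n_heads)     # one scalar gate per head

    def forward(self, x, attn_out):
        # x, attn_out: (batch, seq, d_model)
        g = torch.sigmoid(self.proj(x[..., : self.gate_in]))   # (B, T, n_heads)
        g = g.repeat_interleave(self.head_dim, dim=-1)          # expand to d_model
        return attn_out * g
```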
depth recurrence
Uses a frozen recurrent carry with learned alpha/beta coefficients.
parameters: {"alpha":[[0.23,0.04,0.03],[0.13,-0.34,0.01],[0.06,0.19,-0.02]],"beta":[1.56,1.85,2.13]}
LR Schedule
warmdown
parameters: {"min_lr":0.1}
Regularization
logit softcap
parameters: null
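Logit softcapping as it is commonly implemented, cap * tanh(logits / cap); the cap value below is an assumption since the record lists no parameters.

```python
# Logit softcap; the cap value is an assumption (the record lists no parameter).
import torch


def softcap(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    return cap * torch.tanh(logits / cap)
```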
Quantization
GPTQ
bits: 8
scope: model weights
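A condensed GPTQ-style sketch for 8-bit quantization of one linear layer's weights, omitting blocking, activation ordering, and group-wise scales; the calibration setup and damping are assumptions.

```python
# Condensed GPTQ-style 8-bit quantization of one linear layer's weights
# (no blocking, activation ordering, or group-wise scales; calibration and
# damping choices are assumptions).
import torch


@torch.no_grad()
def gptq_quantize(weight: torch.Tensor, calib_x: torch.Tensor, bits: int = 8):
    """weight: (out_features, in_features); calib_x: (n_samples, in_features)."""
    W = weight.clone().float()
    H = 2.0 * calib_x.T.float() @ calib_x.float()            # layer Hessian proxy
    H += 0.01 * H.diag().mean() * torch.eye(H.shape[0])      # damping
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    U = torch.linalg.cholesky(Hinv, upper=True)              # factor used for updates

    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().amax(dim=1, keepdim=True) / qmax         # per-row symmetric scale
    Q = torch.zeros_like(W)

    for j in range(W.shape[1]):                               # column by column
        w = W[:, j]
        q = torch.clamp(torch.round(w / scale[:, 0]), -qmax, qmax)
        Q[:, j] = q
        err = (w - q * scale[:, 0]) / U[j, j]
        W[:, j + 1:] -= err[:, None] * U[j, j + 1:][None, :]  # compensate later columns
    return Q, scale
```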
Other
other
Polar Express Newton-Schulz coefficients.
parameters: null
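A sketch of the quintic Newton-Schulz orthogonalization that such coefficients would parameterize, with a per-step coefficient schedule in place of a single fixed triple; the actual Polar Express values are not recorded here, so the schedule below is a placeholder.

```python
# Quintic Newton-Schulz orthogonalization with a per-step coefficient schedule
# (the published Polar Express values are not recorded in this PR; the numbers
# below are placeholders only).
import torch


@torch.no_grad()
def newton_schulz_orth(G: torch.Tensor, coeffs, eps: float = 1e-7) -> torch.Tensor:
    """Approximate the orthogonal polar factor of G (assumes shape (m, n), m <= n)."""
    X = G / (G.norm() + eps)                   # scale so all singular values are <= 1
    for a, b, c in coeffs:
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X  # quintic update with step-specific coeffs
    return X


placeholder_coeffs = [(3.0, -3.2, 1.2)] * 5    # placeholder schedule, not the paper's
```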
Novel Contributions
- Sparse attention-output gate
- Updated frozen recurrent carry
- Phased LoRA TTT stack
- Stackable with smear gate and LQER from #1797