PR #1801
Record: #1787 + Sparse Gate + Updated Frozen Carry — val_bpb 1.06287
by leon2k2k2k
val_bpb: 1.0629
Architecture: Transformer
Optimizer: —
Artifact Size: 15,909,401 bytes
Training Techniques
Test-Time Training: LoRA TTT
parameters: {"phased":true,"score_first":true}
Weight Averaging: EMA
parameters: null
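EMA weight averaging is standard; a minimal sketch (the decay value is illustrative, since the record lists parameters: null):

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """Shadow weights track an exponential moving average of the live weights."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)
```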
Architecture: Gated Attention
Replaces the dense GatedAttn with a narrow-input sparse attention-output gate.
parameters: {"sparse_gate":true}
Architecture: Gated Attention
Updates the frozen recurrent carry using learned alpha/beta coefficients, quantized to 2 decimal places.
parameters: {"frozen_carry":true}
LR Schedule: warmdown
parameters: {"min_lr":0.1}
Quantization: GPTQ
bits: null
scope: artifact/model weights
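bits is left unset in the record; for reference, the core of GPTQ quantizes weight columns one at a time and folds each column's rounding error into the not-yet-quantized columns through the inverse Hessian of the layer reconstruction loss. A compact sketch (4-bit symmetric per-row scales are an illustrative choice, not the PR's setting):

```python
import torch

def gptq_quantize(W: torch.Tensor, X: torch.Tensor, bits: int = 4, damp: float = 0.01):
    """GPTQ core loop. W is (out_features, in_features); X is (n_samples,
    in_features) calibration input. Returns integer levels Q and per-row
    scales s; dequantize with Q * s[:, None]."""
    W = W.clone().float()
    cols = W.shape[1]
    H = 2.0 * X.T.float() @ X.float()                  # layer-wise Hessian
    H += damp * H.diagonal().mean() * torch.eye(cols)  # damping for stability
    Hinv = torch.linalg.cholesky(torch.linalg.inv(H), upper=True)

    qmax = 2 ** (bits - 1) - 1
    s = W.abs().amax(dim=1) / qmax                     # symmetric per-row scale
    Q = torch.zeros_like(W)
    for j in range(cols):
        q = torch.clamp(torch.round(W[:, j] / s), -qmax - 1, qmax)
        Q[:, j] = q
        err = (W[:, j] - q * s) / Hinv[j, j]
        # Spread this column's quantization error over the remaining columns.
        W[:, j + 1:] -= err[:, None] * Hinv[j, j + 1:][None, :]
    return Q, s
```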
Regularization: logit softcap
parameters: null
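Logit softcapping bounds the logits smoothly with a scaled tanh; the record lists parameters: null, so the cap value below is illustrative:

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    """Smoothly bound logits to (-cap, cap): cap * tanh(logits / cap)."""
    return cap * torch.tanh(logits / cap)
```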
Novel Contributions
- Sparse attention-output gate
- Updated frozen recurrent carry with re-learned alpha/beta coefficients
- Stackable with smear gate and LQER from #1797
- Score-first phased LoRA TTT