PR #1801 (open)

Record: #1787 + Sparse Gate + Updated Frozen Carry — val_bpb 1.06287

by leon2k2k2k
val_bpb: 1.0629
Architecture: Transformer
Optimizer: (unspecified)
Artifact Size: 15,909,401 bytes

Training Techniques

  • Test-Time Training: LoRA TTT (parameters: {"phased":true,"score_first":true})
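
The entry above gives no implementation details; the sketch below shows the general LoRA TTT pattern in plain numpy, with the score_first ordering made explicit (the example is scored with un-adapted weights before any adaptation). The linear model, squared-error loss, rank, and learning rate are all illustrative assumptions, not the PR's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen base weight plus a low-rank LoRA delta; only A and B are
# trained during test-time training (TTT).
d_in, d_out, r = 8, 8, 2
W = rng.standard_normal((d_out, d_in))      # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable LoRA factor
B = np.zeros((d_out, r))                    # trainable, zero-initialized

def predict(x):
    return (W + B @ A) @ x

def ttt_step(x, y, lr=0.05):
    """One TTT step on the test example's own loss, updating A and B only."""
    global A, B
    err = predict(x) - y                # gradient of 0.5*||err||^2 w.r.t. pred
    gB = np.outer(err, A @ x)           # dL/dB
    gA = np.outer(B.T @ err, x)         # dL/dA
    B -= lr * gB
    A -= lr * gA
    return 0.5 * float(err @ err)       # loss *before* this update

x = rng.standard_normal(d_in)
y = rng.standard_normal(d_out)

# "score_first": score with the un-adapted weights, then adapt.
score = 0.5 * float((predict(x) - y) @ (predict(x) - y))
loss0 = ttt_step(x, y)
loss1 = ttt_step(x, y)
assert loss1 < loss0  # adapting on the example reduces its own loss
```
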
  • Weight Averaging: EMA
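
EMA weight averaging keeps a slow-moving copy of the weights for evaluation; a minimal sketch, with decay=0.999 as an assumed value since the entry lists no parameters:

```python
def ema_update(avg, params, decay=0.999):
    """One EMA step per parameter: avg <- decay*avg + (1-decay)*params."""
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in params}

# Toy usage: the average drifts toward the current weights over many steps.
avg = {"w": 0.0}
for _ in range(5000):
    avg = ema_update(avg, {"w": 1.0})
```
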
  • Architecture: Gated Attention. Replaces the dense GatedAttn with a narrow-input sparse attention-output gate. (parameters: {"sparse_gate":true})
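
The exact gate form is not specified in the listing; below is a minimal numpy sketch of one plausible reading, where the gate reads only a narrow slice of the input (d_gate much smaller than d_model, the "narrow input") and uses ReLU so that many per-head gate values are exactly zero (the "sparse" part, unlike a dense sigmoid gate over the full residual stream). All dimensions and the ReLU choice are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_gate = 64, 8, 8       # d_gate << d_model: narrow input
Wg = rng.standard_normal((n_heads, d_gate)) * 0.1

def sparse_output_gate(x, attn_out):
    """Gate the attention output per head, computed from a narrow input slice.
    ReLU zeroes out many gate values, silencing those heads entirely."""
    g = np.maximum(0.0, x[..., :d_gate] @ Wg.T)          # (..., n_heads), sparse
    heads = attn_out.reshape(*attn_out.shape[:-1], n_heads, -1)
    return (heads * g[..., None]).reshape(attn_out.shape)

x = rng.standard_normal((4, d_model))      # toy batch of inputs
attn = rng.standard_normal((4, d_model))   # toy attention outputs
y = sparse_output_gate(x, attn)
```
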
  • Architecture: Gated Attention. Updates the frozen recurrent carry using learned alpha/beta coefficients, quantized to 2 decimal places. (parameters: {"frozen_carry":true})
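
A minimal sketch of the carry update as described: a blend of the old carry and the new update, with coefficients that were learned, rounded ("quantized") to 2 decimal places, and then frozen. The alpha/beta values and the linear-blend form are assumptions for illustration; the PR lists neither.

```python
import numpy as np

# Hypothetical learned values (not from the PR), quantized to 2 decimal
# places and then frozen for inference.
alpha_learned, beta_learned = 0.8734, 0.1291
ALPHA, BETA = round(alpha_learned, 2), round(beta_learned, 2)  # 0.87, 0.13

def update_carry(carry, update):
    """Frozen recurrent carry: fixed quantized blend of old state and update."""
    return ALPHA * carry + BETA * update
```
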
  • LR Schedule: warmdown (parameters: {"min_lr":0.1})
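
A sketch of a warmdown schedule consistent with the listed parameter, reading min_lr: 0.1 as a ratio of the peak LR (an assumption); the constant-then-linear-decay shape and the warmdown fraction are also assumed.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.3, min_lr_ratio=0.1):
    """Constant LR, then a linear 'warmdown' to min_lr_ratio * base_lr over
    the final warmdown_frac of training. warmdown_frac is an assumed value;
    min_lr_ratio=0.1 comes from the PR's parameters."""
    start = int(total_steps * (1 - warmdown_frac))
    if step < start:
        return base_lr
    t = (step - start) / max(1, total_steps - start)  # 0 -> 1 over warmdown
    return base_lr * (1 - t * (1 - min_lr_ratio))
```
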
  • Quantization: GPTQ (bits: null; scope: artifact/model weights)
  • Regularization: logit softcap
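
Logit softcapping squashes logits smoothly into a bounded range instead of hard-clipping them; a minimal sketch, with cap=15.0 as an assumed value since the entry lists no parameters:

```python
import numpy as np

def softcap(logits, cap=15.0):
    """Soft-cap logits into (-cap, cap) via cap * tanh(logits / cap).
    Near zero it is approximately the identity; large logits saturate
    smoothly at +/-cap, keeping gradients finite everywhere."""
    return cap * np.tanh(logits / cap)
```
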

Novel Contributions

  • Sparse attention-output gate
  • Updated frozen recurrent carry with re-learned alpha/beta coefficients
  • Stackable with smear gate and LQER from #1797
  • Score-first phased LoRA TTT