PR #1800
closed
Record: #1787 + Sparse Gate + Updated Frozen Carry — val_bpb 1.06287
by leon2k2k2k
val_bpb
1.0629
Architecture
Transformer
Optimizer
—
Artifact Size
15,909,401 bytes
Training Techniques
Test-Time Training
LoRA TTT
parameters: {"phased":true}
Weight Averaging
EMA
parameters: null
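For reference, a generic EMA weight-averaging update; the decay value is an assumption, since the record lists no parameters.

```python
# Generic EMA of model weights; decay is illustrative, the PR records no value.
import copy
import torch


@torch.no_grad()
def update_ema(ema_model, model, decay=0.999):
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)


# Usage: ema_model = copy.deepcopy(model); call update_ema after each optimizer
# step and evaluate with ema_model.
```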
Architecture
Gated Attention
Replaces the dense attention-output gate with a narrow-input sparse gate.
parameters: null
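A sketch of one way a narrow-input sparse gate on the attention output could be wired, replacing a dense d_model-to-d_model gate projection; the slice width and per-head gate granularity are assumptions.

```python
# Sketch of a narrow-input sparse gate on the attention output (slice width and
# per-head granularity are assumptions).
import torch
import torch.nn as nn


class SparseAttnGate(nn.Module):
    def __init__(self, d_model: int, n_heads: int, gate_in: int = 32):
        super().__init__()
        self.gate_in = gate_in                      # narrow slice of the block input
        self.head_dim = d_model // n_heads
        self.proj = nn.Linear(gate_in, n_heads)     # one scalar gate per head

    def forward(self, x, attn_out):
        # x, attn_out: (batch, seq, d_model)
        g = torch.sigmoid(self.proj(x[..., : self.gate_in]))   # (B, T, n_heads)
        g = g.repeat_interleave(self.head_dim, dim=-1)          # expand to d_model
        return attn_out * g
```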
depth recurrence
Uses a frozen recurrent carry with learned alpha/beta coefficients.
parameters: {"alpha":[[0.23,0.04,0.03],[0.13,-0.34,0.01],[0.06,0.19,-0.02]],"beta":[1.56,1.85,2.13]}
LR Schedule
warmdown
parameters: {"min_lr":0.1}
Regularization
logit softcap
parameters: null
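Logit softcapping as it is commonly implemented, cap * tanh(logits / cap); the cap value below is an assumption since the record lists no parameters.

```python
# Logit softcap; the cap value is an assumption (the record lists no parameter).
import torch


def softcap(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    return cap * torch.tanh(logits / cap)
```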
Quantization
GPTQ
bits: 8
scope: model weights
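A condensed GPTQ-style sketch for 8-bit quantization of one linear layer's weights, omitting blocking, activation ordering, and group-wise scales; the calibration setup and damping are assumptions.

```python
# Condensed GPTQ-style 8-bit quantization of one linear layer's weights
# (no blocking, activation ordering, or group-wise scales; calibration and
# damping choices are assumptions).
import torch


@torch.no_grad()
def gptq_quantize(weight: torch.Tensor, calib_x: torch.Tensor, bits: int = 8):
    """weight: (out_features, in_features); calib_x: (n_samples, in_features)."""
    W = weight.clone().float()
    H = 2.0 * calib_x.T.float() @ calib_x.float()            # layer Hessian proxy
    H += 0.01 * H.diag().mean() * torch.eye(H.shape[0])      # damping
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    U = torch.linalg.cholesky(Hinv, upper=True)              # factor used for updates

    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().amax(dim=1, keepdim=True) / qmax         # per-row symmetric scale
    Q = torch.zeros_like(W)

    for j in range(W.shape[1]):                               # column by column
        w = W[:, j]
        q = torch.clamp(torch.round(w / scale[:, 0]), -qmax, qmax)
        Q[:, j] = q
        err = (w - q * scale[:, 0]) / U[j, j]
        W[:, j + 1:] -= err[:, None] * U[j, j + 1:][None, :]  # compensate later columns
    return Q, scale
```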
Other
other
Polar Express Newton-Schulz coefficients.
parameters: null
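A sketch of the quintic Newton-Schulz orthogonalization that such coefficients would parameterize, with a per-step coefficient schedule in place of a single fixed triple; the actual Polar Express values are not recorded here, so the schedule below is a placeholder.

```python
# Quintic Newton-Schulz orthogonalization with a per-step coefficient schedule
# (the published Polar Express values are not recorded in this PR; the numbers
# below are placeholders only).
import torch


@torch.no_grad()
def newton_schulz_orth(G: torch.Tensor, coeffs, eps: float = 1e-7) -> torch.Tensor:
    """Approximate the orthogonal polar factor of G (assumes shape (m, n), m <= n)."""
    X = G / (G.norm() + eps)                   # scale so all singular values are <= 1
    for a, b, c in coeffs:
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X  # quintic update with step-specific coeffs
    return X


placeholder_coeffs = [(3.0, -3.2, 1.2)] * 5    # placeholder schedule, not the paper's
```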
Novel Contributions
- Sparse attention-output gate
- Updated frozen recurrent carry
- Phased LoRA TTT stack
- Stackable with smear gate and LQER from #1797