PR #1801
Record: #1787 + Sparse Gate + Updated Frozen Carry — val_bpb 1.06287
by leon2k2k2k
val_bpb: 1.0629
Architecture: Transformer
Optimizer: —
Artifact Size: 15,909,401 bytes
Training Techniques
Test-Time Training: LoRA TTT
parameters: {"phased":true,"score_first":true}
Weight Averaging: EMA
parameters: null
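EMA weight averaging is standard; a minimal sketch (the decay value is illustrative, since the record lists parameters: null):

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """Shadow weights track an exponential moving average of the live weights."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)
```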
Architecture: Gated Attention
Replaces the dense GatedAttn with a narrow-input sparse attention-output gate.
parameters: {"sparse_gate":true}
Architecture: Gated Attention
Updates the frozen recurrent carry using learned alpha/beta coefficients, quantized to 2 decimal places.
parameters: {"frozen_carry":true}
LR Schedule: warmdown
parameters: {"min_lr":0.1}
Quantization: GPTQ
bits: null
scope: artifact/model weights
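bits is left unset in the record; for reference, the core of GPTQ quantizes weight columns one at a time and folds each column's rounding error into the not-yet-quantized columns through the inverse Hessian of the layer reconstruction loss. A compact sketch (4-bit symmetric per-row scales are an illustrative choice, not the PR's setting):

```python
import torch

def gptq_quantize(W: torch.Tensor, X: torch.Tensor, bits: int = 4, damp: float = 0.01):
    """GPTQ core loop. W is (out_features, in_features); X is (n_samples,
    in_features) calibration input. Returns integer levels Q and per-row
    scales s; dequantize with Q * s[:, None]."""
    W = W.clone().float()
    cols = W.shape[1]
    H = 2.0 * X.T.float() @ X.float()                  # layer-wise Hessian
    H += damp * H.diagonal().mean() * torch.eye(cols)  # damping for stability
    Hinv = torch.linalg.cholesky(torch.linalg.inv(H), upper=True)

    qmax = 2 ** (bits - 1) - 1
    s = W.abs().amax(dim=1) / qmax                     # symmetric per-row scale
    Q = torch.zeros_like(W)
    for j in range(cols):
        q = torch.clamp(torch.round(W[:, j] / s), -qmax - 1, qmax)
        Q[:, j] = q
        err = (W[:, j] - q * s) / Hinv[j, j]
        # Spread this column's quantization error over the remaining columns.
        W[:, j + 1:] -= err[:, None] * Hinv[j, j + 1:][None, :]
    return Q, s
```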
Regularization: logit softcap
parameters: null
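Logit softcapping bounds the logits smoothly with a scaled tanh; the record lists parameters: null, so the cap value below is illustrative:

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    """Smoothly bound logits to (-cap, cap): cap * tanh(logits / cap)."""
    return cap * torch.tanh(logits / cap)
```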
Novel Contributions
- Sparse attention-output gate
- Updated frozen recurrent carry with re-learned alpha/beta coefficients
- Stackable with smear gate and LQER from #1797
- Score-first phased LoRA TTT