PR #1798

closed

Record: #1787 + Sparse Gate + Updated Frozen Carry — val_bpb 1.06287

by leon2k2k2k
val_bpb: 1.0629
Architecture: Transformer
Optimizer:
Artifact Size: ~15.91 MB

Training Techniques

Test-Time Training
LoRA TTT
parameters: {"phased":true,"score_first":true}
Architecture
Gated Attention
Replaces dense attention output gating with a narrow-input sparse gate; includes learned per-head gate scalars and quant-time gate scaling.
parameters: {"init_std":0.005,"quant_gate":true}
depth recurrence
Loop4-5-style depth recurrence: layers 4 and 5 are executed repeatedly in sequence, reusing the same weights on each pass.
parameters: {"layers":[4,5]}
weight tying
CaseOps tokenizer with shared vocabulary/operator tokens is mentioned, but no explicit model weight tying is described.
parameters: null
Quantization
GPTQ
bits: null
scope: model weights
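For reference, the core of a GPTQ-style pass over one weight matrix, sketched in PyTorch. The bit-width is not recorded above (`bits: null`), so `bits=4` and the symmetric per-row scale are placeholders; an actual run would use the GPTQ reference implementation with calibration data.

```python
import torch

def gptq_quantize(W, H, bits=4, damp=0.01):
    """Quantize W [out, in] column by column, compensating each column's
    rounding error on the not-yet-quantized columns via the inverse Hessian
    (H ~ 2 X X^T accumulated from calibration activations)."""
    W = W.clone().float()
    n = W.shape[1]
    H = H + damp * H.diagonal().mean() * torch.eye(n, device=W.device)  # dampen for stability
    Hinv = torch.linalg.cholesky(
        torch.cholesky_inverse(torch.linalg.cholesky(H)), upper=True)
    maxq = 2 ** bits - 1
    scale = W.abs().amax(dim=1) * 2 / maxq               # symmetric per-row scale
    Q = torch.empty_like(W)
    for i in range(n):
        w, d = W[:, i], Hinv[i, i]
        q = (torch.clamp(torch.round(w / scale + maxq / 2), 0, maxq) - maxq / 2) * scale
        Q[:, i] = q
        err = (w - q) / d
        W[:, i + 1:] -= err.unsqueeze(1) * Hinv[i, i + 1:].unsqueeze(0)  # error propagation
    return Q
```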
LR Schedule
warmdown
parameters: {"min_lr":0.1}
Regularization
logit softcap
parameters: {"fused_softcapped_ce":true}
Other
other
CaseOps reversible case preprocessing with operator tokens, plus a per-token byte sidecar so BPB is accounted against the original bytes.
parameters: {"tokenizer":"CaseOps","byte_sidecar":true}

Novel Contributions

  • Sparse attention-output gate replacing the dense gated attention path
  • Updated frozen recurrent carry with re-learned and 2-decimal-quantized alpha/beta coefficients (a minimal sketch follows this list)
  • Phased score-first LoRA TTT stack
  • CaseOps reversible tokenizer and original-byte BPB accounting
  • Quant-time gate scaling to recover artifact overhead
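A minimal sketch of the frozen-carry contribution above. How the carry mixes into the recurrent loop, and the coefficient values, are assumptions; only the re-learn, round-to-two-decimals, then-freeze recipe comes from the record.

```python
import torch
import torch.nn as nn

class FrozenCarry(nn.Module):
    """Recurrent carry across depth-recurrent passes: h <- alpha * h + beta * f(h).
    alpha/beta are re-learned, rounded to two decimals, and then frozen
    (stored as buffers so they ship at fixed values in the artifact).
    The 0.85 / 0.55 defaults are placeholders, not the PR's coefficients."""
    def __init__(self, alpha=0.85, beta=0.55):
        super().__init__()
        self.register_buffer("alpha", torch.tensor(round(alpha, 2)))
        self.register_buffer("beta", torch.tensor(round(beta, 2)))

    def forward(self, h, f_out):
        return self.alpha * h + self.beta * f_out
```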