PR #1790

open

Record: SP8192 + SmearGate + AttnOutGate(w24) + LoRA-TTT Improvements + Phased TTT — val_bpb 1.06991 (3-seed mean)

by miaoyuxun
val_bpb
1.0699
Architecture
Transformer
Optimizer
SGD
Artifact Size
~15.9 MB

Training Techniques

Architecture
SmearGate
Forward token-smear gate that mixes each token with its predecessor, using a zero-initialized, transparent-at-init residual gate shared across layers.
parameters: {"width":12}
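A minimal sketch of the smear gate described above, assuming a particular parameterization: the gate reads a `width`-channel slice of the activation, and a zero-initialized scalar `scale` makes the layer transparent (output equals input) at init. The exact gate wiring in the PR is not stated, so `w_gate` and `scale` here are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, w_gate, scale):
    """Mix each token with its predecessor under a learned gate.

    x:      (T, D) token activations.
    w_gate: (gate_width,) weights over the first gate_width channels of x
            (hypothetical layout; the PR only gives width=12).
    scale:  scalar, zero-initialized so the layer is transparent at init,
            matching "zero-init transparent residual gating".
    """
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                      # the first token has no predecessor
    gw = w_gate.shape[0]
    g = sigmoid(x[:, :gw] @ w_gate)    # (T,) per-token smear gate
    return x + scale * g[:, None] * prev
```

With `scale = 0` the function returns `x` unchanged, which is the transparent initialization the record refers to.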
Gated Attention
Per-head multiplicative gate on the attention output, applied before out_proj, with a widened gate input.
parameters: {"width":24}
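A sketch of the per-head output gate, under the assumption that the gate is a sigmoid read off a widened (width=24) slice of the block input and multiplied into each head's output before out_proj. The slice-based layout of `w_gate` is an assumption; the PR only specifies the gate input width.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_attn_output(attn_out, x, w_gate):
    """Apply a per-head multiplicative gate before out_proj.

    attn_out: (T, H, Dh) per-head attention outputs, pre-out_proj.
    x:        (T, D) residual-stream input to the attention block.
    w_gate:   (gate_width, H) per-head gate weights over a widened
              gate_width=24 slice of x (hypothetical layout).
    """
    gw = w_gate.shape[0]
    g = sigmoid(x[:, :gw] @ w_gate)    # (T, H) per-head gate values
    return attn_out * g[:, :, None]    # multiplicative, one gate per head
```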
Test-Time Training
LoRA TTT
parameters: {"rank":128,"alpha":144,"warm_start_A":true,"weight_decay":1}
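A sketch of the LoRA test-time-training setup using the PR's stated hyperparameters (rank 128, alpha 144, weight decay 1). The alpha/rank output scaling and the decoupled-weight-decay SGD step are standard LoRA conventions assumed here, not details confirmed by the record; per the `warm_start_A` flag, A would be warm-started rather than randomly initialized, with B zero-initialized so the adapter starts as a no-op.

```python
import numpy as np

RANK, ALPHA = 128, 144          # from the PR's LoRA-TTT parameters

def lora_forward(x, W, A, B):
    """y = W x plus a rank-RANK low-rank update scaled by ALPHA / RANK.

    W: (d_out, d_in) frozen base weight.
    A: (RANK, d_in) down-projection (warm-started per warm_start_A).
    B: (d_out, RANK) up-projection (zero-init => adapter is a no-op).
    """
    return x @ W.T + (ALPHA / RANK) * (x @ A.T) @ B.T

def ttt_sgd_step(A, B, grad_A, grad_B, lr=1e-3, wd=1.0):
    """One SGD step with decoupled weight decay on the LoRA factors only;
    the frozen base weight W is never updated during test-time training."""
    A -= lr * (grad_A + wd * A)
    B -= lr * (grad_B + wd * B)
    return A, B
```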
Optimizer
SGD
weight_decay: null
momentum: null
other_params: {"global_ttt_lr":0.001,"phased_ttt_num_phases":3,"phased_ttt_prefix_docs":2000}
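A plain scheduling sketch of the phased global-SGD TTT implied by `phased_ttt_num_phases=3` and `phased_ttt_prefix_docs=2000`: the eval stream is split into contiguous phases, and within each phase the first `prefix_docs` documents feed global-SGD test-time training at `global_ttt_lr=1e-3`. The exact adapt/score interleaving inside a phase is an assumption, not confirmed by the record.

```python
def phased_ttt_schedule(docs, num_phases=3, prefix_docs=2000):
    """Split the eval stream into `num_phases` contiguous chunks.

    Within each phase, the first `prefix_docs` documents are the
    test-time-training set (global SGD, lr 1e-3 per the PR); the whole
    chunk is scored. Hypothetical scheduling sketch only.
    """
    n = len(docs)
    per = (n + num_phases - 1) // num_phases   # ceil split
    phases = []
    for p in range(num_phases):
        chunk = docs[p * per : (p + 1) * per]
        phases.append({"adapt": chunk[:prefix_docs], "score": chunk})
    return phases
```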
Regularization
weight decay
parameters: {"value":1}
Quantization
GPTQ
bits: null
scope: model
Sequence Length
sequence_length
train_length: null
eval_length: null

Novel Contributions

  • Combines SmearGate, widened attention output gating, LoRA TTT improvements, and phased global SGD TTT on the SP8192 base.
  • Applies phased TTT on top of the SmearGate + AttnGate architecture, rather than standard TTT.
  • Uses a warm-started A matrix for LoRA TTT, together with a higher rank, alpha scaling, and stronger weight decay.
  • Reports a 3-seed mean validation bpb of 1.06991 with an artifact size under 16 MB.