PR #1920
Record: SP8192 PR #1874 + TTT_CHUNK_SIZE=32 — val_bpb 1.06990 (3-seed mean)
by bigbag
val_bpb
1.0699
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,950,196 bytes
Training Techniques
Test-Time Training
LoRA TTT
parameters: {"rank":128,"phased":true,"score_first":true,"chunk_size":32}
Architecture
SmearGate
Per-layer smoothing gate used with attention output gating.
parameters: {"width":24}
Gated Attention
Learned gating applied to each layer's attention output.
parameters: {"width":24}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"Newton-Schulz":true,"Polar Express":true}
LR Schedule
warmdown
parameters: {"min_lr":0.1}
Quantization
GPTQ
bits: null
scope: model weights
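GPTQ quantizes each linear layer's weights column by column, folding each column's rounding error into the not-yet-quantized columns via the inverse Hessian of the layer's calibration inputs. A stripped-down sketch of that core step follows; the record leaves the bit-width null, so bits=4 below is purely illustrative, and the real algorithm adds blocking and a Cholesky solve for speed and stability.

```python
import torch

def gptq_quantize(W, X, bits=4):
    # W: (out_features, in_features) weights; X: (n_samples, in_features)
    # calibration activations for this layer.
    W = W.clone().float()
    d = W.shape[1]
    H = X.T @ X / X.shape[0]                         # proxy Hessian (input covariance)
    H += 1e-2 * H.diagonal().mean() * torch.eye(d)   # damping for invertibility
    Hinv = torch.linalg.inv(H)
    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax  # per row
    Q = torch.zeros_like(W)
    for j in range(d):
        # Quantize column j with per-row symmetric scales.
        q = (W[:, j : j + 1] / scale).round().clamp(-qmax - 1, qmax) * scale
        Q[:, j] = q.squeeze(1)
        # Push this column's rounding error into the remaining columns.
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        W[:, j + 1 :] -= err.unsqueeze(1) * Hinv[j, j + 1 :].unsqueeze(0)
    return Q
```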
Novel Contributions
- TTT_CHUNK_SIZE=32 instead of the default 48
- Phased score-first LoRA TTT with rank 128
- Smaller TTT chunks increase the number of gradient updates per document during evaluation (see the sketch after this list)
- Builds on PR #1874 unchanged except for the chunk size, with improved validation BPB
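A quick check of the chunk-size effect on update counts, using an illustrative 2048-token document:

```python
import math

doc_len = 2048  # illustrative document length in tokens
for chunk in (48, 32):
    print(chunk, math.ceil(doc_len / chunk))
# chunk 48 -> 43 gradient updates per document; chunk 32 -> 64
```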