PR #1790

open

Record: SP8192 + SmearGate + AttnOutGate(w24) + LoRA-TTT Improvements + Phased TTT — val_bpb 1.06991 (3-seed mean)

by miaoyuxun
val_bpb
1.0699
Architecture
Transformer
Optimizer
SGD
Artifact Size
~15.9 MB

Training Techniques

Architecture
SmearGate
Forward token-smear gate that mixes each token with its predecessor, using a zero-initialized, transparent-at-init residual gate shared across layers.
parameters: {"width":12}
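A minimal sketch of the smear gate described above, assuming a particular parameterization: the gate reads a `width`-channel slice of the activation, and a zero-initialized scalar `scale` makes the layer transparent (output equals input) at init. The exact gate wiring in the PR is not stated, so `w_gate` and `scale` here are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, w_gate, scale):
    """Mix each token with its predecessor under a learned gate.

    x:      (T, D) token activations.
    w_gate: (gate_width,) weights over the first gate_width channels of x
            (hypothetical layout; the PR only gives width=12).
    scale:  scalar, zero-initialized so the layer is transparent at init,
            matching "zero-init transparent residual gating".
    """
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                      # the first token has no predecessor
    gw = w_gate.shape[0]
    g = sigmoid(x[:, :gw] @ w_gate)    # (T,) per-token smear gate
    return x + scale * g[:, None] * prev
```

With `scale = 0` the function returns `x` unchanged, which is the transparent initialization the record refers to.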
Gated Attention
Per-head multiplicative gate on the attention output, applied before out_proj, with a widened gate input.
parameters: {"width":24}
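A sketch of the per-head output gate, under the assumption that the gate is a sigmoid read off a widened (width=24) slice of the block input and multiplied into each head's output before out_proj. The slice-based layout of `w_gate` is an assumption; the PR only specifies the gate input width.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_attn_output(attn_out, x, w_gate):
    """Apply a per-head multiplicative gate before out_proj.

    attn_out: (T, H, Dh) per-head attention outputs, pre-out_proj.
    x:        (T, D) residual-stream input to the attention block.
    w_gate:   (gate_width, H) per-head gate weights over a widened
              gate_width=24 slice of x (hypothetical layout).
    """
    gw = w_gate.shape[0]
    g = sigmoid(x[:, :gw] @ w_gate)    # (T, H) per-head gate values
    return attn_out * g[:, :, None]    # multiplicative, one gate per head
```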
Test-Time Training
LoRA TTT
parameters: {"rank":128,"alpha":144,"warm_start_A":true,"weight_decay":1}
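A sketch of the LoRA test-time-training setup using the PR's stated hyperparameters (rank 128, alpha 144, weight decay 1). The alpha/rank output scaling and the decoupled-weight-decay SGD step are standard LoRA conventions assumed here, not details confirmed by the record; per the `warm_start_A` flag, A would be warm-started rather than randomly initialized, with B zero-initialized so the adapter starts as a no-op.

```python
import numpy as np

RANK, ALPHA = 128, 144          # from the PR's LoRA-TTT parameters

def lora_forward(x, W, A, B):
    """y = W x plus a rank-RANK low-rank update scaled by ALPHA / RANK.

    W: (d_out, d_in) frozen base weight.
    A: (RANK, d_in) down-projection (warm-started per warm_start_A).
    B: (d_out, RANK) up-projection (zero-init => adapter is a no-op).
    """
    return x @ W.T + (ALPHA / RANK) * (x @ A.T) @ B.T

def ttt_sgd_step(A, B, grad_A, grad_B, lr=1e-3, wd=1.0):
    """One SGD step with decoupled weight decay on the LoRA factors only;
    the frozen base weight W is never updated during test-time training."""
    A -= lr * (grad_A + wd * A)
    B -= lr * (grad_B + wd * B)
    return A, B
```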
Optimizer
SGD
weight_decay: null
momentum: null
other_params: {"global_ttt_lr":0.001,"phased_ttt_num_phases":3,"phased_ttt_prefix_docs":2000}
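A plain scheduling sketch of the phased global-SGD TTT implied by `phased_ttt_num_phases=3` and `phased_ttt_prefix_docs=2000`: the eval stream is split into contiguous phases, and within each phase the first `prefix_docs` documents feed global-SGD test-time training at `global_ttt_lr=1e-3`. The exact adapt/score interleaving inside a phase is an assumption, not confirmed by the record.

```python
def phased_ttt_schedule(docs, num_phases=3, prefix_docs=2000):
    """Split the eval stream into `num_phases` contiguous chunks.

    Within each phase, the first `prefix_docs` documents are the
    test-time-training set (global SGD, lr 1e-3 per the PR); the whole
    chunk is scored. Hypothetical scheduling sketch only.
    """
    n = len(docs)
    per = (n + num_phases - 1) // num_phases   # ceil split
    phases = []
    for p in range(num_phases):
        chunk = docs[p * per : (p + 1) * per]
        phases.append({"adapt": chunk[:prefix_docs], "score": chunk})
    return phases
```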
Regularization
weight decay
parameters: {"value":1}
Quantization
GPTQ
bits: null
scope: model
Sequence Length
sequence_length
train_length: null
eval_length: null

Novel Contributions

  • Combines SmearGate, widened attention output gating, LoRA TTT improvements, and phased global SGD TTT on the SP8192 base.
  • Applies phased TTT on top of the SmearGate + AttnGate architecture, rather than standard TTT.
  • Uses a warm-started A matrix for LoRA TTT, together with a higher rank, alpha scaling, and stronger weight decay.
  • Reports a 3-seed mean validation bpb of 1.06991 with an artifact size under 16 MB.