PR #1790
openRecord: SP8192 + SmearGate + AttnOutGate(w24) + LoRA-TTT Improvements + Phased TTT — val_bpb 1.06991 (3-seed mean)
by miaoyuxun · View on GitHub
val_bpb: 1.0699
Architecture: Transformer
Optimizer: SGD
Artifact Size: ~15.9 MB
Training Techniques
Architecture
SmearGate
A forward token smear gate with a zero-initialized, initially transparent residual gate, shared across layers.
parameters: {"width":12}
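A minimal NumPy sketch of how such a smear gate could work, assuming the width-12 parameter is a gate bottleneck and that "zero-init transparent" means the gate's output projection starts at zero (all names here are hypothetical, not from the PR):

```python
import numpy as np

def smear_gate(x, w_in, w_out):
    """Hypothetical forward token smear gate with a width-12 bottleneck.

    Each position mixes in the previous token's state through a learned
    per-channel gate. Because w_out is zero-initialized, the gate starts at
    exactly zero and the block is a transparent residual (output == x).
    In the record the gate parameters are shared across all layers.
    """
    seq, d = x.shape
    prev = np.vstack([np.zeros((1, d)), x[:-1]])   # previous-token states
    gate = np.tanh(prev @ w_in) @ w_out            # (seq, d) gate, zero at init
    return x + gate * prev                         # smear the previous token in

# zero-init makes the module transparent at the start of training
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
w_in = rng.standard_normal((8, 12)) * 0.02
w_out = np.zeros((12, 8))                          # zero-init => transparent
out = smear_gate(x, w_in, w_out)
```

With `w_out` at zero the module is an exact identity on `x`, so training can grow the smear contribution from nothing.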
Gated Attention
A per-head multiplicative gate on the attention output, applied before out_proj, with a widened gate input.
parameters: {"width":24}
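A hedged sketch of one plausible placement of this gate, assuming the width-24 parameter is a hidden layer in the gate path and that the gate is computed from the layer input (function and weight names are illustrative, not from the PR):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_attn_output(head_out, x, w_g1, w_g2, w_o):
    """Hypothetical per-head multiplicative attention output gate.

    head_out: (n_heads, seq, head_dim) per-head attention outputs
    x:        (seq, d_model) layer input used to compute the gate
    The gate input is widened to 24 hidden units (the record's width=24),
    producing one sigmoid gate per head that scales that head's output
    before the usual out_proj.
    """
    g = sigmoid(np.tanh(x @ w_g1) @ w_g2)              # (seq, n_heads) gates in (0, 1)
    gated = head_out * g.T[:, :, None]                 # per-head multiplicative gate
    concat = gated.transpose(1, 0, 2).reshape(x.shape[0], -1)
    return concat @ w_o                                # out_proj applied after gating

rng = np.random.default_rng(1)
n_heads, seq, head_dim, d_model = 4, 6, 2, 8
head_out = rng.standard_normal((n_heads, seq, head_dim))
x = rng.standard_normal((seq, d_model))
w_g1 = rng.standard_normal((d_model, 24)) * 0.02       # widened gate input (width 24)
w_g2 = rng.standard_normal((24, n_heads)) * 0.02
w_o = rng.standard_normal((n_heads * head_dim, d_model)) * 0.1
y = gated_attn_output(head_out, x, w_g1, w_g2, w_o)
```

The key structural point is that gating happens on each head's output individually, before the heads are concatenated and projected.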
Test-Time Training
LoRA TTT
parameters: {"rank":128,"alpha":144,"warm_start_A":true,"weight_decay":1}
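The LoRA hyperparameters follow the standard reparameterisation; a small sketch with the record's scaling (the reading of `warm_start_A` below is an assumption):

```python
import numpy as np

def lora_effective_weight(W, A, B, rank=128, alpha=144):
    """Standard LoRA reparameterisation used by LoRA-style TTT:

        W_eff = W + (alpha / rank) * B @ A

    With the record's rank=128 and alpha=144, the low-rank update is
    scaled by 144/128 = 1.125. `warm_start_A` presumably means A is
    initialised from a previous adaptation rather than at random (an
    assumption here); B is conventionally zero-initialised, so
    W_eff == W before any TTT steps are taken.
    """
    return W + (alpha / rank) * (B @ A)

rng = np.random.default_rng(2)
d_out, d_in, rank = 16, 16, 4                # small toy shapes for illustration
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((rank, d_in)) * 0.02 # stand-in for a warm-started A
B = np.zeros((d_out, rank))                  # zero-init B
W_eff = lora_effective_weight(W, A, B, rank=rank, alpha=4.5)
```

Zero-initialising B keeps the adapted model identical to the base model until test-time SGD begins updating the adapters.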
Optimizer
SGD
weight_decay: null
momentum: null
other_params: {"global_ttt_lr":0.001,"phased_ttt_num_phases":3,"phased_ttt_prefix_docs":2000}
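One possible reading of the phased-TTT parameters, sketched below: a fixed document prefix is set aside and the remaining eval stream is split into contiguous phases. This is an interpretation of the config keys, not the PR's actual implementation:

```python
def phased_ttt_phases(n_docs, num_phases=3, prefix_docs=2000):
    """Hypothetical phased-TTT document schedule.

    Assumes phased_ttt_prefix_docs=2000 reserves a prefix of the eval
    stream and phased_ttt_num_phases=3 splits the rest into contiguous
    phases, each adapted with global SGD at global_ttt_lr (0.001 in this
    record) before moving to the next.
    """
    prefix = list(range(min(prefix_docs, n_docs)))
    rest = list(range(len(prefix), n_docs))
    size = -(-len(rest) // num_phases) if rest else 0  # ceil division
    phases = [rest[i * size:(i + 1) * size] for i in range(num_phases)]
    return prefix, phases

prefix, phases = phased_ttt_phases(n_docs=5000)
```

Under this reading, every document lands in exactly one phase and the prefix is handled once up front.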
Regularization
weight decay
parameters: {"value":1}
Quantization
GPTQ
bits: null
scope: model
Sequence Length
sequence_length
train_length: null
eval_length: null
Novel Contributions
- Combines SmearGate, widened attention output gating, LoRA TTT improvements, and phased global SGD TTT on the SP8192 base.
- Applies phased TTT on top of the SmearGate + AttnGate architecture, rather than standard TTT.
- Uses warm-start A for LoRA TTT, higher rank, alpha scaling, and stronger weight decay.
- Reports a 3-seed mean validation bpb of 1.06991 with an artifact size under 16 MB.