PR #1920
Record: SP8192 PR #1874 + TTT_CHUNK_SIZE=32 — val_bpb 1.06990 (3-seed mean)
by bigbag
val_bpb
1.0699
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,950,196 bytes
Training Techniques
Test-Time Training
LoRA TTT
parameters: {"rank":128,"phased":true,"score_first":true,"chunk_size":32}
Architecture
SmearGate
Per-layer smoothing gate used with attention output gating.
parameters: {"width":24}
Gated Attention
Learned gating applied to each layer's attention output.
parameters: {"width":24}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"Newton-Schulz":true,"Polar Express":true}
LR Schedule
warmdown
parameters: {"min_lr":0.1}
Quantization
GPTQ
bits: null
scope: model weights
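GPTQ quantizes each linear layer's weights column by column, folding each column's rounding error into the not-yet-quantized columns via the inverse Hessian of the layer's calibration inputs. A stripped-down sketch of that core step follows; the record leaves the bit-width null, so bits=4 below is purely illustrative, and the real algorithm adds blocking and a Cholesky solve for speed and stability.

```python
import torch

def gptq_quantize(W, X, bits=4):
    # W: (out_features, in_features) weights; X: (n_samples, in_features)
    # calibration activations for this layer.
    W = W.clone().float()
    d = W.shape[1]
    H = X.T @ X / X.shape[0]                         # proxy Hessian (input covariance)
    H += 1e-2 * H.diagonal().mean() * torch.eye(d)   # damping for invertibility
    Hinv = torch.linalg.inv(H)
    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax  # per row
    Q = torch.zeros_like(W)
    for j in range(d):
        # Quantize column j with per-row symmetric scales.
        q = (W[:, j : j + 1] / scale).round().clamp(-qmax - 1, qmax) * scale
        Q[:, j] = q.squeeze(1)
        # Push this column's rounding error into the remaining columns.
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        W[:, j + 1 :] -= err.unsqueeze(1) * Hinv[j, j + 1 :].unsqueeze(0)
    return Q
```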
Novel Contributions
- TTT_CHUNK_SIZE=32 instead of the default 48
- Phased score-first LoRA TTT with rank 128
- Smaller TTT chunks increase the number of gradient updates per document during evaluation (see the sketch after this list)
- Builds on PR #1874 unchanged except for the chunk size, with improved validation BPB
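A quick check of the chunk-size effect on update counts, using an illustrative 2048-token document:

```python
import math

doc_len = 2048  # illustrative document length in tokens
for chunk in (48, 32):
    print(chunk, math.ceil(doc_len / chunk))
# chunk 48 -> 43 gradient updates per document; chunk 32 -> 64
```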