PR #1700 (open)
Add SP8192 Multi-Phase Global SGD + Phased TTT (1.07219 bpb)
by jorge-asenjo
val_bpb
1.0722
Architecture
Transformer
Optimizer
Muon
Artifact Size
16MB
Training Techniques
Test-Time Training
score-first TTT
parameters: {"phased":true,"num_phases":3}
LoRA TTT
parameters: {"phased":true}
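A minimal sketch of what LoRA test-time training can look like: low-rank factors A and B are attached to a frozen base weight, and only those factors receive SGD updates on test data. The class name, rank, and loss here are illustrative assumptions, not the PR's implementation.

```python
import numpy as np

# Hypothetical minimal LoRA adapter on a frozen linear layer: only the
# low-rank factors A and B are updated at test time; W stays frozen.
class LoRALinear:
    def __init__(self, W, rank=4, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                         # frozen base weight, (out, in)
        self.A = rng.normal(0, 0.1, (rank, W.shape[1]))    # (r, in)
        self.B = np.zeros((W.shape[0], rank))              # (out, r), zero-init so the
        self.alpha = alpha                                 # adapter starts as a no-op

    def forward(self, x):
        # effective weight = W + alpha * B @ A
        return x @ (self.W + self.alpha * self.B @ self.A).T

    def ttt_step(self, x, target, lr=0.1):
        # one SGD step on mean squared error; gradients flow only to A and B
        pred = self.forward(x)
        err = pred - target                    # (n, out)
        dW_eff = err.T @ x / len(x)            # gradient w.r.t. the effective weight
        dB = self.alpha * dW_eff @ self.A.T
        dA = self.alpha * self.B.T @ dW_eff
        self.A -= lr * dA
        self.B -= lr * dB
        return float((err ** 2).mean())
```

The "phased" variant would interleave such adapter steps between scoring phases instead of updating continuously.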
Architecture
depth recurrence
Layers 3-5 are looped with warmup during training/inference.
parameters: {"layers":[3,4,5]}
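The looped-layer idea can be sketched as a forward pass that applies a contiguous block of layers multiple times. The span indices and loop count below mirror the listed parameters; the warmup schedule (ramping the loop count up over training) is omitted for brevity.

```python
# Sketch of depth recurrence: a contiguous block of layers (here indices
# 3-5) is applied num_loops times instead of once. In the PR the loop
# count is warmed up during training/inference; that schedule is elided.
def forward(x, layers, loop_span=(3, 5), num_loops=3):
    lo, hi = loop_span
    for layer in layers[:lo]:           # layers before the looped block
        x = layer(x)
    for _ in range(num_loops):          # recur over the looped block
        for layer in layers[lo:hi + 1]:
            x = layer(x)
    for layer in layers[hi + 1:]:       # layers after the looped block
        x = layer(x)
    return x
```

This reuses the block's parameters, adding effective depth at zero parameter cost.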
Quantization
GPTQ
bits: 7
scope: embeddings and per-layer weights
int7
bits: 7
scope: embeddings
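For the int7 embedding quantization, a plausible scheme is symmetric 7-bit quantization with a per-row scale, mapping values into [-63, 63]. This is a sketch under that assumption; the PR's exact scaling granularity is not specified here.

```python
import numpy as np

# Hypothetical symmetric int7 quantization for an embedding table:
# each row is mapped to integers in [-63, 63] with one scale per row.
def quantize_int7(emb):
    scale = np.abs(emb).max(axis=1, keepdims=True) / 63.0
    scale = np.where(scale == 0, 1.0, scale)           # avoid divide-by-zero
    q = np.clip(np.round(emb / scale), -63, 63).astype(np.int8)
    return q, scale

def dequantize_int7(q, scale):
    # reconstruction error is at most scale/2 per element
    return q.astype(np.float32) * scale
```

Storing int8 values plus one scale per row cuts the embedding table to roughly a quarter of its float32 size.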
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"matrix_lr":0.026}
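A Muon-style update, assumed from the listed hyperparameters, maintains a momentum buffer and approximately orthogonalizes the 2D update via a Newton-Schulz iteration before applying the matrix learning rate. The quintic coefficients below are taken from the public Muon reference implementation; everything else is a simplified sketch.

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Quintic Newton-Schulz iteration that pushes the singular values of G
    # toward 1 (coefficients from the public Muon reference implementation).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)       # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.026, momentum=0.97):
    # Momentum accumulation, then orthogonalized update scaled by matrix_lr.
    buf = momentum * buf + grad
    W = W - lr * newton_schulz(buf)
    return W, buf
```

The defaults (lr=0.026, momentum=0.97, no weight decay) match the submission's listed settings.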
Compression
brotli
level: null
Other
other
SP-8192 tokenizer with 8192-vocab SentencePiece BPE.
parameters: {"vocab_size":8192}
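The PR presumably trains the SP-8192 vocabulary with SentencePiece's BPE trainer; the toy loop below is a from-scratch sketch of the underlying BPE idea (start from characters, repeatedly merge the most frequent adjacent pair until the vocabulary reaches its target size), not that implementation.

```python
from collections import Counter

# Toy BPE trainer illustrating the idea behind an 8192-vocab BPE tokenizer.
# Real training would use SentencePiece with model_type="bpe", vocab_size=8192.
def train_bpe(text, vocab_size):
    vocab = sorted(set(text))                 # start from the character set
    tokens = list(text)
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent adjacent pair
        merges.append((a, b))
        vocab.append(a + b)
        merged, i = [], 0                     # re-tokenize with the new merge
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return vocab, merges
```

A small vocabulary like 8192 trades longer token sequences for a much smaller embedding table, which matters under a 16MB artifact budget.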
other
Multi-phase global SGD at test time: the validation stream is split into phases; within each phase, all chunks are scored first under no_grad, and only then are the base weights updated with SGD on those already-scored tokens, so an update never influences the score of a token in its own phase.
parameters: {"num_phases":3}
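The score-then-update ordering can be sketched as follows. The toy mean-predictor model and chunking below are illustrative assumptions standing in for the transformer and its loss; the control flow (freeze, score a whole phase, then SGD on the scored tokens before the next phase) is the point.

```python
import numpy as np

class MeanModel:
    # Toy stand-in for the real model: predicts a single running mean.
    def __init__(self):
        self.mu = 0.0
    def loss(self, chunk):                 # scoring pass (no updates)
        return float(np.mean((chunk - self.mu) ** 2))
    def sgd_step(self, chunk, lr=0.1):     # gradient step on the same loss
        self.mu -= lr * 2.0 * np.mean(self.mu - chunk)

def phased_eval(stream, model, num_phases=3, lr=0.1):
    # Sketch of multi-phase test-time SGD: each phase is fully scored with
    # weights frozen, then those already-scored tokens drive SGD updates
    # that only benefit later phases.
    phases = np.array_split(stream, num_phases)
    losses = []
    for phase in phases:
        # 1) score first: no updates while this phase is being evaluated
        losses.extend(model.loss(c) for c in np.array_split(phase, 4))
        # 2) then train on the tokens that were just scored
        for c in np.array_split(phase, 4):
            model.sgd_step(c, lr)
    return float(np.mean(losses))
```

Because updates only ever use tokens whose scores are already recorded, later phases see an adapted model without any score being computed on weights that saw that token.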
Novel Contributions
- Multi-phase global SGD at test time with score-before-update legality
- Phased LoRA test-time training
- SP-8192 tokenizer
- Int7 embedding quantization
- Per-layer GPTQ with sigma clipping
- Muon optimizer with tuned momentum and matrix learning rate
- Depth recurrence
- VarLen flash attention
- Fused triton MLP
- Brotli-compressed artifact