PR #1700

open

Add SP8192 Multi-Phase Global SGD + Phased TTT (1.07219 bpb)

by jorge-asenjo
val_bpb: 1.0722
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16 MB

Training Techniques

Test-Time Training
  • score-first TTT (parameters: {"phased":true,"num_phases":3})
  • LoRA TTT (parameters: {"phased":true})
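The LoRA TTT entry can be illustrated with a minimal sketch: the frozen weight is augmented with a low-rank delta B @ A, and only the adapters would be trained at test time ("phased" presumably resetting them per phase). The shapes, names, and pure-Python matrices below are illustrative assumptions, not this PR's implementation.

```python
# Toy LoRA forward pass (hypothetical): effective weight is W + alpha * (B @ A).
# In a real phased TTT loop, W stays frozen and only A and B receive gradients.

def matmul(X, Y):
    # naive dense matmul over nested lists, for illustration only
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_forward(x, W, A, B, alpha=1.0):
    delta = matmul(B, A)                      # low-rank update, rank = len(A)
    W_eff = [[W[i][j] + alpha * delta[i][j] for j in range(len(W[0]))]
             for i in range(len(W))]
    return matmul(x, W_eff)

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight
B = [[0.0], [1.0]]             # d_in x r adapter (r = 1)
A = [[0.5, 0.0]]               # r x d_out adapter
x = [[1.0, 2.0]]
out = lora_forward(x, W, A, B)
```

A "phased" variant would simply re-initialize A and B at each phase boundary so adaptation never leaks across phases.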
Architecture
  • depth recurrence: layers 3-5 are looped, with a warmup schedule, during both training and inference (parameters: {"layers":[3,4,5]})
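The looped-layers idea can be sketched as follows. The loop count, the linear warmup, and the placement of the loop are assumptions for illustration; the toy "layers" just record their index so the execution order is visible.

```python
# Sketch of depth recurrence (assumed mechanics): layers 3-5 form a block that
# is applied num_loops times; num_loops is warmed up over training steps.

def recurrent_forward(x, layers, loop_idxs=(3, 4, 5), num_loops=2):
    for i, layer in enumerate(layers):
        if i == loop_idxs[0]:
            for _ in range(num_loops):        # run the looped block repeatedly
                for j in loop_idxs:
                    x = layers[j](x)
        elif i in loop_idxs:
            continue                          # already run inside the loop above
        else:
            x = layer(x)
    return x

def loop_warmup(step, warmup_steps=1000, max_loops=2):
    # linearly ramp the loop count from 1 to max_loops (assumed schedule)
    return 1 + min(step, warmup_steps) * (max_loops - 1) // warmup_steps

layers = [lambda x, k=k: x + [k] for k in range(7)]  # toy layers logging their index
out = recurrent_forward([], layers, num_loops=2)
```

With two loops, the trace visits layers 0-2 once, 3-5 twice, then layer 6, which is the weight-sharing effect depth recurrence buys at a fixed artifact size.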
Quantization
  • GPTQ (bits: 7; scope: embeddings and per-layer weights)
  • int7 (bits: 7; scope: embeddings)
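The int7 scheme is not spelled out in the PR; a common choice, sketched here as an assumption, is symmetric per-row quantization: each embedding row is scaled into the signed 7-bit range [-63, 63] and stored with one float scale.

```python
# Hypothetical symmetric int7 quantization of one embedding row.

def quantize_int7(row):
    amax = max(abs(v) for v in row) or 1.0    # guard all-zero rows
    scale = amax / 63.0
    q = [max(-63, min(63, round(v / scale))) for v in row]
    return q, scale

def dequantize_int7(q, scale):
    return [v * scale for v in q]

row = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int7(row)
approx = dequantize_int7(q, scale)   # per-value error is bounded by ~scale/2
```

Seven bits rather than eight trades a little reconstruction error for a smaller compressed artifact, which matters when the submission is scored on total size.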
Optimizer
  • Muon (weight_decay: null; momentum: 0.97; other_params: {"matrix_lr":0.026})
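Muon's defining move is orthogonalizing the momentum-averaged gradient with a Newton-Schulz iteration before applying the matrix learning rate. The sketch below is an assumption about the general recipe, not this PR's code: Muon proper uses a tuned quintic iteration, while the simpler cubic variant here suffices to show the idea on toy 2x2 matrices.

```python
# Muon-style step sketch: momentum buffer -> Newton-Schulz orthogonalization
# -> update scaled by matrix_lr. weight_decay is null in this PR, so omitted.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def newton_schulz_orth(G, steps=10):
    # normalize by the Frobenius norm (spectral norm <= 1), then iterate
    # X <- 1.5 X - 0.5 (X X^T) X toward the nearest orthogonal matrix
    fro = sum(v * v for row in G for v in row) ** 0.5
    X = [[v / fro for v in row] for row in G]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * X[i][j] - 0.5 * XXtX[i][j]
              for j in range(len(X[0]))] for i in range(len(X))]
    return X

def muon_step(W, grad, buf, momentum=0.97, matrix_lr=0.026):
    buf = [[momentum * buf[i][j] + grad[i][j]
            for j in range(len(W[0]))] for i in range(len(W))]
    O = newton_schulz_orth(buf)
    W = [[W[i][j] - matrix_lr * O[i][j]
          for j in range(len(W[0]))] for i in range(len(W))]
    return W, buf

O = newton_schulz_orth([[2.0, 0.0], [1.0, 1.0]])  # result is near-orthogonal
```

Because the orthogonalized update has all singular values near 1, matrix_lr directly sets the step size per direction, which is why it is tuned separately from momentum.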
Compression
  • brotli (level: null)
Other
  • SP-8192 tokenizer: SentencePiece BPE with an 8192-token vocabulary (parameters: {"vocab_size":8192})
  • Multi-phase global SGD at test time: validation data is split into phases; within each phase, all chunks are first scored under no_grad, and only then are the base weights updated with SGD on the already-scored tokens (parameters: {"num_phases":3})
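The score-before-update protocol described above can be sketched with a toy scalar "model". Everything here is illustrative: the real submission scores transformer log-probs under no_grad and then runs SGD on base weights, but the phase ordering is the same, so no chunk is ever scored by a model that has already trained on it.

```python
# Sketch of multi-phase global SGD at test time (toy model: one scalar weight
# w fit by squared error; stands in for the transformer and its bpb loss).

def score(w, chunk):
    # frozen-weight scoring pass (the no_grad phase in the real submission)
    return sum((w - x) ** 2 for x in chunk) / len(chunk)

def sgd_update(w, chunk, lr=0.1):
    grad = sum(2 * (w - x) for x in chunk) / len(chunk)
    return w - lr * grad

def multi_phase_eval(w, data, num_phases=3):
    phase_len = len(data) // num_phases
    losses = []
    for p in range(num_phases):
        phase = data[p * phase_len:(p + 1) * phase_len]
        # 1) score every chunk in the phase with frozen weights
        losses += [score(w, chunk) for chunk in phase]
        # 2) only then train on the already-scored chunks
        for chunk in phase:
            w = sgd_update(w, chunk)
    return losses, w

data = [[0.0, 1.0], [2.0, 3.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [0.0, 2.0]]
losses, w_final = multi_phase_eval(0.0, data, num_phases=3)
```

Later phases are scored by weights already adapted on earlier phases, which is where the bpb gain comes from, while the score-first ordering inside each phase preserves evaluation legality.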

Novel Contributions

  • Multi-phase global SGD at test time with score-before-update legality
  • Phased LoRA test-time training
  • SP-8192 tokenizer
  • Int7 embedding quantization
  • Per-layer GPTQ with sigma clipping
  • Muon optimizer with tuned momentum and matrix learning rate
  • Depth recurrence
  • VarLen flash attention
  • Fused triton MLP
  • Brotli-compressed artifact