PR #1700

open

Add SP8192 Multi-Phase Global SGD + Phased TTT (1.07219 bpb)

by jorge-asenjo
val_bpb: 1.0722
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16 MB

Training Techniques

Test-Time Training
  • score-first TTT (parameters: {"phased":true,"num_phases":3})
  • LoRA TTT (parameters: {"phased":true})
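The LoRA TTT entry can be illustrated with a minimal sketch: the frozen weight is augmented with a low-rank delta B @ A, and only the adapters would be trained at test time ("phased" presumably resetting them per phase). The shapes, names, and pure-Python matrices below are illustrative assumptions, not this PR's implementation.

```python
# Toy LoRA forward pass (hypothetical): effective weight is W + alpha * (B @ A).
# In a real phased TTT loop, W stays frozen and only A and B receive gradients.

def matmul(X, Y):
    # naive dense matmul over nested lists, for illustration only
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_forward(x, W, A, B, alpha=1.0):
    delta = matmul(B, A)                      # low-rank update, rank = len(A)
    W_eff = [[W[i][j] + alpha * delta[i][j] for j in range(len(W[0]))]
             for i in range(len(W))]
    return matmul(x, W_eff)

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight
B = [[0.0], [1.0]]             # d_in x r adapter (r = 1)
A = [[0.5, 0.0]]               # r x d_out adapter
x = [[1.0, 2.0]]
out = lora_forward(x, W, A, B)
```

A "phased" variant would simply re-initialize A and B at each phase boundary so adaptation never leaks across phases.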
Architecture
  • depth recurrence: layers 3-5 are looped, with a warmup schedule, during both training and inference (parameters: {"layers":[3,4,5]})
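The looped-layers idea can be sketched as follows. The loop count, the linear warmup, and the placement of the loop are assumptions for illustration; the toy "layers" just record their index so the execution order is visible.

```python
# Sketch of depth recurrence (assumed mechanics): layers 3-5 form a block that
# is applied num_loops times; num_loops is warmed up over training steps.

def recurrent_forward(x, layers, loop_idxs=(3, 4, 5), num_loops=2):
    for i, layer in enumerate(layers):
        if i == loop_idxs[0]:
            for _ in range(num_loops):        # run the looped block repeatedly
                for j in loop_idxs:
                    x = layers[j](x)
        elif i in loop_idxs:
            continue                          # already run inside the loop above
        else:
            x = layer(x)
    return x

def loop_warmup(step, warmup_steps=1000, max_loops=2):
    # linearly ramp the loop count from 1 to max_loops (assumed schedule)
    return 1 + min(step, warmup_steps) * (max_loops - 1) // warmup_steps

layers = [lambda x, k=k: x + [k] for k in range(7)]  # toy layers logging their index
out = recurrent_forward([], layers, num_loops=2)
```

With two loops, the trace visits layers 0-2 once, 3-5 twice, then layer 6, which is the weight-sharing effect depth recurrence buys at a fixed artifact size.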
Quantization
  • GPTQ (bits: 7; scope: embeddings and per-layer weights)
  • int7 (bits: 7; scope: embeddings)
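The int7 scheme is not spelled out in the PR; a common choice, sketched here as an assumption, is symmetric per-row quantization: each embedding row is scaled into the signed 7-bit range [-63, 63] and stored with one float scale.

```python
# Hypothetical symmetric int7 quantization of one embedding row.

def quantize_int7(row):
    amax = max(abs(v) for v in row) or 1.0    # guard all-zero rows
    scale = amax / 63.0
    q = [max(-63, min(63, round(v / scale))) for v in row]
    return q, scale

def dequantize_int7(q, scale):
    return [v * scale for v in q]

row = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int7(row)
approx = dequantize_int7(q, scale)   # per-value error is bounded by ~scale/2
```

Seven bits rather than eight trades a little reconstruction error for a smaller compressed artifact, which matters when the submission is scored on total size.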
Optimizer
  • Muon (weight_decay: null; momentum: 0.97; other_params: {"matrix_lr":0.026})
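Muon's defining move is orthogonalizing the momentum-averaged gradient with a Newton-Schulz iteration before applying the matrix learning rate. The sketch below is an assumption about the general recipe, not this PR's code: Muon proper uses a tuned quintic iteration, while the simpler cubic variant here suffices to show the idea on toy 2x2 matrices.

```python
# Muon-style step sketch: momentum buffer -> Newton-Schulz orthogonalization
# -> update scaled by matrix_lr. weight_decay is null in this PR, so omitted.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def newton_schulz_orth(G, steps=10):
    # normalize by the Frobenius norm (spectral norm <= 1), then iterate
    # X <- 1.5 X - 0.5 (X X^T) X toward the nearest orthogonal matrix
    fro = sum(v * v for row in G for v in row) ** 0.5
    X = [[v / fro for v in row] for row in G]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * X[i][j] - 0.5 * XXtX[i][j]
              for j in range(len(X[0]))] for i in range(len(X))]
    return X

def muon_step(W, grad, buf, momentum=0.97, matrix_lr=0.026):
    buf = [[momentum * buf[i][j] + grad[i][j]
            for j in range(len(W[0]))] for i in range(len(W))]
    O = newton_schulz_orth(buf)
    W = [[W[i][j] - matrix_lr * O[i][j]
          for j in range(len(W[0]))] for i in range(len(W))]
    return W, buf

O = newton_schulz_orth([[2.0, 0.0], [1.0, 1.0]])  # result is near-orthogonal
```

Because the orthogonalized update has all singular values near 1, matrix_lr directly sets the step size per direction, which is why it is tuned separately from momentum.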
Compression
  • brotli (level: null)
Other
  • SP-8192 tokenizer: SentencePiece BPE with an 8192-token vocabulary (parameters: {"vocab_size":8192})
  • Multi-phase global SGD at test time: validation data is split into phases; within each phase, all chunks are first scored under no_grad, and only then are the base weights updated with SGD on the already-scored tokens (parameters: {"num_phases":3})
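The score-before-update protocol described above can be sketched with a toy scalar "model". Everything here is illustrative: the real submission scores transformer log-probs under no_grad and then runs SGD on base weights, but the phase ordering is the same, so no chunk is ever scored by a model that has already trained on it.

```python
# Sketch of multi-phase global SGD at test time (toy model: one scalar weight
# w fit by squared error; stands in for the transformer and its bpb loss).

def score(w, chunk):
    # frozen-weight scoring pass (the no_grad phase in the real submission)
    return sum((w - x) ** 2 for x in chunk) / len(chunk)

def sgd_update(w, chunk, lr=0.1):
    grad = sum(2 * (w - x) for x in chunk) / len(chunk)
    return w - lr * grad

def multi_phase_eval(w, data, num_phases=3):
    phase_len = len(data) // num_phases
    losses = []
    for p in range(num_phases):
        phase = data[p * phase_len:(p + 1) * phase_len]
        # 1) score every chunk in the phase with frozen weights
        losses += [score(w, chunk) for chunk in phase]
        # 2) only then train on the already-scored chunks
        for chunk in phase:
            w = sgd_update(w, chunk)
    return losses, w

data = [[0.0, 1.0], [2.0, 3.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [0.0, 2.0]]
losses, w_final = multi_phase_eval(0.0, data, num_phases=3)
```

Later phases are scored by weights already adapted on earlier phases, which is where the bpb gain comes from, while the score-first ordering inside each phase preserves evaluation legality.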

Novel Contributions

  • Multi-phase global SGD at test time with score-before-update legality
  • Phased LoRA test-time training
  • SP-8192 tokenizer
  • Int7 embedding quantization
  • Per-layer GPTQ with sigma clipping
  • Muon optimizer with tuned momentum and matrix learning rate
  • Depth recurrence
  • VarLen flash attention
  • Fused triton MLP
  • Brotli-compressed artifact