PR #1695

open

[Record] Stage 3 + SpinQuant V1 + MP-SGD-TTT — val_bpb 1.0759

by X-Abhishek-X
val_bpb: 1.0759
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15,698,706 B

Training Techniques

Architecture
weight tying
Banked Stage 3 architecture; tied input/output embeddings are implied by the tokenizer/model setup.
parameters: null
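Weight tying means the token-embedding matrix and the output (unembedding) head share one parameter tensor, which matters here because artifact size is part of the record. A minimal numpy sketch, with illustrative dimensions not taken from the submission:

```python
import numpy as np

# Hypothetical weight-tying sketch: one matrix E serves as both the
# input embedding table and the output projection, so only one copy
# of the vocab-by-d_model parameters appears in the stored artifact.
rng = np.random.default_rng(0)
vocab, d_model = 1000, 64
E = rng.normal(size=(vocab, d_model))  # the single shared matrix

def embed(token_ids):
    # Input side: look up token rows of E.
    return E[token_ids]

def logits(hidden):
    # Output side: reuse E (transposed) as the unembedding projection.
    return hidden @ E.T
```

Because both directions read the same array, any update to `E` is reflected in embedding lookups and output logits alike.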
Quantization
GPTQ
bits: 6
scope: block weights
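For orientation, a minimal sketch of the 6-bit grid that GPTQ targets. This is plain per-row round-to-nearest; GPTQ proper adds second-order, error-compensated column updates, which this sketch omits:

```python
import numpy as np

def quantize_int6(W: np.ndarray):
    # Per-row symmetric quantization to the signed 6-bit grid [-31, 31].
    qmax = 2 ** (6 - 1) - 1  # 31
    scale = np.max(np.abs(W), axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(W / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Reconstruct approximate weights; error is at most half a grid step.
    return q.astype(np.float64) * scale
```

The per-row scale is set by the largest-magnitude entry in that row, which is exactly why suppressing outliers (below) shrinks the step size and the quantization error.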
Other
other
SpinQuant V1 Hadamard rotation applied before quantization to reduce outlier impact and quantization error in banked weight layouts.
parameters: {"enabled":true}
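The core SpinQuant idea can be sketched as follows: right-multiply a weight matrix by an orthogonal Hadamard rotation, which smears any outlier column across all channels and shrinks the per-row quantization range. The function names are illustrative, not the submission's code:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    # Sylvester construction of an n x n Hadamard matrix (n a power of two).
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def rotate_weights(W: np.ndarray) -> np.ndarray:
    # Apply a normalized Hadamard rotation R (orthogonal: R @ R.T = I).
    # Since R is orthogonal, W @ R preserves the layer's function once the
    # inverse rotation is folded into the adjacent layer; the rotation can
    # be "baked into" the stored weights before quantization.
    n = W.shape[1]
    R = hadamard(n) / np.sqrt(n)
    return W @ R
```

An outlier confined to one column of `W` is spread over all `n` columns of `W @ R`, reducing its peak magnitude by a factor of `sqrt(n)` and therefore tightening the INT6 scales.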
Test-Time Training
MP-SGD-TTT
parameters: {"prefix_docs":2000,"num_phases":3,"learning_rate":0.001,"momentum":0.9}
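A plausible reading of MP-SGD-TTT, sketched below as multiple phases of momentum SGD over a prefix of the evaluation stream, with base-model weights updated in place before scoring. The function name, `grad_fn` interface, and loop structure are assumptions; only the hyperparameters (3 phases, lr 0.001, momentum 0.9, 2000 prefix docs) come from the record:

```python
def mp_sgd_ttt(params, grad_fn, prefix_docs, num_phases=3,
               lr=1e-3, momentum=0.9):
    # Multi-phase momentum-SGD test-time training (hypothetical sketch).
    # Each phase is one pass over the prefix documents; gradients come
    # from a caller-supplied grad_fn(params, doc).
    velocity = [0.0 for _ in params]
    for _ in range(num_phases):
        for doc in prefix_docs:
            grads = grad_fn(params, doc)
            for i, g in enumerate(grads):
                velocity[i] = momentum * velocity[i] + g
                params[i] = params[i] - lr * velocity[i]
    return params
```

On a toy quadratic objective this converges as ordinary heavy-ball SGD would; in the submission the same loop would presumably run against the language-model loss on the eval prefix.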
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"phased":true,"base_model_weight_updates":true}
LR Schedule
warmdown
parameters: {"warmdown_frac":0.75}
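Assuming `warmdown_frac` means the fraction of training spent in the decay phase (a common convention, but not stated in the record), the schedule can be sketched as:

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_frac: float = 0.75) -> float:
    # Hold base_lr for the first (1 - warmdown_frac) of training, then
    # decay linearly to zero over the final warmdown_frac of steps.
    start = int(total_steps * (1.0 - warmdown_frac))
    if step < start:
        return base_lr
    frac = (step - start) / max(total_steps - start, 1)
    return base_lr * (1.0 - frac)
```

With `warmdown_frac=0.75`, a 100-step run holds the base rate for 25 steps and then ramps down over the remaining 75.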
Regularization
logit softcap
parameters: {"parallel_lambda_asym":0}
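Logit softcapping bounds logits smoothly with a scaled tanh, keeping extreme logits from dominating the loss while staying near-identity for small values. A minimal sketch; the cap value is illustrative, not taken from the submission:

```python
import numpy as np

def softcap(logits: np.ndarray, cap: float = 15.0) -> np.ndarray:
    # Smoothly squash logits into (-cap, cap). For |x| << cap this is
    # approximately the identity; for large |x| it saturates at +/- cap,
    # keeping gradients finite. cap=15.0 is an illustrative choice.
    return cap * np.tanh(logits / cap)
```

Because `tanh(x/cap) ≈ x/cap` for small inputs, ordinary logits pass through nearly unchanged, while a runaway logit of 1e6 is clipped to just under the cap.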
Sequence Length
sequence_length
train_length: 32768
eval_length: null
Compression
brotli
level: null

Novel Contributions

  • SpinQuant V1 ported to Stage 3 banked architecture with per-slot rotation baked into weights
  • Composition of SpinQuant with MP-SGD-TTT
  • Reduced quantization error by suppressing outliers before INT6 GPTQ
  • Record validation BPB, improving on the prior submission