PR #1695

open

[Record] Stage 3 + SpinQuant V1 + MP-SGD-TTT — val_bpb 1.0759

by X-Abhishek-X
val_bpb: 1.0759
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15,698,706 B

Training Techniques

Architecture
weight tying
Banked Stage 3 architecture; tied input/output embeddings are implied by the tokenizer/model setup.
parameters: null
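Weight tying means the token-embedding matrix and the output (unembedding) head share one parameter tensor, which matters here because artifact size is part of the record. A minimal numpy sketch, with illustrative dimensions not taken from the submission:

```python
import numpy as np

# Hypothetical weight-tying sketch: one matrix E serves as both the
# input embedding table and the output projection, so only one copy
# of the vocab-by-d_model parameters appears in the stored artifact.
rng = np.random.default_rng(0)
vocab, d_model = 1000, 64
E = rng.normal(size=(vocab, d_model))  # the single shared matrix

def embed(token_ids):
    # Input side: look up token rows of E.
    return E[token_ids]

def logits(hidden):
    # Output side: reuse E (transposed) as the unembedding projection.
    return hidden @ E.T
```

Because both directions read the same array, any update to `E` is reflected in embedding lookups and output logits alike.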
Quantization
GPTQ
bits: 6
scope: block weights
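For orientation, a minimal sketch of the 6-bit grid that GPTQ targets. This is plain per-row round-to-nearest; GPTQ proper adds second-order, error-compensated column updates, which this sketch omits:

```python
import numpy as np

def quantize_int6(W: np.ndarray):
    # Per-row symmetric quantization to the signed 6-bit grid [-31, 31].
    qmax = 2 ** (6 - 1) - 1  # 31
    scale = np.max(np.abs(W), axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(W / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Reconstruct approximate weights; error is at most half a grid step.
    return q.astype(np.float64) * scale
```

The per-row scale is set by the largest-magnitude entry in that row, which is exactly why suppressing outliers (below) shrinks the step size and the quantization error.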
Other
other
SpinQuant V1 Hadamard rotation applied before quantization to reduce outlier impact and quantization error in banked weight layouts.
parameters: {"enabled":true}
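The core SpinQuant idea can be sketched as follows: right-multiply a weight matrix by an orthogonal Hadamard rotation, which smears any outlier column across all channels and shrinks the per-row quantization range. The function names are illustrative, not the submission's code:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    # Sylvester construction of an n x n Hadamard matrix (n a power of two).
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def rotate_weights(W: np.ndarray) -> np.ndarray:
    # Apply a normalized Hadamard rotation R (orthogonal: R @ R.T = I).
    # Since R is orthogonal, W @ R preserves the layer's function once the
    # inverse rotation is folded into the adjacent layer; the rotation can
    # be "baked into" the stored weights before quantization.
    n = W.shape[1]
    R = hadamard(n) / np.sqrt(n)
    return W @ R
```

An outlier confined to one column of `W` is spread over all `n` columns of `W @ R`, reducing its peak magnitude by a factor of `sqrt(n)` and therefore tightening the INT6 scales.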
Test-Time Training
MP-SGD-TTT
parameters: {"prefix_docs":2000,"num_phases":3,"learning_rate":0.001,"momentum":0.9}
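A plausible reading of MP-SGD-TTT, sketched below as multiple phases of momentum SGD over a prefix of the evaluation stream, with base-model weights updated in place before scoring. The function name, `grad_fn` interface, and loop structure are assumptions; only the hyperparameters (3 phases, lr 0.001, momentum 0.9, 2000 prefix docs) come from the record:

```python
def mp_sgd_ttt(params, grad_fn, prefix_docs, num_phases=3,
               lr=1e-3, momentum=0.9):
    # Multi-phase momentum-SGD test-time training (hypothetical sketch).
    # Each phase is one pass over the prefix documents; gradients come
    # from a caller-supplied grad_fn(params, doc).
    velocity = [0.0 for _ in params]
    for _ in range(num_phases):
        for doc in prefix_docs:
            grads = grad_fn(params, doc)
            for i, g in enumerate(grads):
                velocity[i] = momentum * velocity[i] + g
                params[i] = params[i] - lr * velocity[i]
    return params
```

On a toy quadratic objective this converges as ordinary heavy-ball SGD would; in the submission the same loop would presumably run against the language-model loss on the eval prefix.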
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"phased":true,"base_model_weight_updates":true}
LR Schedule
warmdown
parameters: {"warmdown_frac":0.75}
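Assuming `warmdown_frac` means the fraction of training spent in the decay phase (a common convention, but not stated in the record), the schedule can be sketched as:

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_frac: float = 0.75) -> float:
    # Hold base_lr for the first (1 - warmdown_frac) of training, then
    # decay linearly to zero over the final warmdown_frac of steps.
    start = int(total_steps * (1.0 - warmdown_frac))
    if step < start:
        return base_lr
    frac = (step - start) / max(total_steps - start, 1)
    return base_lr * (1.0 - frac)
```

With `warmdown_frac=0.75`, a 100-step run holds the base rate for 25 steps and then ramps down over the remaining 75.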
Regularization
logit softcap
parameters: {"parallel_lambda_asym":0}
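Logit softcapping bounds logits smoothly with a scaled tanh, keeping extreme logits from dominating the loss while staying near-identity for small values. A minimal sketch; the cap value is illustrative, not taken from the submission:

```python
import numpy as np

def softcap(logits: np.ndarray, cap: float = 15.0) -> np.ndarray:
    # Smoothly squash logits into (-cap, cap). For |x| << cap this is
    # approximately the identity; for large |x| it saturates at +/- cap,
    # keeping gradients finite. cap=15.0 is an illustrative choice.
    return cap * np.tanh(logits / cap)
```

Because `tanh(x/cap) ≈ x/cap` for small inputs, ordinary logits pass through nearly unchanged, while a runaway logit of 1e6 is clipped to just under the cap.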
Sequence Length
sequence_length
train_length: 32768
eval_length: null
Compression
brotli
level: null

Novel Contributions

  • SpinQuant V1 ported to Stage 3 banked architecture with per-slot rotation baked into weights
  • Composition of SpinQuant with MP-SGD-TTT
  • Reduced quantization error by suppressing outliers before INT6 GPTQ
  • Record validation BPB, improving on the prior submission