PR #1898

Status: open

Record: Partial SpinQuant (start_layer=5) + PR#1851 Stack — val_bpb 1.06614 (3-seed mean)

by X-Abhishek-X
val_bpb
1.0661
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15.63MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: embeddings and model weights
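
For concreteness, here is a minimal sketch of the signed 6-bit grid that GPTQ quantizes onto. Plain round-to-nearest is shown for brevity, whereas GPTQ proper chooses the rounded values with Hessian-weighted error compensation; the per-row scale layout is an assumption.

```python
import torch

def quantize_6bit(w: torch.Tensor):
    """Symmetric per-row round-to-nearest onto a signed 6-bit grid.

    Stand-in for GPTQ's target grid only; real GPTQ picks each rounding
    via Hessian-weighted updates rather than nearest-value.
    """
    qmax = 2 ** (6 - 1) - 1                                   # signed 6-bit: [-32, 31]
    scale = (w.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-12)
    q = torch.clamp(torch.round(w / scale), min=-qmax - 1, max=qmax)
    return q.to(torch.int8), scale                            # int8 storage, 6 bits used

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
```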
Architecture
SmearGate
Gated residual smearing used in the PR#1851 base stack.
parameters: null
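
The record doesn't spell out SmearGate's formulation. A minimal sketch under the assumption that it gates a one-token smear of the residual stream; the class name matches the record, everything else is illustrative.

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Assumed form of gated residual smearing: blend each token's hidden
    state with its predecessor via a learned per-channel gate."""

    def __init__(self, dim: int):
        super().__init__()
        # negative init keeps sigmoid(gate) small, so the module starts
        # near-identity (assumed design choice)
        self.gate = nn.Parameter(torch.full((dim,), -4.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); shift right so position t mixes in t-1
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)
        g = torch.sigmoid(self.gate)
        return (1 - g) * x + g * prev
```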
weight tying
Tied embeddings implied by the CaseOps SP8192 tokenizer base stack; not explicitly stated in the PR text, so inclusion via the canonical PR#1851 stack is uncertain.
parameters: null
Gated Attention
Sparse attention gating via SparseAttnGate.
parameters: {"scale":0.5}
Test-Time Training
score-first TTT
parameters: {"num_phases":3,"rank":80,"prefix_docs":2500}
Regularization
weight decay
parameters: {"weight_decay":0.5}
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85}
Optimizer
AdamW
weight_decay: 0.5
momentum: null
other_params: {"beta2":0.99,"ttt_beta2":0.99}
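
These settings map directly onto torch.optim.AdamW; beta1=0.9 is assumed since only beta2 is given, and ttt_beta2=0.99 would configure the separate TTT-phase optimizer the same way.

```python
import torch

def make_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    # beta2=0.99 and weight_decay=0.5 from the record; beta1=0.9 assumed.
    return torch.optim.AdamW(model.parameters(),
                             betas=(0.9, 0.99),
                             weight_decay=0.5)
```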
Initialization
OrthoInit
Hadamard pre-rotation / SpinQuant-style orthogonal rotation regenerated from seed at eval time.
Other
other
Partial SpinQuant with start_layer=5 applies Hadamard pre-rotation only to layers 5-10, reducing entropy overhead and enabling EMBED_BITS=6 under the 16MB cap.
parameters: {"start_layer":5,"layers_rotated":6,"modules_rotated":12}

Novel Contributions

  • Partial SpinQuant with SPINQUANT_START_LAYER=5
  • Hadamard pre-rotation applied only to layers 5-10 instead of all layers
  • Reduced brotli entropy overhead enough to make EMBED_BITS=6 fit under the 16MB cap
  • Zero serialized bytes for SpinQuant rotations by regenerating from seed at eval time
  • Score-first phased TTT stack combined with PR#1851 base and PR#1855 hparams