PR #1898

Status: open

Record: Partial SpinQuant (start_layer=5) + PR#1851 Stack — val_bpb 1.06614 (3-seed mean)

by X-Abhishek-X
val_bpb
1.0661
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15.63MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: embeddings and model weights
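
For concreteness, here is a minimal sketch of the signed 6-bit grid that GPTQ quantizes onto. Plain round-to-nearest is shown for brevity, whereas GPTQ proper chooses the rounded values with Hessian-weighted error compensation; the per-row scale layout is an assumption.

```python
import torch

def quantize_6bit(w: torch.Tensor):
    """Symmetric per-row round-to-nearest onto a signed 6-bit grid.

    Stand-in for GPTQ's target grid only; real GPTQ picks each rounding
    via Hessian-weighted updates rather than nearest-value.
    """
    qmax = 2 ** (6 - 1) - 1                                   # signed 6-bit: [-32, 31]
    scale = (w.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-12)
    q = torch.clamp(torch.round(w / scale), min=-qmax - 1, max=qmax)
    return q.to(torch.int8), scale                            # int8 storage, 6 bits used

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
```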
Architecture
SmearGate
Gated residual smearing used in the PR#1851 base stack.
parameters: null
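
The record doesn't spell out SmearGate's formulation. A minimal sketch under the assumption that it gates a one-token smear of the residual stream; the class name matches the record, everything else is illustrative.

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Assumed form of gated residual smearing: blend each token's hidden
    state with its predecessor via a learned per-channel gate."""

    def __init__(self, dim: int):
        super().__init__()
        # negative init keeps sigmoid(gate) small, so the module starts
        # near-identity (assumed design choice)
        self.gate = nn.Parameter(torch.full((dim,), -4.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); shift right so position t mixes in t-1
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)
        g = torch.sigmoid(self.gate)
        return (1 - g) * x + g * prev
```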
weight tying
Tied embeddings implied by the CaseOps SP8192 tokenizer base stack; not explicitly stated in the PR text, so inclusion via the canonical PR#1851 stack is uncertain.
parameters: null
Gated Attention
Sparse attention gating via SparseAttnGate.
parameters: {"scale":0.5}
Test-Time Training
score-first TTT
parameters: {"num_phases":3,"rank":80,"prefix_docs":2500}
Regularization
weight decay
parameters: {"weight_decay":0.5}
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85}
Optimizer
AdamW
weight_decay: 0.5
momentum: null
other_params: {"beta2":0.99,"ttt_beta2":0.99}
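
These settings map directly onto torch.optim.AdamW; beta1=0.9 is assumed since only beta2 is given, and ttt_beta2=0.99 would configure the separate TTT-phase optimizer the same way.

```python
import torch

def make_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    # beta2=0.99 and weight_decay=0.5 from the record; beta1=0.9 assumed.
    return torch.optim.AdamW(model.parameters(),
                             betas=(0.9, 0.99),
                             weight_decay=0.5)
```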
Initialization
OrthoInit
Hadamard pre-rotation / SpinQuant-style orthogonal rotation regenerated from seed at eval time.
Other
other
Partial SpinQuant with start_layer=5 applies Hadamard pre-rotation only to layers 5-10, reducing entropy overhead and enabling EMBED_BITS=6 under the 16MB cap.
parameters: {"start_layer":5,"layers_rotated":6,"modules_rotated":12}

Novel Contributions

  • Partial SpinQuant with SPINQUANT_START_LAYER=5
  • Hadamard pre-rotation applied only to layers 5-10 instead of all layers
  • Reduced brotli entropy overhead enough to make EMBED_BITS=6 fit under the 16MB cap
  • Zero serialized bytes for SpinQuant rotations by regenerating from seed at eval time
  • Score-first phased TTT stack combined with PR#1851 base and PR#1855 hparams