PR #1718

open

Non-record: Eval-time lever ablations on SP8192 absolute-RoPE stack (companion to PR #1716)

by himanshudongre
val_bpb: 1.0788
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.99 MB

Training Techniques

Architecture
BigramHash
Reduced BigramHashEmbedding projection dimension to 32 for a small but consistent gain and smaller parameter footprint.
parameters: {"dimensions":32}
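To illustrate what a 32-wide projection dimension means here, below is a minimal stdlib-only sketch of a bigram-hash embedding lookup. The bucket count, hash multiplier, and table layout are assumptions for illustration, not the PR's actual implementation.

```python
# Hypothetical sketch of a bigram-hash embedding with a reduced
# projection dimension (32). All constants are illustrative.
PROJ_DIM = 32
N_BUCKETS = 65536

def bigram_bucket(prev_tok, cur_tok, n_buckets=N_BUCKETS):
    # Hash the (previous, current) token pair into one bucket.
    return (prev_tok * 1000003 + cur_tok) % n_buckets

def embed_sequence(ids, table):
    # Look up a PROJ_DIM-wide row for each consecutive token pair;
    # a learned projection would then map each row up to the model width.
    rows, prev = [], 0
    for t in ids:
        rows.append(table[bigram_bucket(prev, t)])
        prev = t
    return rows
```

Shrinking PROJ_DIM shrinks the dominant cost, the `N_BUCKETS × PROJ_DIM` table, which is where the smaller parameter footprint comes from.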
Quantization
mixed int8/int6
bits: 8
scope: control tensors, small matrices, tok_emb
QAT
bits: 6
scope: matrices only
GPTQ
bits: 5
scope: matrix weights
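The 8-, 6-, and 5-bit widths above all reduce to the same core operation. A minimal sketch of symmetric round-to-nearest quantization at a given bit width follows; this is an assumed generic scheme, not the PR's exact quantizer or GPTQ itself.

```python
def quantize_symmetric(weights, bits):
    """Symmetric round-to-nearest quantization to signed `bits`-bit ints.

    Returns (int_codes, scale); dequantize with code * scale.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for int8, 31 for int6, 15 for int5
    scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]
```

GPTQ refines this by choosing codes to minimize layer output error rather than per-weight rounding error, but the bit-width/scale bookkeeping is the same.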
Compression
lzma
level: null
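Assuming this entry refers to stdlib `lzma` (where `level: null` would mean the library's default preset), the artifact packing step looks roughly like:

```python
import lzma

def pack(raw: bytes, preset=None) -> bytes:
    # preset=None falls back to lzma's default compression level,
    # consistent with a "level: null" setting.
    return lzma.compress(raw, preset=preset)

def unpack(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```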
Test-Time Training
score-first TTT
parameters: {"epochs":4}
Evaluation
sliding window eval
parameters: {"stride":128,"context_length":4096}
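With a stride of 128 over a 4096-token context, each window re-reads mostly old context but scores only the tokens past the previous window's end. A stdlib sketch of that bookkeeping (the exact convention is an assumption):

```python
def sliding_windows(n_tokens, context_length=4096, stride=128):
    """Yield (ctx_start, ctx_end, score_start) triples.

    Each window conditions on up to `context_length` tokens, but only
    tokens in [score_start, ctx_end) contribute fresh log-probs, so no
    token is scored twice.
    """
    windows, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_length, n_tokens)
        windows.append((begin, end, max(prev_end, begin)))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```

A smaller stride gives each scored token more preceding context at the cost of proportionally more forward passes.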
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
Weight Averaging
SWA
parameters: {"window":1024}
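SWA over a 1024-step window amounts to uniformly averaging the checkpoints inside that window. A minimal sketch with plain dicts standing in for state dicts (the window semantics here are an assumption):

```python
def average_checkpoints(checkpoints):
    """Uniformly average a list of state dicts (name -> list of floats),
    as stochastic weight averaging does over a trailing window."""
    n = len(checkpoints)
    return {
        key: [sum(ckpt[key][i] for ckpt in checkpoints) / n
              for i in range(len(checkpoints[0][key]))]
        for key in checkpoints[0]
    }
```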
Other
other
Adaptive Hadamard rotation before GPTQ to test whether random orthogonalization reduces quantization error on Muon-trained weights.
parameters: null
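The rotate-then-quantize idea relies on the Hadamard transform being orthogonal: the rotation itself loses nothing, while spreading outliers across coordinates before GPTQ. A stdlib fast Walsh-Hadamard transform sketch follows; the "adaptive" part of the PR's variant is not reproduced here.

```python
def fwht(vec):
    """Fast Walsh-Hadamard transform; length must be a power of 2.

    Applying it twice and dividing by the length recovers the input,
    which is the orthogonality the rotate-then-quantize trick relies on.
    """
    v = list(vec)
    n, h = len(v), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = v[j], v[j + h]
                v[j], v[j + h] = x + y, x - y
        h *= 2
    return v
```

The PR's null result suggests Muon-trained weights are already outlier-free enough (sub-Gaussian, per the contributions below) that this spreading has nothing left to fix.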

Novel Contributions

  • Structured ablation writeup documenting which eval-time levers helped or failed on the SP8192 absolute-RoPE stack.
  • Path A v3 passthrough quantization that reduced artifact size to fit under the 16 MB cap with no measured bpb cost.
  • Demonstration that a longer eval sequence length, SWA, and a longer training sequence length all regress under sliding-window scoring for the same architectural reason.
  • Identification of an incompatibility between QAT and score-first TTT on this stack.
  • Null result showing adaptive Hadamard GPTQ does not help on Muon-trained sub-Gaussian weights.
  • Argument that relative-position attention methods like ALiBi or NoPE are the correct next direction.