PR #1718

open

Non-record: Eval-time lever ablations on SP8192 absolute-RoPE stack (companion to PR #1716)

by himanshudongre
val_bpb: 1.0788
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.99 MB

Training Techniques

Architecture
BigramHash
Reduced BigramHashEmbedding projection dimension to 32 for a small but consistent gain and smaller parameter footprint.
parameters: {"dimensions":32}
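To illustrate what a 32-wide projection dimension means here, below is a minimal stdlib-only sketch of a bigram-hash embedding lookup. The bucket count, hash multiplier, and table layout are assumptions for illustration, not the PR's actual implementation.

```python
# Hypothetical sketch of a bigram-hash embedding with a reduced
# projection dimension (32). All constants are illustrative.
PROJ_DIM = 32
N_BUCKETS = 65536

def bigram_bucket(prev_tok, cur_tok, n_buckets=N_BUCKETS):
    # Hash the (previous, current) token pair into one bucket.
    return (prev_tok * 1000003 + cur_tok) % n_buckets

def embed_sequence(ids, table):
    # Look up a PROJ_DIM-wide row for each consecutive token pair;
    # a learned projection would then map each row up to the model width.
    rows, prev = [], 0
    for t in ids:
        rows.append(table[bigram_bucket(prev, t)])
        prev = t
    return rows
```

Shrinking PROJ_DIM shrinks the dominant cost, the `N_BUCKETS × PROJ_DIM` table, which is where the smaller parameter footprint comes from.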
Quantization
mixed int8/int6
bits: 8
scope: control tensors, small matrices, tok_emb
QAT
bits: 6
scope: matrices only
GPTQ
bits: 5
scope: matrix weights
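The 8-, 6-, and 5-bit widths above all reduce to the same core operation. A minimal sketch of symmetric round-to-nearest quantization at a given bit width follows; this is an assumed generic scheme, not the PR's exact quantizer or GPTQ itself.

```python
def quantize_symmetric(weights, bits):
    """Symmetric round-to-nearest quantization to signed `bits`-bit ints.

    Returns (int_codes, scale); dequantize with code * scale.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for int8, 31 for int6, 15 for int5
    scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]
```

GPTQ refines this by choosing codes to minimize layer output error rather than per-weight rounding error, but the bit-width/scale bookkeeping is the same.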
Compression
lzma
level: null
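Assuming this entry refers to stdlib `lzma` (where `level: null` would mean the library's default preset), the artifact packing step looks roughly like:

```python
import lzma

def pack(raw: bytes, preset=None) -> bytes:
    # preset=None falls back to lzma's default compression level,
    # consistent with a "level: null" setting.
    return lzma.compress(raw, preset=preset)

def unpack(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```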
Test-Time Training
score-first TTT
parameters: {"epochs":4}
Evaluation
sliding window eval
parameters: {"stride":128,"context_length":4096}
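With a stride of 128 over a 4096-token context, each window re-reads mostly old context but scores only the tokens past the previous window's end. A stdlib sketch of that bookkeeping (the exact convention is an assumption):

```python
def sliding_windows(n_tokens, context_length=4096, stride=128):
    """Yield (ctx_start, ctx_end, score_start) triples.

    Each window conditions on up to `context_length` tokens, but only
    tokens in [score_start, ctx_end) contribute fresh log-probs, so no
    token is scored twice.
    """
    windows, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_length, n_tokens)
        windows.append((begin, end, max(prev_end, begin)))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```

A smaller stride gives each scored token more preceding context at the cost of proportionally more forward passes.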
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
Weight Averaging
SWA
parameters: {"window":1024}
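SWA over a 1024-step window amounts to uniformly averaging the checkpoints inside that window. A minimal sketch with plain dicts standing in for state dicts (the window semantics here are an assumption):

```python
def average_checkpoints(checkpoints):
    """Uniformly average a list of state dicts (name -> list of floats),
    as stochastic weight averaging does over a trailing window."""
    n = len(checkpoints)
    return {
        key: [sum(ckpt[key][i] for ckpt in checkpoints) / n
              for i in range(len(checkpoints[0][key]))]
        for key in checkpoints[0]
    }
```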
Other
other
Adaptive Hadamard rotation before GPTQ to test whether random orthogonalization reduces quantization error on Muon-trained weights.
parameters: null
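The rotate-then-quantize idea relies on the Hadamard transform being orthogonal: the rotation itself loses nothing, while spreading outliers across coordinates before GPTQ. A stdlib fast Walsh-Hadamard transform sketch follows; the "adaptive" part of the PR's variant is not reproduced here.

```python
def fwht(vec):
    """Fast Walsh-Hadamard transform; length must be a power of 2.

    Applying it twice and dividing by the length recovers the input,
    which is the orthogonality the rotate-then-quantize trick relies on.
    """
    v = list(vec)
    n, h = len(v), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = v[j], v[j + h]
                v[j], v[j + h] = x + y, x - y
        h *= 2
    return v
```

The PR's null result suggests Muon-trained weights are already outlier-free enough (sub-Gaussian, per the contributions below) that this spreading has nothing left to fix.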

Novel Contributions

  • Structured ablation writeup documenting which eval-time levers helped or failed on the SP8192 absolute-RoPE stack.
  • Path A v3 passthrough quantization that reduced artifact size to fit under the 16 MB cap with no measured bpb cost.
  • Demonstration that a longer eval sequence length, SWA, and a longer training sequence length all regress under sliding-window scoring for the same architectural reason.
  • Identification of an incompatibility between QAT and score-first TTT on this stack.
  • Null result showing adaptive Hadamard GPTQ does not help on Muon-trained sub-Gaussian weights.
  • Argument that relative-position attention methods like ALiBi or NoPE are the correct next direction.