PR #2022
openRecord: SP10240 + SimCTG + QAHSP + post-quant TTT — 1.07197 ttt-sliding-window (3-seed mean, std 0.00023)
by BharathSShankar
val_bpb
1.0720
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.96 MB
Training Techniques
Architecture
depth recurrence
11-layer model in which layers 3-5 form a recurrent block that is looped 3 times.
parameters: {"layers":11,"recurrence_loops":3,"recurrence_range":"3-5"}
Parallel Residuals
Parallel residual connections introduced from layer 7 onward.
parameters: {"start_layer":7}
LeakyReLU
Squared LeakyReLU with negative slope 0.5 is used as the SwiGLU gate activation.
parameters: {"negative_slope":0.5}
Partial RoPE
Partial rotary positional embeddings applied to 16 of the 64 head dimensions; the remaining 48 are left unrotated.
parameters: {"dimensions":"16/64"}
XSA
XSA attention used in all layers.
parameters: {"layers":11}
weight tying
Input and output embeddings are tied.
parameters: null
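Weight tying shares one matrix between the input embedding and the output projection, which matters at this artifact size. A minimal sketch (d_model is illustrative):

```python
import torch.nn as nn

vocab, d_model = 10240, 512            # d_model is an assumption
embed = nn.Embedding(vocab, d_model)
lm_head = nn.Linear(d_model, vocab, bias=False)
lm_head.weight = embed.weight          # single shared parameter matrix
```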
KV head count
Model uses 8 query heads and 4 KV heads (grouped-query attention, 2 query heads per KV head).
parameters: {"heads":8,"kv_heads":4}
tokenizer
SP10240 tokenizer.
parameters: {"vocab_size":10240}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"Polar Express NS Muon"}
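Muon orthogonalizes each 2-D weight update with a Newton-Schulz iteration before applying it; "Polar Express" refers to an optimized per-step coefficient schedule for that iteration. A sketch of the standard quintic iteration (the fixed coefficients below are the common Muon ones, not the Polar Express schedule):

```python
import torch

def newton_schulz5(G, steps=5, eps=1e-7):
    """Approximate the orthogonal polar factor of G with a quintic
    Newton-Schulz iteration (standard Muon coefficients)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```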
Regularization
SimCTG
parameters: {"lambda":0.3,"margin":0.4}
Quantization
STE QAT
bits: 6
scope: activations
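A sketch of straight-through fake quantization for 6-bit activations: round to the int6 grid in the forward pass, let gradients pass through unchanged in the backward pass. The per-tensor absmax scale is an assumption:

```python
import torch

def fake_quant_ste(x, bits=6):
    """QAT for activations: forward uses the quantized value,
    backward treats the rounding as identity (STE)."""
    qmax = 2 ** (bits - 1) - 1                          # 31 for int6
    scale = x.detach().abs().amax().clamp(min=1e-8) / qmax
    xq = (x / scale).round().clamp(-qmax - 1, qmax) * scale
    return x + (xq - x).detach()                        # STE trick
```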
GPTQ
bits: 6
scope: matrices
GPTQ
bits: 7
scope: token embeddings
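Both GPTQ entries (6-bit matrices, 7-bit token embeddings) share the same core update. A heavily simplified sketch, with no blocking or grouping; H is the input-covariance Hessian accumulated on calibration data, and the per-row absmax scale is an assumption:

```python
import torch

def gptq_quantize(W, H, bits=6, damp=0.01):
    """Simplified GPTQ: quantize W column by column, propagating each
    column's rounding error into later columns via H^-1."""
    cols = W.size(1)
    H = H + damp * H.diagonal().mean() * torch.eye(cols)
    Hinv = torch.linalg.cholesky(
        torch.cholesky_inverse(torch.linalg.cholesky(H)), upper=True)
    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    W, Q = W.clone(), torch.empty_like(W)
    for i in range(cols):
        q = (W[:, i:i+1] / scale).round().clamp(-qmax - 1, qmax) * scale
        Q[:, i:i+1] = q
        err = (W[:, i:i+1] - q) / Hinv[i, i]
        W[:, i:] -= err @ Hinv[i:i+1, i:]    # error feedback
    return Q
```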
Evaluation
sliding window eval
parameters: {"stride":64}
Test-Time Training
score-first TTT
parameters: {"enabled":true,"epochs":1,"learning_rate":0.005}
Compression
brotli
level: null
lzma
level: null
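Compression levels are not recorded, and whether the two codecs are chained or alternatives is not stated. A sketch assuming best-of-both at maximum settings:

```python
import lzma
import brotli   # third-party: pip install brotli

def compress_artifact(blob: bytes) -> bytes:
    """Compress the serialized weights with both codecs and keep the
    smaller result (levels shown are assumptions, not the record's)."""
    candidates = [
        brotli.compress(blob, quality=11),
        lzma.compress(blob, preset=9),
    ]
    return min(candidates, key=len)
```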
Sequence Length
sequence_length
train_length: null
eval_length: null
Novel Contributions
- QAHSP quant-aware activation regularizer pushing hidden states onto an int6 grid during training (a hypothetical sketch follows this list)
- Post-quant test-time training on already-graded eval tokens after the legal pre-quant grading pass
- Bug fix to eval_val_ttt enabling post-quant TTT to complete
- Record 3-seed mean result with low standard deviation
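QAHSP is this record's own technique, described only by the one line above. The sketch below is a hypothetical reconstruction of that description; everything beyond "penalize hidden states' distance from an int6 grid" (the scale choice, the loss form, the weight) is a guess:

```python
import torch

def qahsp_penalty(hidden, bits=6):
    """Hypothetical QAHSP reconstruction: penalize each hidden state's
    squared distance to its nearest point on a symmetric int6 grid, so
    activations sit near representable values before post-training
    quantization. Scale and loss form are assumptions."""
    qmax = 2 ** (bits - 1) - 1
    scale = hidden.detach().abs().amax().clamp(min=1e-8) / qmax
    grid = (hidden / scale).round().clamp(-qmax - 1, qmax) * scale
    return (hidden - grid.detach()).pow(2).mean()

# loss = mle_loss + alpha * qahsp_penalty(h)   # alpha not recorded
```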