PR #714
Status: open
Add 11L RotaryFix + LegalTTT + BIGRAM3072 — val_bpb 1.11869 (3-seed m…
by UpsallaView on GitHub
val_bpb: 1.1187
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~16.06 MB
Training Techniques
Architecture
RoPE
Fixes a Rotary NTK-scaling bug: train_seq_len=2048 is now correctly propagated to both base_model and eval_model instead of being hardcoded to 1024.
parameters: {"train_seq_len":2048}
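A minimal sketch of the fixed behavior (my reconstruction, not the PR's code), assuming standard NTK-aware rotary base scaling; the point is that `train_seq_len` must carry the real training length (2048), not a hardcoded 1024:

```python
import torch

def rope_angles(head_dim: int, seq_len: int, train_seq_len: int = 2048,
                base: float = 10000.0) -> torch.Tensor:
    # NTK-aware scaling: when the eval context exceeds the training
    # context, stretch the rotary base so low frequencies interpolate.
    # Passing the wrong train_seq_len (e.g. a hardcoded 1024) silently
    # changes these angles -- the class of bug this PR fixes.
    if seq_len > train_seq_len:
        scale = seq_len / train_seq_len
        base = base * scale ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)
    return torch.outer(torch.arange(seq_len).float(), inv_freq)
```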
BigramHash
Expanded bigram vocabulary size for the hash-based bigram component.
parameters: {"vocab_size":3072}
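A hedged sketch of a hash-based bigram component with the expanded bucket count; the mixing constant, embedding dimension, and class layout are illustrative assumptions, only the 3072 bucket count comes from the PR:

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    # Hash each (previous token, current token) pair into a small bucket
    # table; 3072 is the expanded bucket count from this PR. The mixing
    # constant and embedding dim are illustrative assumptions.
    def __init__(self, n_buckets: int = 3072, dim: int = 64):
        super().__init__()
        self.n_buckets = n_buckets
        self.emb = nn.Embedding(n_buckets, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        prev = torch.roll(tokens, shifts=1, dims=-1)
        prev[..., 0] = 0  # first position has no predecessor
        h = (prev * 1000003 + tokens) % self.n_buckets
        return self.emb(h)
```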
MLP3x
Uses a 3x-expansion MLP with a squared LeakyReLU activation (negative slope 0.5).
parameters: null
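A minimal sketch of such a block, assuming bias-free linears and the usual up/down layer naming (assumptions, not the PR's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP3x(nn.Module):
    # 3x-expansion MLP whose activation is LeakyReLU(negative_slope=0.5)
    # squared; layer names and the bias-free choice are assumptions.
    def __init__(self, dim: int):
        super().__init__()
        self.up = nn.Linear(dim, 3 * dim, bias=False)
        self.down = nn.Linear(3 * dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.leaky_relu(self.up(x), negative_slope=0.5)
        return self.down(h * h)  # squared activation, cf. ReLU^2 MLPs
```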
XSA
XSA applied to the last 4 layers.
parameters: {"last_n_layers":4}
Partial RoPE
Partial rotary positional embeddings using 16 of 64 dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
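A sketch of partial rotary application with the stated 16-of-64 split, assuming the rotated dims are the leading ones and an interleaved pair layout (both assumptions):

```python
import torch

def apply_partial_rope(x: torch.Tensor, angles: torch.Tensor,
                       rot_dims: int = 16) -> torch.Tensor:
    # Rotate only the first 16 of the 64 head dims; the remaining dims
    # pass through position-independent. `angles`: (seq, rot_dims/2).
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    rotated = torch.stack((x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos), dim=-1).flatten(-2)
    return torch.cat((rotated, x_pass), dim=-1)
```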
VE128
VE enabled in layers 9-10.
parameters: {"dim":128,"layers":[9,10]}
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025}
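For orientation, a single Muon-style update step with the listed hyperparameters. This is a readable stand-in, not the PR's optimizer: real (Parallel) Muon orthogonalizes the momentum buffer with a Newton-Schulz iteration, sharded across devices; here SVD is used because it has the same fixed point (UVᵀ):

```python
import torch

def muon_step(param: torch.Tensor, grad: torch.Tensor, buf: torch.Tensor,
              lr: float = 0.025, momentum: float = 0.99,
              weight_decay: float = 0.04) -> torch.Tensor:
    # Momentum-accumulate the gradient, orthogonalize the 2D update,
    # then apply it with decoupled weight decay. SVD stands in for the
    # Newton-Schulz iteration used by the real optimizer.
    buf.mul_(momentum).add_(grad)
    U, _, Vh = torch.linalg.svd(buf, full_matrices=False)
    update = U @ Vh
    param.mul_(1 - lr * weight_decay).add_(update, alpha=-lr)
    return param
```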
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"every":50}
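The EMA half can be sketched as a per-step lerp of a shadow model toward the live weights (decay=0.997 from this PR); SWA would instead snapshot every 50 steps into an equal-weight running average:

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(ema_model: nn.Module, model: nn.Module,
               decay: float = 0.997) -> None:
    # Exponential moving average of weights: shadow <- decay * shadow
    # + (1 - decay) * live, applied after each optimizer step.
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.lerp_(p, 1.0 - decay)
```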
Quantization
GPTQ-lite
bits: 6
scope: all
QAT
bits: 6
scope: all
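The QAT half can be sketched as 6-bit fake quantization with a straight-through estimator; symmetric per-tensor scaling is my assumption, and GPTQ-lite would additionally correct quantization error at export time:

```python
import torch

def fake_quant(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    # Symmetric per-tensor fake quantization with a straight-through
    # estimator: the forward pass sees 6-bit weights, the backward pass
    # passes gradients through unchanged.
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax, qmax) * scale
    return w + (w_q - w).detach()  # STE: identity gradient
```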
Evaluation
sliding window eval
parameters: {"chunk_size":32768}
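A sketch of chunked evaluation with the PR's chunk_size, assuming `model(chunk)` returns mean next-token cross-entropy in nats (an assumption about the interface); converting bits/token to bits/byte would additionally need the tokens-to-bytes ratio:

```python
import math
import torch

@torch.no_grad()
def sliding_eval(model, tokens: torch.Tensor,
                 chunk_size: int = 32768) -> float:
    # Walk a long token stream in 32768-token chunks and aggregate a
    # length-weighted mean loss, converted from nats to bits per token.
    total_nats, total_tokens = 0.0, 0
    for start in range(0, tokens.numel() - 1, chunk_size):
        chunk = tokens[start:start + chunk_size + 1]  # +1 target overlap
        n = chunk.numel() - 1
        total_nats += model(chunk).item() * n
        total_tokens += n
    return total_nats / total_tokens / math.log(2)
```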
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs_per_chunk":3,"optimizer":"SGD","momentum":0.9,"freeze_blocks":0}
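The score-first loop can be sketched as follows; helper signatures are assumptions, only the hyperparameters come from the PR. The key invariant is that a chunk is scored before the model ever trains on it, so no chunk's score benefits from its own tokens:

```python
import torch

def score_first_ttt(model, chunks, loss_fn, lr: float = 2e-3,
                    epochs_per_chunk: int = 3):
    # "Legal" TTT: score each chunk with the current weights FIRST,
    # then train on that already-scored chunk before moving on.
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    scores = []
    for chunk in chunks:
        with torch.no_grad():  # per the PR: no_grad, not inference_mode
            scores.append(loss_fn(model, chunk).item())
        for _ in range(epochs_per_chunk):
            opt.zero_grad()
            loss_fn(model, chunk).backward()
            opt.step()
    return scores
```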
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
cosine decay
parameters: {"warmdown_steps":3500}
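One plausible reading of this schedule is a constant LR followed by a cosine warmdown over the final 3500 steps; the exact shape is an interpretation of the listed parameter, not confirmed by the PR:

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float = 0.025,
          warmdown_steps: int = 3500) -> float:
    # Constant LR until the warmdown window, then cosine decay to zero
    # over the final warmdown_steps steps.
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    frac = (step - start) / warmdown_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```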
Regularization
layerwise LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
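The formula can be wrapped as a depth-indexed norm, so layer 0 is unscaled and deeper layers contribute progressively less; the module wrapper is my construction, the PR only gives the formula:

```python
import torch
import torch.nn as nn

class ScaledLayerNorm(nn.Module):
    # LayerNorm whose output is damped by 1/sqrt(layer + 1).
    def __init__(self, dim: int, layer: int):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.scale = 1.0 / (layer + 1) ** 0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * self.ln(x)
```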
Other
other
Late QAT threshold tuning to extend quantization-aware adaptation time.
parameters: {"threshold":0.57}
Novel Contributions
- Fixed a Rotary NTK-scaling bug by correctly propagating train_seq_len=2048 to both training and evaluation models.
- Applied a previously unreported eval_model Rotary fix affecting the causal TTT scoring window.
- Increased BigramHash vocabulary size to 3072.
- Raised the late QAT threshold to 0.57 to allow substantially more QAT steps.
- Used torch.no_grad() instead of torch.inference_mode() during TTT scoring to avoid Autograd graph corruption across RoPE cache boundaries.
- Introduced Legal TTT: score-first, backward-looking test-time training on already-scored chunks.
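The no_grad-versus-inference_mode point above can be demonstrated in isolation (a standalone toy, not the PR's code): a cache built under `torch.inference_mode()` yields inference tensors that cannot later enter an autograd graph, whereas a `torch.no_grad()` cache can:

```python
import torch

x = torch.randn(4)

# A cache built under inference_mode produces inference tensors that
# cannot participate in autograd later -- e.g. a RoPE cache reused
# while TTT needs gradients.
with torch.inference_mode():
    cache_inf = x * 2.0
failed = False
try:
    y = cache_inf * torch.randn(4, requires_grad=True)
    y.sum().backward()
except RuntimeError:
    failed = True  # inference tensors cannot be saved for backward

# Under no_grad the cache is an ordinary tensor and reuse is safe.
with torch.no_grad():
    cache_ok = x * 2.0
z = cache_ok * torch.randn(4, requires_grad=True)
z.sum().backward()
```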