PR #1280

Status: open

Record: AR Self-Gen GPTQ + XSA-11 + BigramHash3072x112 (mean 1.1156)

by aamodbhatt
val_bpb: 1.1156
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.9 MB

Training Techniques

Quantization
GPTQ-lite
bits: 6
scope: all
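The record does not spell out the GPTQ-lite procedure; full GPTQ also applies Hessian-based error compensation per column. As a minimal sketch of just the storage format (symmetric round-to-nearest int6 with one scale per channel, so levels span [-31, 31]):

```python
import numpy as np

def quantize_int6(w, axis=0):
    """Symmetric round-to-nearest 6-bit quantization with a per-channel scale.
    A stand-in for the storage format only; GPTQ proper additionally
    redistributes rounding error into not-yet-quantized weights."""
    scale = np.abs(w).max(axis=axis, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero channels
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16)).astype(np.float32)
q, s = quantize_int6(w)
err = np.abs(dequantize(q, s) - w).max()
```

Round-to-nearest keeps the worst-case per-weight error at half a quantization step, i.e. at most `0.5 * scale` for that channel.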
Architecture
BigramHash
Bigram hash embedding component used in the model stack.
parameters: {"size":1536}
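The idea of a bigram hash embedding is to give each (previous token, current token) pair its own learned row via hashing into a fixed table (size 1536 per the parameters above). The mixing hash below is illustrative; the PR's exact hash is not specified:

```python
import numpy as np

TABLE_SIZE = 1536  # from parameters: {"size": 1536}

def bigram_bucket(prev_tok, tok, table_size=TABLE_SIZE):
    """Map a (prev, current) token pair to an embedding row with a cheap
    multiplicative mixing hash (hash choice is an assumption)."""
    h = (prev_tok * 1000003 + tok) * 2654435761 % (2 ** 32)
    return h % table_size

def bigram_embed(tokens, table, bos_id=0):
    """Look up one hashed-bigram row per position; position 0 pairs the
    first token with a BOS id."""
    prev = [bos_id] + list(tokens[:-1])
    idx = [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
    return table[idx]

table = np.random.default_rng(0).normal(size=(TABLE_SIZE, 32))
emb = bigram_embed([5, 17, 9], table)
```

Collisions are tolerated by design: the table is tiny next to the vocab-squared bigram space, and training routes around clashes.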
XSA
XSA attention component applied to the last layers.
parameters: {"last_n_layers":4}
MLP3x
Three-layer MLP block with LeakyReLU^2 activation.
parameters: null
RoPE
Partial rotary positional embeddings.
parameters: {"dimensions":16,"base_dimensions":64}
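Partial RoPE rotates only the first 16 of each head's 64 channels and passes the rest through untouched, matching the dimensions/base_dimensions above. The half-split pairing convention below is an assumption:

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotary position embedding on the first `rot_dims` channels of one
    head vector; remaining channels are identity. `base` and the pairing
    convention (first half with second half) are assumptions."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:half], x[half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rotated, x[rot_dims:]])

x = np.random.default_rng(1).normal(size=64)
y = partial_rope(x, pos=7)
```

Because each channel pair undergoes a pure rotation, the norm of the rotated slice is preserved exactly.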
VE128
Value residual component in selected layers.
parameters: {"layers":[9,10],"dimension":128}
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997,"swa_every":50}
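The two averages compose naturally: an EMA of the live weights (decay 0.997) plus a uniform average of snapshots taken every 50 steps. Which checkpoints enter the "tight" SWA window is not stated, so the sketch below averages every 50th step uniformly:

```python
def ema_update(ema, params, decay=0.997):
    """Exponential moving average of weights (ema_decay = 0.997)."""
    return [decay * e + (1 - decay) * p for e, p in zip(ema, params)]

class TightSWA:
    """Running uniform average of snapshots taken every `every` steps
    (swa_every = 50). The snapshot-selection window is an assumption."""
    def __init__(self, every=50):
        self.every, self.n, self.avg = every, 0, None

    def maybe_update(self, step, params):
        if step % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = list(params)
        else:
            # incremental mean: avg += (p - avg) / n
            self.avg = [a + (p - a) / self.n for a, p in zip(self.avg, params)]

swa = TightSWA(every=50)
for step in range(0, 201):
    params = [float(step)]  # toy 1-parameter "model"
    swa.maybe_update(step, params)
```

The incremental-mean form avoids storing all snapshots; after steps 0..200 the toy average is the mean of steps 0, 50, 100, 150, 200.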
Compression
lzma
level: 7
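The ~15.9 MB artifact is the quantized weight payload run through standard LZMA; in Python this is a one-liner with the stdlib `lzma` module at preset 7:

```python
import lzma

# Stand-in for the packed int6 weight bytes.
payload = bytes(range(256)) * 1000
blob = lzma.compress(payload, preset=7)     # level: 7 from the record
restored = lzma.decompress(blob)
```

LZMA is lossless, so decompression recovers the quantized weights bit-exactly; the only lossy step in the pipeline is the quantization itself.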
Evaluation
sliding window eval
parameters: {"stride":64}
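With a stride-64 sliding window, each evaluation window rescores a long context but only the final 64 tokens count toward val_bpb, so every token is scored exactly once with near-maximal left context. The window length below is a placeholder; only the stride comes from the record:

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Return (start, end, n_scored) spans for sliding-window eval: the
    model sees tokens [start, end) but only the last n_scored positions of
    each span contribute to the loss."""
    spans, pos = [], 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        spans.append((start, end, end - pos))
        pos = end
    return spans

spans = sliding_windows(200, window=128, stride=64)
```

Smaller strides raise eval cost (more forward passes) in exchange for more context per scored token, which is why stride is reported as a parameter.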
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"chunk_tokens":32768,"epochs":"2/3/4 adaptive"}
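"Score-first" means each chunk's NLL is recorded with the weights as they stand, and only afterwards does the model train on that chunk, so the reported loss never sees a model adapted on its own eval data. A single-process sketch (the DDP-wide NLL synchronization is omitted, and the NLL thresholds for the "2/3/4 adaptive" epoch rule are illustrative, not from the PR):

```python
def score_first_ttt(chunks, nll_fn, adapt_fn, lr=0.002):
    """Legal TTT loop: score each chunk BEFORE updating on it, then pick
    2/3/4 adaptation epochs from the chunk's NLL (thresholds are
    placeholders). Returns the token-weighted mean NLL."""
    total_nll, total_tokens = 0.0, 0
    for tokens in chunks:
        nll = nll_fn(tokens)                 # score with pre-update weights
        total_nll += nll * len(tokens)
        total_tokens += len(tokens)
        epochs = 2 if nll < 1.0 else (3 if nll < 1.3 else 4)
        for _ in range(epochs):
            adapt_fn(tokens, lr)             # then train on the chunk
    return total_nll / total_tokens

# Toy model: a single scalar NLL that drops by 0.1 per adaptation step.
state = {"nll": 1.5}
log = []
bpb = score_first_ttt(
    chunks=[[0] * 4, [0] * 4],
    nll_fn=lambda toks: state["nll"],
    adapt_fn=lambda toks, lr: (log.append(lr), state.update(nll=state["nll"] - 0.1)),
)
```

In the real run chunks are 32768 tokens and lr = 0.002 per the parameters above; the toy chunks just exercise the control flow.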
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"parallel":true,"ns_steps":3,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
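Muon's core step replaces the raw gradient matrix with an approximately orthogonalized one via a quintic Newton-Schulz iteration (ns_steps = 3 here). The coefficients below follow the widely used Muon reference implementation; momentum, warmup, and the parallel sharding are omitted:

```python
import numpy as np

def newton_schulz_orth(G, steps=3):
    """Approximately orthogonalize G: drive all singular values toward 1
    with a quintic Newton-Schulz iteration (coefficients from the common
    Muon implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius norm => singular values <= 1
    if X.shape[0] > X.shape[1]:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.shape[0] > G.shape[1]:
        X = X.T
    return X

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(3, 3)))
V, _ = np.linalg.qr(rng.normal(size=(3, 3)))
G = U @ np.diag([1.0, 0.5, 0.3]) @ V.T   # known singular values 1.0, 0.5, 0.3
s = np.linalg.svd(newton_schulz_orth(G, steps=3), compute_uv=False)
```

The iteration acts on each singular value independently, so after a few steps the spread of singular values is squeezed toward 1 without ever computing an SVD, which is what makes it cheap enough to reuse inside test-time training.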
LR Schedule
cosine decay
parameters: {"warmdown_iters":3500}
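A plausible reading of the schedule, given only warmdown_iters: hold the LR flat, then decay it over the final 3500 iterations along a cosine curve. The total iteration count and base LR below are placeholders:

```python
import math

def lr_at(step, total_iters=6000, base_lr=1.0, warmdown_iters=3500):
    """Constant LR, then a cosine-shaped decay to zero over the final
    `warmdown_iters` steps. total_iters and base_lr are assumptions;
    warmdown_iters = 3500 comes from the record."""
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr
    frac = (step - start) / warmdown_iters
    return base_lr * 0.5 * (1 + math.cos(math.pi * frac))

schedule = [lr_at(s) for s in (0, 2500, 6000)]
```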
Regularization
LN scale
parameters: {"rule":"1/sqrt(layer+1)"}
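The LN-scale rule fixes each layer's LayerNorm gain to a depth-dependent constant rather than learning it, damping deeper layers' contributions:

```python
import math

def ln_scale(layer_index):
    """Fixed per-layer LayerNorm gain: 1/sqrt(layer+1), per the record's rule."""
    return 1.0 / math.sqrt(layer_index + 1)

scales = [ln_scale(i) for i in range(12)]
```

So layer 0 keeps unit scale, layer 3 is scaled by 0.5, and the gains decrease monotonically with depth.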

Novel Contributions

  • Muon-style Newton-Schulz optimization applied to test-time training
  • Entropy-adaptive TTT epoch selection based on chunk NLL
  • Score-first legal TTT protocol with global NLL synchronization across DDP ranks
  • GPTQ-lite int6 quantization with lzma compression
  • Combined stack of BigramHash, XSA, partial RoPE, EMA, and Tight SWA