PR #1280

Status: open

Record: AR Self-Gen GPTQ + XSA-11 + BigramHash3072x112 (mean 1.1156)

by aamodbhatt
val_bpb: 1.1156
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.9 MB

Training Techniques

Quantization
GPTQ-lite
bits: 6
scope: all
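The record does not spell out the GPTQ-lite procedure; full GPTQ also applies Hessian-based error compensation per column. As a minimal sketch of just the storage format (symmetric round-to-nearest int6 with one scale per channel, so levels span [-31, 31]):

```python
import numpy as np

def quantize_int6(w, axis=0):
    """Symmetric round-to-nearest 6-bit quantization with a per-channel scale.
    A stand-in for the storage format only; GPTQ proper additionally
    redistributes rounding error into not-yet-quantized weights."""
    scale = np.abs(w).max(axis=axis, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero channels
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16)).astype(np.float32)
q, s = quantize_int6(w)
err = np.abs(dequantize(q, s) - w).max()
```

Round-to-nearest keeps the worst-case per-weight error at half a quantization step, i.e. at most `0.5 * scale` for that channel.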
Architecture
BigramHash
Bigram hash embedding component used in the model stack.
parameters: {"size":1536}
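The idea of a bigram hash embedding is to give each (previous token, current token) pair its own learned row via hashing into a fixed table (size 1536 per the parameters above). The mixing hash below is illustrative; the PR's exact hash is not specified:

```python
import numpy as np

TABLE_SIZE = 1536  # from parameters: {"size": 1536}

def bigram_bucket(prev_tok, tok, table_size=TABLE_SIZE):
    """Map a (prev, current) token pair to an embedding row with a cheap
    multiplicative mixing hash (hash choice is an assumption)."""
    h = (prev_tok * 1000003 + tok) * 2654435761 % (2 ** 32)
    return h % table_size

def bigram_embed(tokens, table, bos_id=0):
    """Look up one hashed-bigram row per position; position 0 pairs the
    first token with a BOS id."""
    prev = [bos_id] + list(tokens[:-1])
    idx = [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
    return table[idx]

table = np.random.default_rng(0).normal(size=(TABLE_SIZE, 32))
emb = bigram_embed([5, 17, 9], table)
```

Collisions are tolerated by design: the table is tiny next to the vocab-squared bigram space, and training routes around clashes.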
XSA
XSA attention component applied to the last layers.
parameters: {"last_n_layers":4}
MLP3x
Three-layer MLP block with LeakyReLU^2 activation.
parameters: null
RoPE
Partial rotary positional embeddings.
parameters: {"dimensions":16,"base_dimensions":64}
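Partial RoPE rotates only the first 16 of each head's 64 channels and passes the rest through untouched, matching the dimensions/base_dimensions above. The half-split pairing convention below is an assumption:

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotary position embedding on the first `rot_dims` channels of one
    head vector; remaining channels are identity. `base` and the pairing
    convention (first half with second half) are assumptions."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:half], x[half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rotated, x[rot_dims:]])

x = np.random.default_rng(1).normal(size=64)
y = partial_rope(x, pos=7)
```

Because each channel pair undergoes a pure rotation, the norm of the rotated slice is preserved exactly.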
VE128
Value residual component in selected layers.
parameters: {"layers":[9,10],"dimension":128}
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997,"swa_every":50}
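The two averages compose naturally: an EMA of the live weights (decay 0.997) plus a uniform average of snapshots taken every 50 steps. Which checkpoints enter the "tight" SWA window is not stated, so the sketch below averages every 50th step uniformly:

```python
def ema_update(ema, params, decay=0.997):
    """Exponential moving average of weights (ema_decay = 0.997)."""
    return [decay * e + (1 - decay) * p for e, p in zip(ema, params)]

class TightSWA:
    """Running uniform average of snapshots taken every `every` steps
    (swa_every = 50). The snapshot-selection window is an assumption."""
    def __init__(self, every=50):
        self.every, self.n, self.avg = every, 0, None

    def maybe_update(self, step, params):
        if step % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = list(params)
        else:
            # incremental mean: avg += (p - avg) / n
            self.avg = [a + (p - a) / self.n for a, p in zip(self.avg, params)]

swa = TightSWA(every=50)
for step in range(0, 201):
    params = [float(step)]  # toy 1-parameter "model"
    swa.maybe_update(step, params)
```

The incremental-mean form avoids storing all snapshots; after steps 0..200 the toy average is the mean of steps 0, 50, 100, 150, 200.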
Compression
lzma
level: 7
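The ~15.9 MB artifact is the quantized weight payload run through standard LZMA; in Python this is a one-liner with the stdlib `lzma` module at preset 7:

```python
import lzma

# Stand-in for the packed int6 weight bytes.
payload = bytes(range(256)) * 1000
blob = lzma.compress(payload, preset=7)     # level: 7 from the record
restored = lzma.decompress(blob)
```

LZMA is lossless, so decompression recovers the quantized weights bit-exactly; the only lossy step in the pipeline is the quantization itself.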
Evaluation
sliding window eval
parameters: {"stride":64}
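With a stride-64 sliding window, each evaluation window rescores a long context but only the final 64 tokens count toward val_bpb, so every token is scored exactly once with near-maximal left context. The window length below is a placeholder; only the stride comes from the record:

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Return (start, end, n_scored) spans for sliding-window eval: the
    model sees tokens [start, end) but only the last n_scored positions of
    each span contribute to the loss."""
    spans, pos = [], 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        spans.append((start, end, end - pos))
        pos = end
    return spans

spans = sliding_windows(200, window=128, stride=64)
```

Smaller strides raise eval cost (more forward passes) in exchange for more context per scored token, which is why stride is reported as a parameter.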
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"chunk_tokens":32768,"epochs":"2/3/4 adaptive"}
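"Score-first" means each chunk's NLL is recorded with the weights as they stand, and only afterwards does the model train on that chunk, so the reported loss never sees a model adapted on its own eval data. A single-process sketch (the DDP-wide NLL synchronization is omitted, and the NLL thresholds for the "2/3/4 adaptive" epoch rule are illustrative, not from the PR):

```python
def score_first_ttt(chunks, nll_fn, adapt_fn, lr=0.002):
    """Legal TTT loop: score each chunk BEFORE updating on it, then pick
    2/3/4 adaptation epochs from the chunk's NLL (thresholds are
    placeholders). Returns the token-weighted mean NLL."""
    total_nll, total_tokens = 0.0, 0
    for tokens in chunks:
        nll = nll_fn(tokens)                 # score with pre-update weights
        total_nll += nll * len(tokens)
        total_tokens += len(tokens)
        epochs = 2 if nll < 1.0 else (3 if nll < 1.3 else 4)
        for _ in range(epochs):
            adapt_fn(tokens, lr)             # then train on the chunk
    return total_nll / total_tokens

# Toy model: a single scalar NLL that drops by 0.1 per adaptation step.
state = {"nll": 1.5}
log = []
bpb = score_first_ttt(
    chunks=[[0] * 4, [0] * 4],
    nll_fn=lambda toks: state["nll"],
    adapt_fn=lambda toks, lr: (log.append(lr), state.update(nll=state["nll"] - 0.1)),
)
```

In the real run chunks are 32768 tokens and lr = 0.002 per the parameters above; the toy chunks just exercise the control flow.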
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"parallel":true,"ns_steps":3,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
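Muon's core step replaces the raw gradient matrix with an approximately orthogonalized one via a quintic Newton-Schulz iteration (ns_steps = 3 here). The coefficients below follow the widely used Muon reference implementation; momentum, warmup, and the parallel sharding are omitted:

```python
import numpy as np

def newton_schulz_orth(G, steps=3):
    """Approximately orthogonalize G: drive all singular values toward 1
    with a quintic Newton-Schulz iteration (coefficients from the common
    Muon implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius norm => singular values <= 1
    if X.shape[0] > X.shape[1]:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.shape[0] > G.shape[1]:
        X = X.T
    return X

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(3, 3)))
V, _ = np.linalg.qr(rng.normal(size=(3, 3)))
G = U @ np.diag([1.0, 0.5, 0.3]) @ V.T   # known singular values 1.0, 0.5, 0.3
s = np.linalg.svd(newton_schulz_orth(G, steps=3), compute_uv=False)
```

The iteration acts on each singular value independently, so after a few steps the spread of singular values is squeezed toward 1 without ever computing an SVD, which is what makes it cheap enough to reuse inside test-time training.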
LR Schedule
cosine decay
parameters: {"warmdown_iters":3500}
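A plausible reading of the schedule, given only warmdown_iters: hold the LR flat, then decay it over the final 3500 iterations along a cosine curve. The total iteration count and base LR below are placeholders:

```python
import math

def lr_at(step, total_iters=6000, base_lr=1.0, warmdown_iters=3500):
    """Constant LR, then a cosine-shaped decay to zero over the final
    `warmdown_iters` steps. total_iters and base_lr are assumptions;
    warmdown_iters = 3500 comes from the record."""
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr
    frac = (step - start) / warmdown_iters
    return base_lr * 0.5 * (1 + math.cos(math.pi * frac))

schedule = [lr_at(s) for s in (0, 2500, 6000)]
```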
Regularization
LN scale
parameters: {"rule":"1/sqrt(layer+1)"}
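The LN-scale rule fixes each layer's LayerNorm gain to a depth-dependent constant rather than learning it, damping deeper layers' contributions:

```python
import math

def ln_scale(layer_index):
    """Fixed per-layer LayerNorm gain: 1/sqrt(layer+1), per the record's rule."""
    return 1.0 / math.sqrt(layer_index + 1)

scales = [ln_scale(i) for i in range(12)]
```

So layer 0 keeps unit scale, layer 3 is scaled by 0.5, and the gains decrease monotonically with depth.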

Novel Contributions

  • Muon-style Newton-Schulz optimization applied to test-time training
  • Entropy-adaptive TTT epoch selection based on chunk NLL
  • Score-first legal TTT protocol with global NLL synchronization across DDP ranks
  • GPTQ-lite int6 quantization with lzma compression
  • Combined stack of BigramHash, XSA, partial RoPE, EMA, and Tight SWA