PR #850
open
Record: 0.3212 BPB — Complementary N-gram 65K + Int5 GPTQ + LoRA TTT
by callithyia
val_bpb
0.3212
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~14.9 MB
Training Techniques
Architecture
BigramHash
Hashed bigram embedding with 4096 buckets.
parameters: {"buckets":4096}
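A minimal sketch of a hashed bigram embedding. The bucket count comes from the parameters; the hash mix, embedding width, and how the feature is combined with the token embedding are assumptions.

```python
import numpy as np

BUCKETS = 4096  # from parameters: {"buckets": 4096}
DIM = 64        # embedding width is an assumption; the PR does not state it

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Hypothetical hash: mix the two token ids, then reduce mod the bucket count.
    h = (prev_tok * 1000003 + cur_tok) * 2654435761
    return (h >> 16) % BUCKETS

rng = np.random.default_rng(0)
bigram_emb = rng.normal(0.0, 0.02, size=(BUCKETS, DIM))

def bigram_features(tokens: list[int]) -> np.ndarray:
    # Position 0 has no predecessor; a zero vector is used there.
    out = np.zeros((len(tokens), DIM))
    for i in range(1, len(tokens)):
        out[i] = bigram_emb[bigram_bucket(tokens[i - 1], tokens[i])]
    return out

feats = bigram_features([5, 17, 17, 99])
```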
MLP3x
MLP with 3.0x expansion (hidden size 1536) and squared LeakyReLU activation (negative slope 0.9).
parameters: {"expansion":3,"hidden":1536}
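A sketch of the MLP block under the stated parameters (hidden 1536 with 3x expansion implies model width 512). Whether the sign is preserved after squaring is not stated; plain squaring is assumed here.

```python
import numpy as np

D_MODEL = 512          # implied by hidden=1536 with 3x expansion
HIDDEN = 3 * D_MODEL   # 1536, matching parameters

def act(x: np.ndarray) -> np.ndarray:
    # Squared LeakyReLU with negative slope 0.9, per the description.
    lrelu = np.where(x > 0, x, 0.9 * x)
    return lrelu ** 2

rng = np.random.default_rng(0)
w_in = rng.normal(0.0, D_MODEL ** -0.5, (D_MODEL, HIDDEN))
w_out = rng.normal(0.0, HIDDEN ** -0.5, (HIDDEN, D_MODEL))

def mlp3x(x: np.ndarray) -> np.ndarray:
    # Biases omitted for brevity; their presence is not stated in the PR.
    return act(x @ w_in) @ w_out

y = mlp3x(rng.normal(size=(4, D_MODEL)))
```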
XSA
XSA applied on the last 4 layers.
parameters: {"layers":4}
Value Residual Learning
Value Residual Learning applied across layers 1-10.
parameters: {"layers":[1,10]}
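A minimal sketch of value residual learning: each layer's attention values are mixed with the first layer's values. A fixed scalar mixing weight is used for illustration; in practice it is typically a learnable per-layer parameter.

```python
import numpy as np

def value_residual(v_layer: np.ndarray, v_first: np.ndarray, lam: float) -> np.ndarray:
    # Mix the current layer's value projection with layer 1's value projection.
    return lam * v_layer + (1.0 - lam) * v_first

v_first = np.ones((2, 3))   # values from the first layer
v_layer = np.zeros((2, 3))  # values at a later layer
mixed = value_residual(v_layer, v_first, lam=0.7)
```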
Gated Attention
Gated Attention with bias 4.0 on all layers.
parameters: {"bias":4}
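A sketch of a sigmoid output gate on attention, assuming an elementwise gate computed from the layer input; with the bias initialized to 4.0, the gate starts near sigmoid(4) ≈ 0.98, i.e. almost fully open. The zero weight init and gate placement are assumptions.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

D = 8
w_gate = np.zeros((D, D))   # zero-init gate weights: an assumption
b_gate = np.full(D, 4.0)    # bias 4.0, from parameters

def gated_attention_output(attn_out: np.ndarray, x: np.ndarray) -> np.ndarray:
    # Elementwise sigmoid gate on the attention output, conditioned on the input.
    g = sigmoid(x @ w_gate + b_gate)
    return g * attn_out

y = gated_attention_output(np.ones((2, D)), np.zeros((2, D)))
```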
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz":5,"per_group_banking":true,"encoder_lr":0.025,"decoder_lr":0.05}
Weight Averaging
Polyak averaging
parameters: {"decay":0.998}
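Polyak averaging keeps an exponential moving average of the weights and evaluates with the averaged copy. A minimal sketch with the stated decay:

```python
def polyak_update(avg: dict, params: dict, decay: float = 0.998) -> dict:
    # avg <- decay * avg + (1 - decay) * params, per tensor.
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in avg}

avg = {"w": 0.0}
for _ in range(3):
    # After k steps toward a constant target of 1.0, avg = 1 - decay**k.
    avg = polyak_update(avg, {"w": 1.0})
```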
Compression
lzma
level: 9
Evaluation
order-9 n-gram backoff cache
parameters: {"orders":[2,9],"chunk_size":65536,"cache_buckets":4000000,"entropy_adaptive_alpha_blending":true}
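A simplified sketch of a hashed n-gram backoff cache over orders 2–9 (the PR refreshes it every 65,536-token chunk). The hash, the backoff rule, and the fixed blending alpha are illustrative assumptions; the PR adapts alpha to entropy.

```python
from collections import defaultdict

ORDERS = range(2, 10)       # backoff orders 2..9, per parameters
CACHE_BUCKETS = 4_000_000   # cache_buckets, per parameters

# counts[n][hashed (n-1)-token context][next token] -> count
counts = {n: defaultdict(lambda: defaultdict(int)) for n in ORDERS}

def ctx_key(tokens: list[int], n: int) -> int:
    # Hash the (n-1)-token context into a fixed bucket space.
    return hash(tuple(tokens[-(n - 1):])) % CACHE_BUCKETS

def update(tokens: list[int], nxt: int) -> None:
    for n in ORDERS:
        if len(tokens) >= n - 1:
            counts[n][ctx_key(tokens, n)][nxt] += 1

def ngram_prob(tokens: list[int], nxt: int):
    # Back off from order 9 down to order 2: use the highest order with counts.
    for n in reversed(ORDERS):
        if len(tokens) >= n - 1:
            bucket = counts[n].get(ctx_key(tokens, n))
            if bucket:
                return bucket.get(nxt, 0) / sum(bucket.values())
    return None  # no context seen at any order

def blend(p_model: float, p_ngram: float, alpha: float) -> float:
    # The PR computes alpha adaptively from entropy; fixed here for illustration.
    return alpha * p_ngram + (1.0 - alpha) * p_model

update([1, 2, 3], 4)
p = ngram_prob([1, 2, 3], 4)
```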
Test-Time Training
LoRA TTT
parameters: {"rank":8,"qv_blocks":[9,10],"learning_rate":0.003,"polyak_decay":0.998,"score_first":true}
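A sketch of LoRA test-time training: a frozen weight plus a rank-8 adapter, updated online while evaluating. The stand-in loss, dimensions, and zero-init of one adapter factor are assumptions; the "score-first" protocol (score the chunk before the update) is from the PR.

```python
import numpy as np

RANK, LR, D = 8, 0.003, 32   # rank and learning_rate from parameters; D assumed
rng = np.random.default_rng(0)

# Frozen base weight plus a low-rank adapter: W_eff = W + B @ A.
W = rng.normal(0.0, D ** -0.5, (D, D))
A = rng.normal(0.0, 0.01, (RANK, D))
B = np.zeros((D, RANK))      # zero-init B so the adapter starts as a no-op

def forward(x: np.ndarray) -> np.ndarray:
    return x @ (W + B @ A).T

def ttt_step(chunk: np.ndarray) -> float:
    global A, B
    # "Score-first": the chunk is scored before the gradient step, so no
    # token is scored with weights that have already been updated on it.
    y = forward(chunk)
    score = float(np.mean(y ** 2))     # stand-in loss; the PR uses the LM loss
    gW = 2.0 * y.T @ chunk / y.size    # dL/dW_eff for the stand-in loss
    gB, gA = gW @ A.T, B.T @ gW        # chain rule through W_eff = W + B @ A
    B -= LR * gB
    A -= LR * gA
    # The PR also Polyak-averages the adapters (decay 0.998), omitted here.
    return score

s = ttt_step(rng.normal(size=(4, D)))
```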
LR Schedule
WSD
parameters: {"stable_fraction":0.75,"decay":"cosine"}
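A sketch of a warmup-stable-decay (WSD) schedule. The stable fraction (0.75) and cosine decay tail come from the parameters; the warmup fraction is an assumption.

```python
import math

def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.05, stable_frac: float = 0.75) -> float:
    warmup = int(warmup_frac * total_steps)
    stable_end = int((warmup_frac + stable_frac) * total_steps)
    if step < warmup:
        # Linear warmup to peak_lr.
        return peak_lr * (step + 1) / warmup
    if step < stable_end:
        # Hold at peak_lr for the stable fraction of training.
        return peak_lr
    # Cosine decay from peak_lr toward 0 over the remaining steps.
    t = (step - stable_end) / max(1, total_steps - stable_end)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * t))
```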
Quantization
GPTQ
bits: 5
scope: all
Regularization
EMA
parameters: {"decay":0.997}
Other
other
Complementary training that downweights tokens the bigram model already predicts well.
parameters: {"alpha":0.5}
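A sketch of complementary loss weighting: tokens the bigram side already predicts well contribute less to the training loss, pushing model capacity toward what the n-gram cache cannot cover. The linear weighting function is an assumption; alpha is from the parameters.

```python
import numpy as np

ALPHA = 0.5  # from parameters

def complementary_weights(bigram_prob: np.ndarray, alpha: float = ALPHA) -> np.ndarray:
    # Weight 1.0 for tokens the bigram cannot predict, down to 1 - alpha
    # for tokens it predicts with certainty.
    return 1.0 - alpha * bigram_prob

w = complementary_weights(np.array([0.0, 0.5, 1.0]))
per_token_loss = np.array([2.0, 2.0, 2.0])
weighted_loss = float((w * per_token_loss).mean())
```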
other
Late quantization-aware training with Soft-Round, triggered near the end of training.
parameters: {"trigger_fraction":0.85}
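A sketch of soft rounding as a differentiable surrogate for round(), switched on at the stated trigger fraction. The sharpness value and the exact placement in the forward pass are assumptions.

```python
import numpy as np

def soft_round(x: np.ndarray, t: float = 10.0) -> np.ndarray:
    # Differentiable rounding surrogate: interpolates between the identity
    # (t -> 0) and hard rounding (t -> inf).
    f = np.floor(x)
    r = x - f
    return f + 0.5 * np.tanh(t * (r - 0.5)) / np.tanh(t / 2.0) + 0.5

TRIGGER = 0.85  # trigger_fraction from parameters

def maybe_quantize(w: np.ndarray, step: int, total_steps: int) -> np.ndarray:
    # Late QAT: pass weights through soft_round only after 85% of training.
    if step / total_steps >= TRIGGER:
        return soft_round(w)
    return w
```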
Novel Contributions
- Complementary training combined with an order-9 n-gram cache
- 65K-token chunks for more frequent cache refreshes
- Full Hessian GPTQ int5 with LZMA compression
- LoRA test-time training with Polyak averaging and score-first backward-looking protocol
- Per-order entropy centers and multipliers for n-gram alpha computation