PR #885 (open)
Record: LeakyReLU(0.9)² + N-gram Cache + Entropy-Reg QAT — val_bpb 0.9958 (3-seed mean)
by lolrazh

val_bpb: 0.9958
Architecture: Transformer
Optimizer: SGD
Artifact Size: ~14.0 MB
Training Techniques
Evaluation: sliding window eval
parameters: {"stride":null,"context_length":null}
Test-Time Training: score-first TTT
parameters: {"learning_rate":0.002,"momentum":0.9,"epochs_per_chunk":3,"chunk_size":32768}
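The score-first TTT loop can be sketched as follows: each evaluation chunk is scored with the current (frozen) weights before the model trains on that same chunk, so the reported loss never uses information from the chunk being scored. This is a minimal sketch; `score` and `train_step` are toy stand-ins, not the PR's code, while the learning rate, momentum, and epochs-per-chunk values come from the parameters above.

```python
# Hedged sketch of score-first test-time training (TTT): score each
# chunk with frozen weights FIRST, then adapt on it before moving on.
# The model here is a single scalar and the loss is a toy quadratic,
# both illustrative stand-ins for the real network and bpb loss.

def score(w, chunk):
    # toy "loss": squared distance of the weight from the chunk mean
    m = sum(chunk) / len(chunk)
    return (w - m) ** 2

def train_step(w, v, chunk, lr=0.002, momentum=0.9):
    # SGD with momentum on the toy loss above (values from the record)
    m = sum(chunk) / len(chunk)
    grad = 2 * (w - m)
    v = momentum * v + grad
    return w - lr * v, v

def score_first_ttt(chunks, w=0.0, epochs_per_chunk=3):
    v, losses = 0.0, []
    for chunk in chunks:
        losses.append(score(w, chunk))      # 1) score with frozen weights
        for _ in range(epochs_per_chunk):   # 2) then train on the same chunk
            w, v = train_step(w, v, chunk)
    return losses, w
```

The record's chunk_size of 32768 matches eval_length below, suggesting one TTT update cycle per evaluation window.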
Architecture: BigramHash
Backward-looking n-gram cache / hash tables used during evaluation to blend cached predictions with neural outputs.
parameters: {"orders":"2-7","buckets_per_order":4000000,"alpha":0.2}
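A backward-looking n-gram cache of this shape can be sketched as below: hashed context counts for orders 2-7 are updated as tokens stream by, and cached next-token frequencies are blended into the neural distribution with weight alpha. The class name, hashing scheme, and blending rule are illustrative assumptions; only the orders, bucket count, and alpha come from the parameters above.

```python
from collections import defaultdict

# Hedged sketch of a backward-looking n-gram cache: one hash table per
# order (2..7), each mapping a hashed context bucket to next-token
# counts. Counts only come from already-seen tokens, so blending them
# into the neural distribution at eval time leaks no future information.

class NgramCache:
    def __init__(self, orders=range(2, 8), buckets=4_000_000, alpha=0.2):
        self.orders = list(orders)
        self.buckets = buckets
        self.alpha = alpha
        # order -> {bucket -> {token: count}}
        self.tables = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def _bucket(self, ctx):
        return hash(ctx) % self.buckets

    def update(self, history, token):
        # record `token` under every context order long enough to apply
        for n in self.orders:
            if len(history) >= n - 1:
                ctx = tuple(history[-(n - 1):])
                self.tables[n][self._bucket(ctx)][token] += 1

    def blend(self, history, neural_probs):
        # average cached next-token frequencies over matching orders,
        # then mix: (1 - alpha) * neural + alpha * cache
        cache_probs, hits = defaultdict(float), 0
        for n in self.orders:
            if len(history) >= n - 1:
                counts = self.tables[n].get(self._bucket(tuple(history[-(n - 1):])))
                if counts:
                    total = sum(counts.values())
                    for tok, c in counts.items():
                        cache_probs[tok] += c / total
                    hits += 1
        if hits == 0:
            return dict(neural_probs)
        return {tok: (1 - self.alpha) * neural_probs.get(tok, 0.0)
                     + self.alpha * cache_probs[tok] / hits
                for tok in set(neural_probs) | set(cache_probs)}
```

With hashing, the 4M buckets per order bound memory regardless of how many distinct contexts occur; collisions simply merge counts.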
Architecture: LeakyReLU
Uses LeakyReLU with a negative slope of 0.9, followed by elementwise squaring.
parameters: {"negative_slope":0.9}
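As a sketch, the activation is simply a squared LeakyReLU. With a negative slope as high as 0.9 the pre-square map is nearly linear, so the result is close to x² with a mild asymmetry between positive and negative inputs.

```python
# Hedged sketch of the LeakyReLU(0.9)-squared activation, elementwise.
# negative_slope = 0.9 comes from the record's parameters.

def leaky_relu_sq(x, negative_slope=0.9):
    y = x if x >= 0 else negative_slope * x
    return y * y
```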
Quantization: mixed int5/int6
bits: null
scope: front3_back1_6_middle5
Quantization: QAT
bits: null
scope: all
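A possible reading of the mixed int5/int6 scheme: the scope string "front3_back1_6_middle5" is interpreted below as the first 3 and last 1 layers quantized at 6 bits, with the middle layers at 5 bits. That interpretation, the per-tensor symmetric quantizer, and all function names are assumptions, not confirmed by the PR.

```python
# Hedged sketch of layer-sensitive bit allocation plus a symmetric
# round-to-nearest quantizer. The front/back/middle split is a guess at
# the meaning of the record's scope string "front3_back1_6_middle5".

def bits_for_layer(i, n_layers, front=3, back=1, edge_bits=6, mid_bits=5):
    # edge layers (first `front`, last `back`) get more bits
    return edge_bits if i < front or i >= n_layers - back else mid_bits

def quantize(ws, bits):
    # symmetric per-tensor quantization onto a signed `bits`-bit grid
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in ws) / qmax or 1.0
    return [round(w / scale) * scale for w in ws]
```

Giving the embedding-adjacent front layers and the final layer an extra bit is a common pattern, since those layers tend to be the most quantization-sensitive.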
Regularization: entropy-reg QAT
parameters: {"loss_term":"residual.pow(2).mean()","applied_when":"lr_scale < 0.15"}
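The regularizer can be sketched as follows: once the LR scale drops below 0.15, an extra loss term penalizes the squared residual between full-precision weights and their quantized values, pulling weights toward the quantization grid late in training. The record gives the term as `residual.pow(2).mean()`; the pure-Python version below mirrors that, while the quantizer details are an assumption.

```python
# Hedged sketch of the quantization-residual penalty from the record:
# residual.pow(2).mean(), applied only when lr_scale < 0.15.

def quantize(w, bits=5):
    # illustrative symmetric round-to-nearest quantizer (assumption)
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in w) / qmax or 1.0
    return [round(x / scale) * scale for x in w]

def quant_residual_penalty(w, lr_scale, threshold=0.15, bits=5):
    if lr_scale >= threshold:
        return 0.0  # regularizer inactive until the LR has decayed far enough
    q = quantize(w, bits)
    # mean squared residual between weights and their quantized values
    return sum((x - qx) ** 2 for x, qx in zip(w, q)) / len(w)
```

Gating on the LR scale means the penalty only shapes the final phase of training, when weights are settling and can be nudged onto the grid without hurting the full-precision loss much.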
Regularization: LN scale
parameters: null
Optimizer: SGD
weight_decay: null
momentum: 0.9
other_params: {"grad_clip":1}
Weight Averaging: EMA + Tight SWA
parameters: {"ema_decay":0.997}
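The weight-averaging side can be sketched as below: an exponential moving average of the weights with decay 0.997 (from the record), plus SWA-style plain averaging over a short ("tight") tail window of checkpoints. The PR only gives `ema_decay`, so the tail-window averaging and its size are assumptions.

```python
# Hedged sketch of EMA plus "tight" SWA over weight vectors
# (lists of floats standing in for flattened model parameters).

def ema_update(ema, w, decay=0.997):
    # standard EMA step: ema <- decay * ema + (1 - decay) * w
    return [decay * e + (1 - decay) * x for e, x in zip(ema, w)]

def tight_swa(checkpoints, tail=5):
    # plain average of only the last `tail` checkpoints ("tight" window,
    # size is an assumption)
    tail_ckpts = checkpoints[-tail:]
    n = len(tail_ckpts)
    return [sum(ws) / n for ws in zip(*tail_ckpts)]
```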
LR Schedule: cosine decay
parameters: {"across_chunks":true}
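A sketch of the schedule: `"across_chunks": true` is read here as one cosine sweep spanning all TTT chunks rather than restarting per chunk, which also makes the `lr_scale < 0.15` gate above meaningful late in the sweep. The base LR of 0.002 comes from the TTT parameters; the min LR and the exact step accounting are assumptions.

```python
import math

# Hedged sketch of cosine LR decay over the whole TTT run
# (all chunks share one schedule instead of restarting each chunk).

def cosine_lr(step, total_steps, base_lr=0.002, min_lr=0.0):
    t = min(step / max(total_steps, 1), 1.0)  # progress in [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```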
Sequence Length
train_length: 2048
eval_length: 32768
Novel Contributions
- Backward-looking 7-gram evaluation cache with score-first updating
- Entropy-regularized QAT to reduce quantization gap
- Mixed int5/int6 quantization with layer-sensitive bit allocation
- LeakyReLU(0.9) squared activation choice
- Score-first test-time training on already-scored chunks