PR #885 (open)
Record: LeakyReLU(0.9)² + N-gram Cache + Entropy-Reg QAT — val_bpb 0.9958 (3-seed mean)
by lolrazh

val_bpb: 0.9958
Architecture: Transformer
Optimizer: SGD
Artifact Size: ~14.0 MB
Training Techniques
Evaluation: sliding window eval
parameters: {"stride":null,"context_length":null}
Test-Time Training: score-first TTT
parameters: {"learning_rate":0.002,"momentum":0.9,"epochs_per_chunk":3,"chunk_size":32768}
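The score-first TTT loop can be sketched as follows: each evaluation chunk is scored with the current (frozen) weights before the model trains on that same chunk, so the reported loss never uses information from the chunk being scored. This is a minimal sketch; `score` and `train_step` are toy stand-ins, not the PR's code, while the learning rate, momentum, and epochs-per-chunk values come from the parameters above.

```python
# Hedged sketch of score-first test-time training (TTT): score each
# chunk with frozen weights FIRST, then adapt on it before moving on.
# The model here is a single scalar and the loss is a toy quadratic,
# both illustrative stand-ins for the real network and bpb loss.

def score(w, chunk):
    # toy "loss": squared distance of the weight from the chunk mean
    m = sum(chunk) / len(chunk)
    return (w - m) ** 2

def train_step(w, v, chunk, lr=0.002, momentum=0.9):
    # SGD with momentum on the toy loss above (values from the record)
    m = sum(chunk) / len(chunk)
    grad = 2 * (w - m)
    v = momentum * v + grad
    return w - lr * v, v

def score_first_ttt(chunks, w=0.0, epochs_per_chunk=3):
    v, losses = 0.0, []
    for chunk in chunks:
        losses.append(score(w, chunk))      # 1) score with frozen weights
        for _ in range(epochs_per_chunk):   # 2) then train on the same chunk
            w, v = train_step(w, v, chunk)
    return losses, w
```

The record's chunk_size of 32768 matches eval_length below, suggesting one TTT update cycle per evaluation window.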
Architecture: BigramHash
Backward-looking n-gram cache / hash tables used during evaluation to blend cached predictions with neural outputs.
parameters: {"orders":"2-7","buckets_per_order":4000000,"alpha":0.2}
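A backward-looking n-gram cache of this shape can be sketched as below: hashed context counts for orders 2-7 are updated as tokens stream by, and cached next-token frequencies are blended into the neural distribution with weight alpha. The class name, hashing scheme, and blending rule are illustrative assumptions; only the orders, bucket count, and alpha come from the parameters above.

```python
from collections import defaultdict

# Hedged sketch of a backward-looking n-gram cache: one hash table per
# order (2..7), each mapping a hashed context bucket to next-token
# counts. Counts only come from already-seen tokens, so blending them
# into the neural distribution at eval time leaks no future information.

class NgramCache:
    def __init__(self, orders=range(2, 8), buckets=4_000_000, alpha=0.2):
        self.orders = list(orders)
        self.buckets = buckets
        self.alpha = alpha
        # order -> {bucket -> {token: count}}
        self.tables = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def _bucket(self, ctx):
        return hash(ctx) % self.buckets

    def update(self, history, token):
        # record `token` under every context order long enough to apply
        for n in self.orders:
            if len(history) >= n - 1:
                ctx = tuple(history[-(n - 1):])
                self.tables[n][self._bucket(ctx)][token] += 1

    def blend(self, history, neural_probs):
        # average cached next-token frequencies over matching orders,
        # then mix: (1 - alpha) * neural + alpha * cache
        cache_probs, hits = defaultdict(float), 0
        for n in self.orders:
            if len(history) >= n - 1:
                counts = self.tables[n].get(self._bucket(tuple(history[-(n - 1):])))
                if counts:
                    total = sum(counts.values())
                    for tok, c in counts.items():
                        cache_probs[tok] += c / total
                    hits += 1
        if hits == 0:
            return dict(neural_probs)
        return {tok: (1 - self.alpha) * neural_probs.get(tok, 0.0)
                     + self.alpha * cache_probs[tok] / hits
                for tok in set(neural_probs) | set(cache_probs)}
```

With hashing, the 4M buckets per order bound memory regardless of how many distinct contexts occur; collisions simply merge counts.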
Architecture: LeakyReLU
Uses LeakyReLU with a negative slope of 0.9, followed by elementwise squaring.
parameters: {"negative_slope":0.9}
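As a sketch, the activation is simply a squared LeakyReLU. With a negative slope as high as 0.9 the pre-square map is nearly linear, so the result is close to x² with a mild asymmetry between positive and negative inputs.

```python
# Hedged sketch of the LeakyReLU(0.9)-squared activation, elementwise.
# negative_slope = 0.9 comes from the record's parameters.

def leaky_relu_sq(x, negative_slope=0.9):
    y = x if x >= 0 else negative_slope * x
    return y * y
```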
Quantization: mixed int5/int6
bits: null
scope: front3_back1_6_middle5
Quantization: QAT
bits: null
scope: all
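A possible reading of the mixed int5/int6 scheme: the scope string "front3_back1_6_middle5" is interpreted below as the first 3 and last 1 layers quantized at 6 bits, with the middle layers at 5 bits. That interpretation, the per-tensor symmetric quantizer, and all function names are assumptions, not confirmed by the PR.

```python
# Hedged sketch of layer-sensitive bit allocation plus a symmetric
# round-to-nearest quantizer. The front/back/middle split is a guess at
# the meaning of the record's scope string "front3_back1_6_middle5".

def bits_for_layer(i, n_layers, front=3, back=1, edge_bits=6, mid_bits=5):
    # edge layers (first `front`, last `back`) get more bits
    return edge_bits if i < front or i >= n_layers - back else mid_bits

def quantize(ws, bits):
    # symmetric per-tensor quantization onto a signed `bits`-bit grid
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in ws) / qmax or 1.0
    return [round(w / scale) * scale for w in ws]
```

Giving the embedding-adjacent front layers and the final layer an extra bit is a common pattern, since those layers tend to be the most quantization-sensitive.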
Regularization: entropy-reg QAT
parameters: {"loss_term":"residual.pow(2).mean()","applied_when":"lr_scale < 0.15"}
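The regularizer can be sketched as follows: once the LR scale drops below 0.15, an extra loss term penalizes the squared residual between full-precision weights and their quantized values, pulling weights toward the quantization grid late in training. The record gives the term as `residual.pow(2).mean()`; the pure-Python version below mirrors that, while the quantizer details are an assumption.

```python
# Hedged sketch of the quantization-residual penalty from the record:
# residual.pow(2).mean(), applied only when lr_scale < 0.15.

def quantize(w, bits=5):
    # illustrative symmetric round-to-nearest quantizer (assumption)
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in w) / qmax or 1.0
    return [round(x / scale) * scale for x in w]

def quant_residual_penalty(w, lr_scale, threshold=0.15, bits=5):
    if lr_scale >= threshold:
        return 0.0  # regularizer inactive until the LR has decayed far enough
    q = quantize(w, bits)
    # mean squared residual between weights and their quantized values
    return sum((x - qx) ** 2 for x, qx in zip(w, q)) / len(w)
```

Gating on the LR scale means the penalty only shapes the final phase of training, when weights are settling and can be nudged onto the grid without hurting the full-precision loss much.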
Regularization: LN scale
parameters: null
Optimizer: SGD
weight_decay: null
momentum: 0.9
other_params: {"grad_clip":1}
Weight Averaging: EMA + Tight SWA
parameters: {"ema_decay":0.997}
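The weight-averaging side can be sketched as below: an exponential moving average of the weights with decay 0.997 (from the record), plus SWA-style plain averaging over a short ("tight") tail window of checkpoints. The PR only gives `ema_decay`, so the tail-window averaging and its size are assumptions.

```python
# Hedged sketch of EMA plus "tight" SWA over weight vectors
# (lists of floats standing in for flattened model parameters).

def ema_update(ema, w, decay=0.997):
    # standard EMA step: ema <- decay * ema + (1 - decay) * w
    return [decay * e + (1 - decay) * x for e, x in zip(ema, w)]

def tight_swa(checkpoints, tail=5):
    # plain average of only the last `tail` checkpoints ("tight" window,
    # size is an assumption)
    tail_ckpts = checkpoints[-tail:]
    n = len(tail_ckpts)
    return [sum(ws) / n for ws in zip(*tail_ckpts)]
```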
LR Schedule: cosine decay
parameters: {"across_chunks":true}
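A sketch of the schedule: `"across_chunks": true` is read here as one cosine sweep spanning all TTT chunks rather than restarting per chunk, which also makes the `lr_scale < 0.15` gate above meaningful late in the sweep. The base LR of 0.002 comes from the TTT parameters; the min LR and the exact step accounting are assumptions.

```python
import math

# Hedged sketch of cosine LR decay over the whole TTT run
# (all chunks share one schedule instead of restarting each chunk).

def cosine_lr(step, total_steps, base_lr=0.002, min_lr=0.0):
    t = min(step / max(total_steps, 1), 1.0)  # progress in [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```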
Sequence Length
train_length: 2048
eval_length: 32768
Novel Contributions
- Backward-looking 7-gram evaluation cache with score-first updating
- Entropy-regularized QAT to reduce quantization gap
- Mixed int5/int6 quantization with layer-sensitive bit allocation
- LeakyReLU(0.9) squared activation choice
- Score-first test-time training on already-scored chunks