PR #859

open

Record: 0.1582 BPB — Learned Mixer Head + No TTT + Matrix LR 0.03

val_bpb: 0.1582
Architecture: Transformer
Optimizer:
Artifact Size: 15.59 MB

Training Techniques

Architecture
learned mixer head
A Linear(512 → 7) head predicts per-token expert mixing weights over the neural model and n-gram orders 2-7.
parameters: {"input_dim":512,"output_dim":7}
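A minimal sketch of such a head, assuming the seven experts' per-token next-token distributions are stacked along their own axis before mixing (the class and argument names are illustrative, not from this PR):

```python
import torch
import torch.nn as nn

class LearnedMixerHead(nn.Module):
    """Sketch: a Linear(512 -> 7) over the final hidden state yields
    per-token mixing weights over 7 experts (the neural model plus
    n-gram orders 2-7), which then blend the expert distributions."""

    def __init__(self, input_dim: int = 512, num_experts: int = 7):
        super().__init__()
        self.proj = nn.Linear(input_dim, num_experts)

    def forward(self, hidden: torch.Tensor, expert_probs: torch.Tensor) -> torch.Tensor:
        # hidden:       (batch, seq, input_dim) final hidden states
        # expert_probs: (batch, seq, num_experts, vocab) per-expert distributions
        weights = self.proj(hidden).softmax(dim=-1)            # (B, T, E)
        # Convex combination of expert distributions per token.
        return (weights.unsqueeze(-1) * expert_probs).sum(dim=-2)  # (B, T, vocab)
```

Because the weights are a softmax and each expert emits a normalized distribution, the mixed output is itself a valid distribution per token.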
frozen n-gram oracle
Precomputed n-gram tables from training data are used as a frozen lookup oracle during training.
parameters: {"orders":"2-7"}
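A plain-Python sketch of precomputing such tables and querying them as a frozen oracle (the function names and uniform fallback are assumptions for illustration):

```python
from collections import Counter, defaultdict

def build_ngram_tables(tokens, orders=range(2, 8)):
    """Precompute frozen n-gram tables: for each order n in 2-7, map an
    (n-1)-token context to counts of the token that follows it (sketch)."""
    tables = {n: defaultdict(Counter) for n in orders}
    for n in orders:
        for i in range(len(tokens) - n + 1):
            ctx = tuple(tokens[i:i + n - 1])
            tables[n][ctx][tokens[i + n - 1]] += 1
    return tables

def oracle_probs(tables, context, n, vocab_size):
    """Look up the order-n next-token distribution for the trailing
    context; fall back to uniform when the context was never seen."""
    ctx = tuple(context[-(n - 1):])
    counts = tables[n].get(ctx)
    if not counts:
        return [1.0 / vocab_size] * vocab_size
    total = sum(counts.values())
    return [counts.get(t, 0) / total for t in range(vocab_size)]
```

The tables are built once from the training data and never updated, which is what makes the oracle "frozen".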
MLP3.5x
Transformer MLP width is 3.5x the model dimension.
parameters: {"multiplier":3.5}
MHA 8/8
Multi-head attention configuration with 8 attention heads over 8 layers.
parameters: {"layers":8,"heads":8}
Quantization
mixed int5/int6
bits: null
scope: model weights
GPTQ
bits: null
scope: model weights
Weight Averaging
EMA
parameters: {"decay":0.997}
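EMA weight averaging keeps a shadow copy of the parameters that is blended toward the live weights after every optimizer step. A minimal sketch over flat lists of floats, using the decay from this PR:

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step (sketch): ema <- decay * ema + (1 - decay) * param.
    The averaged weights, not the live ones, are used for evaluation."""
    for i, (e, p) in enumerate(zip(ema_params, params)):
        ema_params[i] = decay * e + (1.0 - decay) * p
    return ema_params
```

With decay 0.997, the average effectively spans roughly the last ~333 steps (1 / (1 - decay)).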
Compression
zstd
level: null
Evaluation
score-first backward-looking n-gram cache
parameters: {"orders":"2-7"}
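The precise "score-first" mechanics are not described here; the backward-looking part means each position is scored using only tokens before it. A self-contained sketch of that lookup, assuming `tables` maps each order to a dict from context tuples to next-token counts:

```python
def cached_ngram_scores(tokens, pos, tables, orders=range(2, 8)):
    """Backward-looking lookup (sketch): for position `pos`, fetch the
    cached next-token count table for each order n using only the
    n-1 tokens strictly before `pos`; None when the context is too
    short or unseen. The per-order scores would then be combined
    with the learned mixing weights."""
    out = {}
    for n in orders:
        ctx = tuple(tokens[max(0, pos - (n - 1)):pos])
        out[n] = tables[n].get(ctx) if len(ctx) == n - 1 else None
    return out
```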
Test-Time Training
none
parameters: {"ttt_epochs":0}
LR Schedule
matrix learning rate tuning
parameters: {"matrix_lr":0.03}
Other
other
Systematic hyperparameter screening across 79+ experiments to find the improved matrix learning rate.
parameters: {"experiments":79}

Novel Contributions

  • Learned mixer head that predicts per-token expert weights
  • Removing TTT entirely while improving performance
  • Increasing MATRIX_LR from 0.025 to 0.03
  • Systematic screening of 79+ experiments to discover the better learning rate
  • Backward-looking score-first n-gram cache with learned mixing weights