PR #1033

Status: open

Record: 0.4311 BPB - Complementary Training + Backoff N-gram Mixer + TTT

by Naazimsnh02
val_bpb: 0.4311
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.9 MB

Training Techniques

Architecture
GQA
Grouped-query attention with 8 query heads sharing 4 key/value heads.
parameters: {"heads":"8/4"}
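A minimal NumPy sketch of the listed 8/4 grouped-query attention: 8 query heads share 4 key/value heads, halving the K/V projection size. The head dimension, causal masking, and scaling here are illustrative assumptions, not details from the submission.

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_q_heads=8, n_kv_heads=4):
    """Grouped-query attention sketch: each K/V head serves
    n_q_heads // n_kv_heads = 2 query heads."""
    T, d = x.shape
    hd = d // n_q_heads                          # per-head dimension
    q = (x @ wq).reshape(T, n_q_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)              # expand K/V to match query heads
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(hd)
    mask = np.tril(np.ones((T, T), dtype=bool))  # causal mask
    scores = np.where(mask, scores, -1e9)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    out = np.einsum('hts,shd->thd', w, v)
    return out.reshape(T, d)
```

Note the K/V weight matrices have half the columns of the query matrix, which is where GQA's parameter and cache savings come from.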
BigramHash
Bigram hash cache/embedding used in the model and eval stack.
parameters: {"buckets":2048}
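One plausible reading of the bigram hash embedding, as a sketch: each (previous, current) token pair is hashed into one of the listed 2048 buckets, which indexes a small learned table. The hash multiplier and embedding dimension below are illustrative assumptions.

```python
import numpy as np

N_BUCKETS = 2048   # matches the listed bucket count
EMB_DIM = 32       # illustrative; the real dimension isn't stated

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Mix the token pair into a bucket; the odd multiplier is an
    # arbitrary constant chosen here for illustration.
    return (prev_tok * 1000003 + cur_tok) % N_BUCKETS

# Learned table in the real model; zeros here just for shape.
bigram_emb = np.zeros((N_BUCKETS, EMB_DIM))

def bigram_feature(tokens, t):
    """Feature for position t from the (token[t-1], token[t]) pair."""
    return bigram_emb[bigram_bucket(tokens[t - 1], tokens[t])]
```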
XSA
XSA attention applied in the last 4 layers of the architecture.
parameters: {"last_layers":4}
LeakyReLU
LeakyReLU squared activation.
parameters: {"slope":0.5}
Partial RoPE
Rotary positional embedding applied to a subset of dimensions.
parameters: {"dimensions":16}
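Partial RoPE can be sketched as rotating only the first 16 dimensions of each per-head query/key vector and passing the rest through unrotated. The pairing convention (first half against second half of the rotated slice) and the base frequency are assumptions.

```python
import numpy as np

def partial_rope(q, pos, rot_dims=16, base=10000.0):
    """Apply rotary embedding to the first `rot_dims` dimensions of a
    per-head query/key vector; remaining dims pass through unchanged."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)      # one frequency per pair
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = q[:half], q[half:rot_dims]            # the rotated slice, in pairs
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos])
    return np.concatenate([rotated, q[rot_dims:]]) # rest is position-free
```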
U-Net skip connections
Skip connections in a U-Net style arrangement.
parameters: null
SmearGate
Learned gate that smears each token's representation with the previous token's.
parameters: null
Value Residual
Value residual learning / value residual connections.
parameters: null
depth recurrence
Repeated layers to create virtual depth without extra parameters.
parameters: {"layers":[4,5]}
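Depth recurrence, as listed, reuses layers 4 and 5 with shared weights so the effective depth grows without adding parameters. A minimal sketch; the repeat count of 2 is an assumption (only the layer indices are listed).

```python
def forward_with_recurrence(x, layers, recur_idx=(4, 5), repeats=2):
    """Run a layer stack, passing through the recurrent layers `repeats`
    times each. Weights are shared across repeats, so virtual depth grows
    but the parameter count does not."""
    for i, layer in enumerate(layers):
        n = repeats if i in recur_idx else 1
        for _ in range(n):
            x = layer(x)
    return x
```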
VE128
128-dimensional value embedding module applied at layers 9 and 10.
parameters: {"dimension":128,"layers":[9,10]}
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
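The EMA + SWA combination can be sketched as two running statistics over the weights: an exponential moving average updated every step with the listed decay of 0.997, and an SWA-style plain average snapshotted every 50 steps. Shown on a scalar weight for clarity; how the two averages are combined at the end is not stated.

```python
class AveragedWeights:
    """Track an exponential moving average every step and a simple
    (SWA-style) average snapshotted every `swa_every` steps."""
    def __init__(self, w0, ema_decay=0.997, swa_every=50):
        self.ema = float(w0)
        self.swa_sum, self.swa_n = 0.0, 0
        self.decay, self.every, self.step = ema_decay, swa_every, 0

    def update(self, w):
        self.step += 1
        self.ema = self.decay * self.ema + (1 - self.decay) * w
        if self.step % self.every == 0:       # take an SWA snapshot
            self.swa_sum += w
            self.swa_n += 1

    @property
    def swa(self):
        return self.swa_sum / max(self.swa_n, 1)
```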
Quantization
GPTQ-lite
parameters: {"bits":6,"scope":"all"}
Compression
lzma
parameters: {"level":null}
Evaluation
sliding window eval
parameters: {"stride":64}
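Sliding-window evaluation with stride 64 can be sketched as follows: the first window scores all of its tokens, and each subsequent window scores only its final 64 tokens, so those tokens get near-full left context. The window length of 1024 is an assumption; only the stride is listed.

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Plan eval windows as (ctx_start, end, score_from) triples:
    tokens in [score_from, end) are scored, earlier ones are context.
    Every token is scored exactly once across all windows."""
    spans = []
    scored = min(window, n_tokens)
    spans.append((0, scored, 0))
    while scored < n_tokens:
        new_end = min(scored + stride, n_tokens)
        spans.append((max(0, new_end - window), new_end, scored))
        scored = new_end
    return spans
```

The trade-off is compute: a small stride means many overlapping forward passes per token in exchange for longer effective context at each scored position.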
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01,"epochs":3}
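LoRA test-time training freezes the base weights and adapts only a low-rank delta. A minimal NumPy sketch with the listed rank 8 and learning rate 0.01; the dimensions, initialization scales, and the single-sample SGD step are illustrative assumptions, not the submission's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 8

W = rng.normal(size=(d_out, d_in)) * 0.05   # frozen base weight
A = rng.normal(size=(rank, d_in)) * 0.01    # LoRA factors: B @ A starts at 0,
B = np.zeros((d_out, rank))                 # so TTT begins from the base model

def adapted_forward(x):
    return W @ x + B @ (A @ x)              # only A and B change at test time

def ttt_step(x, grad_out, lr=0.01):
    """One SGD step on the LoRA factors given dL/dy for one input."""
    global A, B
    gB = np.outer(grad_out, A @ x)          # dL/dB
    gA = np.outer(B.T @ grad_out, x)        # dL/dA (uses pre-update B)
    B -= lr * gB
    A -= lr * gA
```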
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
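A warmdown schedule holds the learning rate constant and then decays it to zero over the final 3500 steps. The linear decay shape below is an assumption; only the step count is listed.

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then a linear 'warmdown' to 0 over the
    final warmdown_steps."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```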
Other
other
Complementary training that downweights tokens a bigram predictor would already get right.
parameters: {"complement_alpha":0.5}
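One plausible reading of complementary training, as a sketch: tokens the bigram predictor already gets right have their loss downweighted by the listed complement_alpha of 0.5, pushing the neural model's capacity toward tokens the statistical cache cannot handle. Whether the weight is applied exactly this way is an assumption.

```python
def complementary_weights(targets, bigram_preds, complement_alpha=0.5):
    """Per-token loss weights: downweight tokens the bigram
    predictor already predicts correctly."""
    return [1.0 - complement_alpha if p == t else 1.0
            for t, p in zip(targets, bigram_preds)]
```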
other
Entropy-adaptive alpha for mixing neural and n-gram predictions during evaluation.
parameters: {"formula":"0.20 + 0.55 * sigmoid(2 * (H - 3.0))"}
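The listed formula maps the model's predictive entropy H to a mixing weight between 0.20 and 0.75: confident (low-entropy) positions mix in less of the other distribution, uncertain ones more. Which distribution alpha weights, and the entropy units (nats assumed), are assumptions in this sketch.

```python
import math

def mix_alpha(H):
    """Mixing weight from the listed formula:
    0.20 + 0.55 * sigmoid(2 * (H - 3.0))."""
    return 0.20 + 0.55 / (1.0 + math.exp(-2.0 * (H - 3.0)))

def mix(p_neural, p_ngram, H):
    """Blend two next-token distributions; alpha is assumed here to
    weight the n-gram side."""
    a = mix_alpha(H)
    return [(1 - a) * pn + a * pg for pn, pg in zip(p_neural, p_ngram)]
```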
other
Backoff n-gram mixer with orders 2-10 and greedy cascade.
parameters: {"orders":"2-10","buckets":4000000}
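The backoff mixer's greedy cascade can be sketched as: try the longest n-gram order first (10 down to 2) and use the first order whose hashed context has observed counts. Storing full per-order tables rather than a shared 4M-bucket array, and the Python `hash`-based bucketing, are simplifying assumptions.

```python
from collections import defaultdict

class BackoffNgram:
    """Greedy backoff cascade over n-gram orders 2-10 with hashed contexts."""
    def __init__(self, orders=range(2, 11), buckets=4_000_000):
        self.orders = sorted(orders, reverse=True)   # longest order first
        self.buckets = buckets
        self.counts = defaultdict(lambda: defaultdict(int))

    def _key(self, ctx):
        return hash(ctx) % self.buckets

    def train(self, tokens):
        for i in range(1, len(tokens)):
            for n in self.orders:
                if i - (n - 1) < 0:
                    continue
                ctx = tuple(tokens[i - (n - 1):i])
                self.counts[(n, self._key(ctx))][tokens[i]] += 1

    def predict(self, context):
        for n in self.orders:                        # greedy: longest match wins
            ctx = tuple(context[-(n - 1):])
            if len(ctx) < n - 1:
                continue
            dist = self.counts.get((n, self._key(ctx)))
            if dist:
                total = sum(dist.values())
                return {t: c / total for t, c in dist.items()}
        return None                                  # no order matched
```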

Novel Contributions

  • Complementary training that focuses the neural model on tokens statistical caches cannot predict well
  • Backoff n-gram mixer with adaptive entropy-based alpha
  • Score-first LoRA TTT on already-evaluated tokens
  • Depth recurrence to increase virtual depth without extra parameters