PR #1185
open [10min_16mb] 0.9641 BPB: LeakyReLU² + Score-First TTT + N-gram Backoff Cache
by skoustav35 (View on GitHub)
val_bpb
0.9641
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,989,583 bytes
Training Techniques
Architecture
LeakyReLU
Uses LeakyReLU squared activation in the MLP.
parameters: {"squared":true,"negative_slope":0.5}
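A minimal sketch of the squared LeakyReLU activation with the listed negative_slope of 0.5. Whether the implementation preserves the sign after squaring is not stated; here the output is taken as the plain square, so the negative branch contributes small positive values.

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    # LeakyReLU: identity for x >= 0, negative_slope * x otherwise,
    # followed by an elementwise square (cf. squared-ReLU activations).
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```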
XSA
Exclusive self-attention used in later layers.
parameters: {"layers":[6,7,8,9,10]}
weight tying
Tied input and output embeddings.
parameters: null
Value Residual
Adds value residual connections.
parameters: null
Gated Attention
Uses gated attention blocks.
parameters: null
SmearGate
Applies SmearGate gating mechanism.
parameters: null
VE128
Value embedding enabled on selected layers.
parameters: {"layers":[8,9,10],"dimension":128}
BigramHash
Uses hashed bigram features.
parameters: {"size":2048}
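Hashed bigram features can be sketched as hashing the (previous, current) token pair into one of the listed 2048 buckets, which then indexes a learned embedding; the multiplicative hash constant here is an assumption, not the PR's hash.

```python
def bigram_hash(prev_token, token, table_size=2048):
    # Hash the ordered token pair into a bucket in [0, table_size);
    # the bucket id would index an embedding table added to the
    # usual token embedding.
    return (prev_token * 1000003 + token) % table_size
```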
Partial RoPE
Applies partial rotary positional embeddings.
parameters: {"dimensions":"16/64"}
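A sketch of partial RoPE with the listed 16/64 split: only the first 16 of 64 head dimensions are rotated and the rest pass through unchanged. Rotating the leading slice and the base of 10000 are assumptions.

```python
import numpy as np

def partial_rope(x, rotary_dims=16, base=10000.0):
    # x: (seq_len, head_dim). Rotate the first `rotary_dims` dims with
    # standard rotary embeddings; leave the remaining dims untouched.
    seq_len, head_dim = x.shape
    half = rotary_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(seq_len), inv_freq)   # (seq_len, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1 = x[:, :half]
    x2 = x[:, half:rotary_dims]
    rest = x[:, rotary_dims:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos,
                           rest], axis=1)
```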
MLP3x
Uses 3x MLP width.
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.9985}
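EMA weight averaging with the listed decay can be sketched as a per-parameter exponential moving average (the dict-of-tensors form is an assumption about how parameters are stored):

```python
def ema_update(avg_params, params, decay=0.9985):
    # avg <- decay * avg + (1 - decay) * current, applied per parameter.
    # The averaged copy, not the live weights, is used for evaluation.
    return {k: decay * avg_params[k] + (1.0 - decay) * params[k]
            for k in avg_params}
```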
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002}
Muon
weight_decay: null
momentum: null
other_params: null
Adam
weight_decay: null
momentum: null
other_params: {"split":true}
Quantization
int6
bits: 6
scope: per-row
Compression
lzma
level: null
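The two stages above can be sketched together: symmetric 6-bit per-row quantization (integer levels in [-31, 31] with one scale per row) followed by LZMA over the code bytes. Storing codes in int8 rather than bit-packing them, and the symmetric scheme itself, are assumptions.

```python
import lzma
import numpy as np

def quantize_int6_per_row(w):
    # One scale per row so that the row's max magnitude maps to level 31.
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

w = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)
q, scale = quantize_int6_per_row(w)
blob = lzma.compress(q.tobytes())        # entropy-code the int6 codes
recon = q.astype(np.float32) * scale     # dequantize for inference
```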
Test-Time Training
score-first TTT
parameters: {"epochs":3,"chunk_size":32000,"stride":64,"optimizer":"SGD"}
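The "score-first" ordering can be sketched with a toy model (TinyModel is a hypothetical stand-in for the network, and the chunk stride handling is omitted): each chunk is scored with the current weights before any SGD step on it, so evaluation never uses a model already adapted on the tokens being scored.

```python
class TinyModel:
    """Toy stand-in: predicts a running mean; 'loss' is squared error."""
    def __init__(self):
        self.mu = 0.0
    def score(self, chunk):
        return sum((x - self.mu) ** 2 for x in chunk)
    def sgd_step(self, chunk, lr=0.002):
        grad = sum(2 * (self.mu - x) for x in chunk) / len(chunk)
        self.mu -= lr * grad

def score_first_ttt(model, chunks, epochs=3, lr=0.002):
    total_loss = 0.0
    for chunk in chunks:
        total_loss += model.score(chunk)   # score BEFORE any update
        for _ in range(epochs):
            model.sgd_step(chunk, lr=lr)   # then adapt on the same chunk
    return total_loss
```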
Evaluation
n-gram backoff cache
parameters: {"orders":[2,3,4,5,6,7,8,9],"backoff":"highest matching order","smoothing":"Laplace add-1","entropy_adaptive_alpha":true}
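A minimal sketch of the eval-time cache, assuming online counts over previously scored tokens: prediction backs off to the highest order whose context has been seen, with Laplace add-1 smoothing within that order. The byte-level vocab_size and the uniform fallback when no context matches are assumptions; blending with the neural model is omitted here.

```python
from collections import defaultdict

class NGramBackoffCache:
    def __init__(self, orders=range(2, 10), vocab_size=256):
        self.orders = sorted(orders, reverse=True)  # try highest first
        self.vocab_size = vocab_size
        # counts[n][context][token]: context is the preceding n-1 tokens
        self.counts = {n: defaultdict(lambda: defaultdict(int))
                       for n in self.orders}
        self.history = []

    def prob(self, token):
        for n in self.orders:
            if len(self.history) < n - 1:
                continue
            ctx = tuple(self.history[-(n - 1):])
            if ctx in self.counts[n]:
                c = self.counts[n][ctx]
                total = sum(c.values())
                return (c[token] + 1) / (total + self.vocab_size)  # add-1
        return 1.0 / self.vocab_size  # no matching context yet

    def update(self, token):
        for n in self.orders:
            if len(self.history) >= n - 1:
                ctx = tuple(self.history[-(n - 1):])
                self.counts[n][ctx][token] += 1
        self.history.append(token)
```

Note that `prob` is called before `update` for each token, matching the score-first discipline of the rest of the submission.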
Regularization
LN scale
parameters: null
Novel Contributions
- LeakyReLU squared architecture with gated attention and value residuals
- Rule-legal score-first test-time training that scores each token before any update on it
- Eval-time multi-order n-gram backoff cache with Laplace smoothing
- Entropy-adaptive alpha scaling for blending neural and n-gram probabilities
- Int6 per-row quantization with LZMA compression to fit the 16MB limit
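The entropy-adaptive blending of neural and n-gram probabilities can be sketched as follows: the mixing weight alpha grows with the neural model's normalized entropy, so the cache gets more weight where the model is uncertain. The linear schedule and the alpha_max cap are assumptions.

```python
import math

def blend(p_model, p_cache, alpha_max=0.5):
    # Normalized entropy of the neural distribution, in [0, 1].
    h = -sum(p * math.log(p) for p in p_model if p > 0)
    h_norm = h / math.log(len(p_model))
    alpha = alpha_max * h_norm
    # Convex combination keeps the result a valid distribution.
    return [(1 - alpha) * pm + alpha * pc
            for pm, pc in zip(p_model, p_cache)]
```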