PR #826
Record: Order-9 N-gram Backoff + Score-First TTT + GPTQ-Int5 (0.2951 BPB)
by himanshudongre
val_bpb
0.2951
Architecture
11-layer Transformer-like model with 512d, GQA 8/4, MLP 3.0x, BigramHash, SmearGate, XSA, Partial RoPE, LN Scale, U-Net skips, VE128
Optimizer
Muon
Artifact Size
~13.4 MB
Training Techniques
Architecture
BigramHash
Adds hashed bigram features with projected embeddings.
parameters: {"buckets":4096,"dim":128}
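A minimal numpy sketch of the hashed-bigram lookup using the listed buckets=4096 and dim=128; the hash function and initialization are assumptions (the real table is learned and its output projected into the residual stream):

```python
import numpy as np

BUCKETS, DIM = 4096, 128  # parameters listed above

rng = np.random.default_rng(0)
bigram_table = rng.normal(0.0, 0.02, size=(BUCKETS, DIM))  # learned in training

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Mix both token ids into one bucket index (hash choice is an assumption).
    return ((prev_tok * 1_000_003) ^ cur_tok) % BUCKETS

def bigram_features(tokens: list) -> np.ndarray:
    # One hashed-bigram embedding per position; position 0 has no predecessor.
    out = np.zeros((len(tokens), DIM))
    for i in range(1, len(tokens)):
        out[i] = bigram_table[bigram_bucket(tokens[i - 1], tokens[i])]
    return out
```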
SmearGate
Learned gate blending current and previous token embeddings.
parameters: null
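A sketch of one plausible gate, assuming a per-dimension sigmoid gate (the PR lists no parameters for SmearGate):

```python
import numpy as np

def smear_gate(x: np.ndarray, gate_logit: np.ndarray) -> np.ndarray:
    """Blend each token embedding with its predecessor through a learned
    sigmoid gate. A per-dimension gate is an assumption; the PR lists no shape."""
    g = 1.0 / (1.0 + np.exp(-gate_logit))  # sigmoid keeps the blend in (0, 1)
    prev = np.vstack([x[:1], x[:-1]])      # shift right; token 0 blends with itself
    return g * x + (1.0 - g) * prev
```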
XSA
Exclusive self-attention applied to the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Rotary positional embeddings applied to a subset of dimensions.
parameters: {"dims":"16/64"}
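With dims 16/64, the rotary embedding touches only the first 16 of each 64-dim head, leaving the remaining dimensions position-agnostic; a sketch (the pairing convention and base are assumptions):

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dims: int = 16, base: float = 10000.0) -> np.ndarray:
    # x: (seq, head_dim). Rotate only the first rot_dims dimensions.
    seq, head_dim = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.arange(seq)[:, None] * inv_freq  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```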
GQA
Grouped-query attention with fewer KV heads than query heads.
parameters: {"query_heads":8,"kv_heads":4}
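A minimal causal-attention sketch with the listed 8 query / 4 KV heads, where each KV head is shared by a group of 2 query heads:

```python
import numpy as np

def gqa(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Causal grouped-query attention.
    q: (seq, 8, head_dim); k, v: (seq, 4, head_dim)."""
    seq, n_q, hd = q.shape
    group = n_q // k.shape[1]        # 8 // 4 = 2 query heads per KV head
    k = np.repeat(k, group, axis=1)  # expand KV heads to match the query heads
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(hd)
    causal_mask = np.triu(np.ones((seq, seq), dtype=bool), 1)
    scores = np.where(causal_mask, -1e30, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,khd->qhd", w, v)
```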
U-Net skips
Learned skip connections between encoder and decoder halves.
parameters: null
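One plausible reading of the skip pattern, assuming LIFO pairing between the two halves of the layer stack and learned scalar mix weights (both assumptions; the PR lists no parameters):

```python
def unet_skip_stack(x, blocks, skip_weights):
    """Run 2n blocks with U-Net-style skips: outputs of the first n blocks
    are pushed on a stack and blended into the inputs of the mirrored
    last-n blocks via learned scalar weights."""
    n = len(blocks) // 2
    saved = []
    for i, block in enumerate(blocks):
        if i >= n:
            x = x + skip_weights[i - n] * saved.pop()  # mirror pairing via LIFO pop
        x = block(x)
        if i < n:
            saved.append(x)
    return x
```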
Value Embeddings
Value embeddings used in later layers.
parameters: {"layers":[9,10],"dim":128}
LeakyReLU(0.9)^2
Squared LeakyReLU activation with negative slope 0.9 used in the MLP.
parameters: {"slope":0.9}
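Reading the "^2" as an elementwise square (the ReLU^2 convention, which is an assumption here), the activation is:

```python
import numpy as np

def leaky_relu_sq(x: np.ndarray, slope: float = 0.9) -> np.ndarray:
    # LeakyReLU with negative slope 0.9, followed by an elementwise square.
    y = np.where(x >= 0.0, x, slope * x)
    return y * y
```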
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"banking":true,"ns5_steps":true}
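`ns5_steps` presumably refers to the 5-step quintic Newton-Schulz iteration Muon uses to orthogonalize each update matrix; a sketch with the standard Muon coefficients (the `banking` flag is not expanded here):

```python
import numpy as np

def newton_schulz5(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximate the orthogonal factor UV^T of G with the quintic
    Newton-Schulz iteration used by Muon (standard coefficients)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius normalization bounds the spectrum
    flip = X.shape[0] > X.shape[1]
    if flip:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if flip else X
```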
AdamW
weight_decay: 0.04
momentum: null
other_params: {"applied_to":"embeddings","learning_rate":0.035}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: null
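The EMA with decay 0.997 is the usual per-tensor exponential moving average of the weights (how it combines with the listed SWA is not specified):

```python
def ema_update(avg: dict, params: dict, decay: float = 0.997) -> dict:
    # Exponential moving average of the weights: avg <- d*avg + (1-d)*params.
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in avg}
```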
Compression
lzma
level: null
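A minimal round-trip of the LZMA packing step; the level is unlisted, so `preset=9` is an assumption:

```python
import lzma

raw = bytes(range(256)) * 64         # stand-in for quantized weight bytes
blob = lzma.compress(raw, preset=9)  # highest preset; actual level unknown
restored = lzma.decompress(blob)     # lossless round-trip
```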
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
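With stride 64 and seq_len 2048, a plausible window schedule (the exact edge handling is an assumption) scores only the trailing 64 tokens of each window after the first, so every scored token sees near-full left context:

```python
def sliding_eval_windows(n_tokens: int, seq_len: int = 2048, stride: int = 64):
    """Window schedule for sliding-window eval: the first window scores all
    positions; each later window advances by `stride` and scores only its
    final `stride` tokens."""
    windows = [(0, seq_len, 0)]  # (start, end, first_scored_position)
    for start in range(stride, n_tokens - seq_len + 1, stride):
        windows.append((start, start + seq_len, start + seq_len - stride))
    return windows
```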
Test-Time Training
score-first TTT
parameters: {"rank":8,"learning_rate":0.01,"chunk_size":2048,"epochs_per_chunk":3,"polyak_decay":0.998,"temperature":0.98}
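The score-first ordering can be sketched as a loop that evaluates each chunk before adapting on it; the rank-8 LoRA updates, Polyak averaging, and temperature from the listed parameters are abstracted into `score_fn`/`train_fn` (hypothetical callbacks):

```python
def score_first_ttt(chunks, score_fn, train_fn, epochs_per_chunk: int = 3):
    """Score-first TTT loop: each chunk is scored BEFORE any updates on it,
    so evaluation only ever uses weights adapted to previous chunks."""
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        loss = score_fn(chunk)             # evaluate first (backward-looking)
        total_loss += loss * len(chunk)
        total_tokens += len(chunk)
        for _ in range(epochs_per_chunk):  # then adapt on the scored chunk
            train_fn(chunk)
    return total_loss / total_tokens
```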
Quantization
GPTQ
bits: 5
scope: full model
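For scale, a plain round-to-nearest symmetric int5 quantizer; GPTQ proper adds Hessian-based error compensation on top of this grid, which is omitted here:

```python
import numpy as np

def quantize_int5(w: np.ndarray):
    # Symmetric per-row 5-bit quantization to the signed range [-16, 15].
    scale = np.abs(w).max(axis=1, keepdims=True) / 15.0
    q = np.clip(np.round(w / scale), -16, 15).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float64) * scale
```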
Initialization
OrthoInit
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
cosine decay
parameters: null
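A sketch of the warmdown leg with the listed 3500 steps; how it composes with the cosine decay is not specified, so the constant-then-linear form is an assumption:

```python
def lr_scale(step: int, total_steps: int, warmdown_steps: int = 3500) -> float:
    # Constant LR followed by a linear warmdown over the final 3500 steps.
    if step < total_steps - warmdown_steps:
        return 1.0
    return (total_steps - step) / warmdown_steps
```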
Regularization
weight decay
parameters: {"value":0.04}
layerwise LN scale
parameters: null
Other
other
Order-9 n-gram backoff evaluation cache with entropy-adaptive interpolation and score-first backward-looking updates.
parameters: {"orders":[2,9],"buckets_per_order":4194304,"alpha_range":[0.05,0.6],"entropy_center":3,"chunk_size":1000000}
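A dict-based sketch of the backoff cache and the entropy-adaptive weight. The real cache hashes each context into 4,194,304 buckets per order, and the sigmoid shape and direction of the interpolation are assumptions (only the range [0.05, 0.6] and center 3 are listed):

```python
import math
from collections import defaultdict

class NgramBackoffCache:
    """Backoff n-gram cache over orders 2..9: predict from the longest
    matching context, backing off to shorter orders on a miss."""

    def __init__(self, max_order: int = 9, min_order: int = 2):
        self.orders = list(range(max_order, min_order - 1, -1))
        self.counts = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def update(self, context: list, token: int) -> None:
        for n in self.orders:
            if len(context) >= n - 1:
                self.counts[n][tuple(context[-(n - 1):])][token] += 1

    def predict(self, context: list):
        for n in self.orders:  # longest order first
            if len(context) >= n - 1:
                dist = self.counts[n].get(tuple(context[-(n - 1):]))
                if dist:
                    total = sum(dist.values())
                    return {t: c / total for t, c in dist.items()}
        return None

def interp_alpha(model_entropy: float, alpha_range=(0.05, 0.6),
                 entropy_center: float = 3.0) -> float:
    # Entropy-adaptive mixing weight for the cache, moving smoothly through
    # the listed range around the entropy center (direction is an assumption).
    lo, hi = alpha_range
    t = 1.0 / (1.0 + math.exp(model_entropy - entropy_center))
    return lo + (hi - lo) * t
```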
other
Perplexity-ranked shard ordering curriculum for training.
parameters: null
Novel Contributions
- Order-9 n-gram backoff evaluation cache with entropy-adaptive interpolation
- Score-first test-time training with LoRA on Q, V, and LM head
- GPTQ int5 full-Hessian quantization with LZMA compression
- Perplexity-ranked shard ordering curriculum
- LeakyReLU(0.9)^2 MLP variant with frontier_lean architecture stack