PR #826
Record: Order-9 N-gram Backoff + Score-First TTT + GPTQ-Int5 (0.2951 BPB)
by himanshudongre
val_bpb
0.2951
Architecture
11-layer Transformer-like model with 512d, GQA 8/4, MLP 3.0x, BigramHash, SmearGate, XSA, Partial RoPE, LN Scale, U-Net skips, VE128
Optimizer
Muon
Artifact Size
~13.4 MB
Training Techniques
Architecture
BigramHash
Adds hashed bigram features with projected embeddings.
parameters: {"buckets":4096,"dim":128}
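A minimal numpy sketch of the hashed-bigram lookup using the listed buckets=4096 and dim=128; the hash function and initialization are assumptions (the real table is learned and its output projected into the residual stream):

```python
import numpy as np

BUCKETS, DIM = 4096, 128  # parameters listed above

rng = np.random.default_rng(0)
bigram_table = rng.normal(0.0, 0.02, size=(BUCKETS, DIM))  # learned in training

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Mix both token ids into one bucket index (hash choice is an assumption).
    return ((prev_tok * 1_000_003) ^ cur_tok) % BUCKETS

def bigram_features(tokens: list) -> np.ndarray:
    # One hashed-bigram embedding per position; position 0 has no predecessor.
    out = np.zeros((len(tokens), DIM))
    for i in range(1, len(tokens)):
        out[i] = bigram_table[bigram_bucket(tokens[i - 1], tokens[i])]
    return out
```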
SmearGate
Learned gate blending current and previous token embeddings.
parameters: null
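A sketch of one plausible gate, assuming a per-dimension sigmoid gate (the PR lists no parameters for SmearGate):

```python
import numpy as np

def smear_gate(x: np.ndarray, gate_logit: np.ndarray) -> np.ndarray:
    """Blend each token embedding with its predecessor through a learned
    sigmoid gate. A per-dimension gate is an assumption; the PR lists no shape."""
    g = 1.0 / (1.0 + np.exp(-gate_logit))  # sigmoid keeps the blend in (0, 1)
    prev = np.vstack([x[:1], x[:-1]])      # shift right; token 0 blends with itself
    return g * x + (1.0 - g) * prev
```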
XSA
Exclusive self-attention applied to the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Rotary positional embeddings applied to a subset of dimensions.
parameters: {"dims":"16/64"}
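With dims 16/64, the rotary embedding touches only the first 16 of each 64-dim head, leaving the remaining dimensions position-agnostic; a sketch (the pairing convention and base are assumptions):

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dims: int = 16, base: float = 10000.0) -> np.ndarray:
    # x: (seq, head_dim). Rotate only the first rot_dims dimensions.
    seq, head_dim = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.arange(seq)[:, None] * inv_freq  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```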
GQA
Grouped-query attention with fewer KV heads than query heads.
parameters: {"query_heads":8,"kv_heads":4}
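A minimal causal-attention sketch with the listed 8 query / 4 KV heads, where each KV head is shared by a group of 2 query heads:

```python
import numpy as np

def gqa(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Causal grouped-query attention.
    q: (seq, 8, head_dim); k, v: (seq, 4, head_dim)."""
    seq, n_q, hd = q.shape
    group = n_q // k.shape[1]        # 8 // 4 = 2 query heads per KV head
    k = np.repeat(k, group, axis=1)  # expand KV heads to match the query heads
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(hd)
    causal_mask = np.triu(np.ones((seq, seq), dtype=bool), 1)
    scores = np.where(causal_mask, -1e30, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,khd->qhd", w, v)
```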
U-Net skips
Learned skip connections between encoder and decoder halves.
parameters: null
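One plausible reading of the skip pattern, assuming LIFO pairing between the two halves of the layer stack and learned scalar mix weights (both assumptions; the PR lists no parameters):

```python
def unet_skip_stack(x, blocks, skip_weights):
    """Run 2n blocks with U-Net-style skips: outputs of the first n blocks
    are pushed on a stack and blended into the inputs of the mirrored
    last-n blocks via learned scalar weights."""
    n = len(blocks) // 2
    saved = []
    for i, block in enumerate(blocks):
        if i >= n:
            x = x + skip_weights[i - n] * saved.pop()  # mirror pairing via LIFO pop
        x = block(x)
        if i < n:
            saved.append(x)
    return x
```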
Value Embeddings
Value embeddings used in later layers.
parameters: {"layers":[9,10],"dim":128}
LeakyReLU(0.9)^2
Squared LeakyReLU activation with negative slope 0.9 used in the MLP.
parameters: {"slope":0.9}
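Reading the "^2" as an elementwise square (the ReLU^2 convention, which is an assumption here), the activation is:

```python
import numpy as np

def leaky_relu_sq(x: np.ndarray, slope: float = 0.9) -> np.ndarray:
    # LeakyReLU with negative slope 0.9, followed by an elementwise square.
    y = np.where(x >= 0.0, x, slope * x)
    return y * y
```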
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"banking":true,"ns5_steps":true}
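`ns5_steps` presumably refers to the 5-step quintic Newton-Schulz iteration Muon uses to orthogonalize each update matrix; a sketch with the standard Muon coefficients (the `banking` flag is not expanded here):

```python
import numpy as np

def newton_schulz5(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximate the orthogonal factor UV^T of G with the quintic
    Newton-Schulz iteration used by Muon (standard coefficients)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius normalization bounds the spectrum
    flip = X.shape[0] > X.shape[1]
    if flip:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if flip else X
```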
AdamW
weight_decay: 0.04
momentum: null
other_params: {"applied_to":"embeddings","learning_rate":0.035}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: null
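The EMA with decay 0.997 is the usual per-tensor exponential moving average of the weights (how it combines with the listed SWA is not specified):

```python
def ema_update(avg: dict, params: dict, decay: float = 0.997) -> dict:
    # Exponential moving average of the weights: avg <- d*avg + (1-d)*params.
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in avg}
```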
Compression
lzma
level: null
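A minimal round-trip of the LZMA packing step; the level is unlisted, so `preset=9` is an assumption:

```python
import lzma

raw = bytes(range(256)) * 64         # stand-in for quantized weight bytes
blob = lzma.compress(raw, preset=9)  # highest preset; actual level unknown
restored = lzma.decompress(blob)     # lossless round-trip
```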
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
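With stride 64 and seq_len 2048, a plausible window schedule (the exact edge handling is an assumption) scores only the trailing 64 tokens of each window after the first, so every scored token sees near-full left context:

```python
def sliding_eval_windows(n_tokens: int, seq_len: int = 2048, stride: int = 64):
    """Window schedule for sliding-window eval: the first window scores all
    positions; each later window advances by `stride` and scores only its
    final `stride` tokens."""
    windows = [(0, seq_len, 0)]  # (start, end, first_scored_position)
    for start in range(stride, n_tokens - seq_len + 1, stride):
        windows.append((start, start + seq_len, start + seq_len - stride))
    return windows
```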
Test-Time Training
score-first TTT
parameters: {"rank":8,"learning_rate":0.01,"chunk_size":2048,"epochs_per_chunk":3,"polyak_decay":0.998,"temperature":0.98}
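The score-first ordering can be sketched as a loop that evaluates each chunk before adapting on it; the rank-8 LoRA updates, Polyak averaging, and temperature from the listed parameters are abstracted into `score_fn`/`train_fn` (hypothetical callbacks):

```python
def score_first_ttt(chunks, score_fn, train_fn, epochs_per_chunk: int = 3):
    """Score-first TTT loop: each chunk is scored BEFORE any updates on it,
    so evaluation only ever uses weights adapted to previous chunks."""
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        loss = score_fn(chunk)             # evaluate first (backward-looking)
        total_loss += loss * len(chunk)
        total_tokens += len(chunk)
        for _ in range(epochs_per_chunk):  # then adapt on the scored chunk
            train_fn(chunk)
    return total_loss / total_tokens
```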
Quantization
GPTQ
bits: 5
scope: full model
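For scale, a plain round-to-nearest symmetric int5 quantizer; GPTQ proper adds Hessian-based error compensation on top of this grid, which is omitted here:

```python
import numpy as np

def quantize_int5(w: np.ndarray):
    # Symmetric per-row 5-bit quantization to the signed range [-16, 15].
    scale = np.abs(w).max(axis=1, keepdims=True) / 15.0
    q = np.clip(np.round(w / scale), -16, 15).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float64) * scale
```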
Initialization
OrthoInit
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
cosine decay
parameters: null
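A sketch of the warmdown leg with the listed 3500 steps; how it composes with the cosine decay is not specified, so the constant-then-linear form is an assumption:

```python
def lr_scale(step: int, total_steps: int, warmdown_steps: int = 3500) -> float:
    # Constant LR followed by a linear warmdown over the final 3500 steps.
    if step < total_steps - warmdown_steps:
        return 1.0
    return (total_steps - step) / warmdown_steps
```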
Regularization
weight decay
parameters: {"value":0.04}
layerwise LN scale
parameters: null
Other
other
Order-9 n-gram backoff evaluation cache with entropy-adaptive interpolation and score-first backward-looking updates.
parameters: {"orders":[2,9],"buckets_per_order":4194304,"alpha_range":[0.05,0.6],"entropy_center":3,"chunk_size":1000000}
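A dict-based sketch of the backoff cache and the entropy-adaptive weight. The real cache hashes each context into 4,194,304 buckets per order, and the sigmoid shape and direction of the interpolation are assumptions (only the range [0.05, 0.6] and center 3 are listed):

```python
import math
from collections import defaultdict

class NgramBackoffCache:
    """Backoff n-gram cache over orders 2..9: predict from the longest
    matching context, backing off to shorter orders on a miss."""

    def __init__(self, max_order: int = 9, min_order: int = 2):
        self.orders = list(range(max_order, min_order - 1, -1))
        self.counts = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def update(self, context: list, token: int) -> None:
        for n in self.orders:
            if len(context) >= n - 1:
                self.counts[n][tuple(context[-(n - 1):])][token] += 1

    def predict(self, context: list):
        for n in self.orders:  # longest order first
            if len(context) >= n - 1:
                dist = self.counts[n].get(tuple(context[-(n - 1):]))
                if dist:
                    total = sum(dist.values())
                    return {t: c / total for t, c in dist.items()}
        return None

def interp_alpha(model_entropy: float, alpha_range=(0.05, 0.6),
                 entropy_center: float = 3.0) -> float:
    # Entropy-adaptive mixing weight for the cache, moving smoothly through
    # the listed range around the entropy center (direction is an assumption).
    lo, hi = alpha_range
    t = 1.0 / (1.0 + math.exp(model_entropy - entropy_center))
    return lo + (hi - lo) * t
```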
other
Perplexity-ranked shard ordering curriculum for training.
parameters: null
Novel Contributions
- Order-9 n-gram backoff evaluation cache with entropy-adaptive interpolation
- Score-first test-time training with LoRA on Q, V, and LM head
- GPTQ int5 full-Hessian quantization with LZMA compression
- Perplexity-ranked shard ordering curriculum
- LeakyReLU(0.9)^2 MLP variant with frontier_lean architecture stack