PR #808

open

Record: 0.6364 BPB - Depth Recurrence + Multi-Order N-gram Backoff

by Naazimsnh02
val_bpb
0.6364
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.95 MB

Training Techniques

Architecture
depth recurrence
Repeats layers 4 and 5 to expand 11 physical layers into 13 virtual layers without adding parameters.
parameters: {"layers":[4,5],"virtual_layers":13,"physical_layers":11}
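A minimal sketch of the depth-recurrence schedule, assuming the repeated span is replayed once immediately after its first pass (the entry only states that layers 4 and 5 are repeated to reach 13 virtual layers from 11 physical ones):

```python
# Hypothetical sketch: expand 11 physical layers into 13 virtual layers by
# visiting layers 4 and 5 a second time. No new parameters are introduced;
# the same blocks are simply called again.

def build_schedule(physical_layers, recur_span):
    """Return the virtual-layer visit order, replaying `recur_span` once."""
    schedule = []
    for i in range(physical_layers):
        schedule.append(i)
        if i == recur_span[-1]:
            schedule.extend(recur_span)  # second pass over the recurrent span
    return schedule

def forward(x, blocks, schedule):
    for i in schedule:
        x = blocks[i](x)  # repeated indices reuse the same weights
    return x
```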
BigramHash
Hash-table n-gram component used for eval-time backoff scoring.
parameters: {"vocab_size":2048}
XSA
XSA applied to the last 4 layers of the model.
parameters: {"last_n_layers":4}
Partial RoPE
Applies rotary position embeddings to only 16 of the head dimensions.
parameters: {"dimensions":16}
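A pure-Python sketch of partial RoPE as configured above: rotate only the first 16 dimensions of each head vector and pass the rest through unchanged. Which 16 dimensions are rotated is an assumption; the entry only gives the count.

```python
import math

def partial_rope(x, pos, rope_dims=16, base=10000.0):
    """Apply rotary embedding to the first `rope_dims` entries of the flat
    per-head vector `x` at position `pos`; remaining dims pass through."""
    out = list(x)
    for i in range(rope_dims // 2):
        theta = pos * base ** (-2 * i / rope_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```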
MLP3x
3x MLP stack with a squared LeakyReLU (negative slope 0.5) activation.
parameters: null
SmearGate
Gating mechanism used alongside BigramHash.
parameters: null
ValueEmbedding
Value embeddings added at selected layers.
parameters: {"layers":[9,10],"dimension":128}
tied embeddings
Uses tied embedding weights.
parameters: null
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
Adam
weight_decay: 0.04
momentum: null
other_params: {"used_for":"TTT LoRA adapters"}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"every":50}
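A sketch of the two averaging schemes listed above, with weights reduced to single floats for illustration: an exponential moving average with decay 0.997, and an equal-weight (SWA-style) average sampled every 50 steps.

```python
def ema_update(avg, w, decay=0.997):
    """One EMA step: keep 99.7% of the running average, mix in 0.3% of w."""
    return decay * avg + (1.0 - decay) * w

class SWA:
    """Equal-weight running average, sampled every `every` optimizer steps."""
    def __init__(self, every=50):
        self.every, self.n, self.avg = every, 0, None

    def maybe_update(self, step, w):
        if step % self.every == 0:
            self.n += 1
            # incremental mean over the sampled checkpoints
            self.avg = w if self.avg is None else self.avg + (w - self.avg) / self.n
```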
Compression
lzma
level: null
Evaluation
stride-based eval
parameters: {"stride":64}
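Stride-based eval scores the sequence in overlapping windows, counting loss only on the last `stride` tokens of each window so every token is scored once with long left context. A sketch of the window bookkeeping, assuming a fixed context window (the context length here is illustrative):

```python
def eval_windows(num_tokens, context=1024, stride=64):
    """Yield (window_start, window_end, score_start) triples so that every
    token is scored exactly once, with up to `context` tokens of left context."""
    spans = []
    pos = 0
    while pos < num_tokens:
        end = min(pos + stride, num_tokens)
        start = max(0, end - context)
        spans.append((start, end, pos))  # score tokens in [pos, end)
        pos = end
    return spans
```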
multi-order n-gram backoff
parameters: {"orders":[2,3,4,5,6,7],"entropy_adaptive_alpha":true}
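The multi-order backoff can be sketched as a highest-order-first cascade over orders 2–7: look the context up at the longest order first and fall back one order on a miss. The table layout is an assumption; the entry describes a hash-table n-gram component.

```python
def backoff_score(tables, context, orders=(7, 6, 5, 4, 3, 2)):
    """`tables[n]` maps an (n-1)-token context tuple to a next-token
    distribution; return the first hit, longest order first."""
    for n in orders:
        key = tuple(context[-(n - 1):])
        if len(key) == n - 1 and key in tables[n]:
            return tables[n][key], n
    return None, 0   # full miss: caller falls back to the neural model alone
```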
multi-GPU n-gram prefill
parameters: {"num_gpus":8}
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01,"epochs":3,"chunk_tokens":32768}
Regularization
layerwise LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":1500,"warmdown_iters":3500}
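A sketch of the schedule above as a trapezoid: linear warmup over 1500 steps, a flat plateau, then a warmdown over the final 3500 iterations. The linear ramp shapes are an assumption; the entry gives only the two span lengths.

```python
def lr_scale(step, total_steps, warmup_steps=1500, warmdown_iters=3500):
    """Multiplier on the base learning rate at `step` (trapezoidal shape)."""
    if step < warmup_steps:
        return step / warmup_steps                     # linear warmup
    if step > total_steps - warmdown_iters:
        return (total_steps - step) / warmdown_iters   # linear warmdown
    return 1.0                                         # flat plateau
```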

Novel Contributions

  • Multi-order n-gram backoff over orders 2-7 with highest-order-first cascading on misses
  • Entropy-adaptive alpha that shifts trust between the neural model and n-gram backoff based on uncertainty
  • Multi-GPU n-gram prefill to avoid hash-table fragmentation across ranks
  • Depth recurrence on layers 4 and 5 to create 13 virtual layers from 11 physical layers at zero parameter cost
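The entropy-adaptive alpha can be sketched as follows, under the assumption that the blend weight shifts toward the n-gram backoff as the neural distribution's entropy rises (i.e. when the model is uncertain); the exact entropy-to-alpha mapping and the cap are illustrative, not taken from the submission:

```python
import math

def mix(neural_probs, ngram_probs, max_alpha=0.5):
    """Blend neural and n-gram distributions with an entropy-dependent weight.
    alpha rises toward `max_alpha` (an assumed cap) as the neural
    distribution's normalized entropy rises toward 1 (uniform)."""
    h = -sum(p * math.log(p) for p in neural_probs if p > 0.0)
    h_norm = h / math.log(len(neural_probs))   # normalized entropy in [0, 1]
    alpha = max_alpha * h_norm                 # linear mapping: an assumption
    mixed = [(1 - alpha) * p + alpha * q
             for p, q in zip(neural_probs, ngram_probs)]
    return mixed, alpha
```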