PR #810

open

Record: EMA-GPU + Multi-Order N-gram Backoff + PE Confidence (val_bpb=0.9393)

by Idan3011
val_bpb
0.9393
Architecture
Transformer
Optimizer
Muon + AdamW
Artifact Size
14.94 MB

Training Techniques

Quantization
int6 QAT
bits: 6
scope: all
Architecture
XSA
Exclusive Self Attention on the last 4 layers to remove self-value bias via orthogonal projection.
parameters: {"layers":4}
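One plausible reading of the XSA mechanism, assumed rather than confirmed by the PR: subtract from each token's attention output its component along that token's own value vector, i.e. an orthogonal projection that removes the self-value contribution.

```python
import torch

def remove_self_value(attn_out, v_self, eps=1e-8):
    """XSA sketch (mechanics assumed): orthogonally project each token's
    attention output away from that token's own value vector, removing
    the self-value component."""
    # per-token coefficient of attn_out along v_self
    coef = (attn_out * v_self).sum(-1, keepdim=True) / (
        v_self.pow(2).sum(-1, keepdim=True) + eps)
    return attn_out - coef * v_self
```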
SmearGate
Per-dimension gate blending each token with the previous token's embedding.
parameters: null
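A minimal sketch of the gate described above; the sigmoid parameterization and the zero init are assumptions, not from the PR.

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Per-dimension gate blending each token with the previous token's
    embedding."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))  # sigmoid(0) = 0.5: even blend
    def forward(self, x):  # x: (batch, seq, dim)
        prev = torch.roll(x, shifts=1, dims=1)
        prev[:, 0] = x[:, 0]  # first token has no predecessor
        g = torch.sigmoid(self.gate)
        return g * x + (1 - g) * prev
```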
BigramHash
Hash-table embedding for token bigrams projected into model dimension.
parameters: {"dimensions":"2048x128"}
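The 2048x128 table above can be sketched as follows; the mixing hash and the model-dimension default are illustrative, not the PR's.

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Hashed bigram embedding: map each (previous, current) token pair into
    a 2048-entry table of 128-dim vectors, then project to the model dim."""
    def __init__(self, table_size=2048, embed_dim=128, model_dim=512):
        super().__init__()
        self.table_size = table_size
        self.table = nn.Embedding(table_size, embed_dim)
        self.proj = nn.Linear(embed_dim, model_dim, bias=False)
    def forward(self, tokens):  # tokens: (batch, seq) int64
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # pad the first position
        h = (prev * 1000003 + tokens) % self.table_size  # simple mixing hash
        return self.proj(self.table(h))
```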
MLP3x
Wider MLP with 3x hidden expansion.
parameters: {"hidden_size":1536}
tied embeddings
Input and output embeddings are tied.
parameters: null
U-Net skip connections
Encoder-decoder style skip connections with learnable skip weights.
parameters: null
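A minimal form of a learnable skip connection as described above; a single scalar weight per skip is assumed (the PR does not state the granularity).

```python
import torch
import torch.nn as nn

class LearnableSkip(nn.Module):
    """U-Net-style skip: a saved encoder-half activation is added back into
    the mirrored decoder-half layer, scaled by a learnable weight."""
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(1))  # skip starts closed
    def forward(self, x, skip):
        return x + self.weight * skip
```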
GELU pre-enrichment
Wider nonlinear pre-enrichment block before transformer layers: 512→768→512 with GELU.
parameters: {"input_dim":512,"hidden_dim":768,"output_dim":512}
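The listed dimensions give this block directly; a minimal PyTorch rendering (whether it is wrapped in a residual connection is not stated):

```python
import torch
import torch.nn as nn

# Pre-enrichment block with the listed shape: 512 -> 768 -> 512 with GELU.
pre_enrich = nn.Sequential(
    nn.Linear(512, 768),
    nn.GELU(),
    nn.Linear(768, 512),
)
```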
GQA
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
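The 8-query-head / 4-KV-head configuration above can be sketched by repeating each KV head across its group of query heads; causal masking is omitted here for brevity.

```python
import torch

def grouped_query_attention(q, k, v):
    """GQA sketch: query heads share KV heads group-wise.
    q: (batch, 8, seq, head_dim); k, v: (batch, 4, seq, head_dim)."""
    group = q.shape[1] // k.shape[1]  # 8 / 4 = 2 query heads per KV head
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v
```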
Weight Averaging
EMA
parameters: {"decay":0.997}
Compression
lzma
level: 6
Evaluation
sliding window eval
parameters: {"multi_order_backoff":"2-11","entropy_adaptive_alpha":true}
multi-order n-gram backoff
parameters: {"orders":"2-11","score_first":true,"backward_looking":true}
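The two eval-time ideas above can be sketched as follows. The count structure and the exact alpha formula are assumptions; only the order range (2-11) and the entropy-adaptive blending come from the PR.

```python
import math

def backoff_prob(context, token, counts, max_order=11, min_order=2):
    """Multi-order backoff: try the longest matching context (order 11)
    first and fall back toward bigrams. `counts` maps context tuples to
    {token: count} dicts."""
    for order in range(max_order, min_order - 1, -1):
        dist = counts.get(tuple(context[-(order - 1):]))
        if dist and token in dist:
            return dist[token] / sum(dist.values())
    return None  # no n-gram evidence; caller falls back to the model

def entropy_adaptive_alpha(model_probs, alpha_max=0.5):
    """Entropy-adaptive blend weight (form assumed): trust the n-gram more
    when the model's own distribution is high-entropy, i.e. uncertain."""
    entropy = -sum(p * math.log(p) for p in model_probs if p > 0)
    return alpha_max * entropy / math.log(len(model_probs))
```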
Test-Time Training
score-first TTT-like n-gram cache
parameters: {"cache_updated_after_scoring":true,"per_gpu_independent_cache":true}
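The score-first ordering is the key property: a token is scored against the cache as it stood before that token arrived, so it never contributes to its own score. A minimal sketch, with the order and count structure assumed:

```python
from collections import defaultdict

class ScoreFirstCache:
    """Score-first TTT-like n-gram cache; each GPU would keep its own
    independent instance, per the PR."""
    def __init__(self, order=3):
        self.order = order
        self.counts = defaultdict(lambda: defaultdict(int))
    def score_then_update(self, context, token):
        dist = self.counts[tuple(context[-(self.order - 1):])]
        total = sum(dist.values())
        prob = dist[token] / total if total else None  # score first...
        dist[token] += 1                               # ...then update
        return prob
```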
Initialization
overtone init
Initialization method credited to the modded-nanogpt baseline.
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Regularization
weight decay
parameters: {"weight_decay":0.04}
Other
other
EMA state is kept on the GPU throughout training and moved to CPU only at serialization time, eliminating the per-step GPU-to-CPU synchronization and speeding up training.
parameters: {"reported_speedup":"37%"}
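A sketch of the device-placement idea with the listed decay of 0.997; class and method names are illustrative, not the PR's.

```python
import torch

class GpuEma:
    """EMA shadow weights kept on the model's device: each step is an
    in-place, device-local update with no GPU->CPU transfer."""
    def __init__(self, model, decay=0.997):
        self.decay = decay
        self.shadow = [p.detach().clone() for p in model.parameters()]
    @torch.no_grad()
    def update(self, model):
        for s, p in zip(self.shadow, model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1 - self.decay)
    def cpu_state(self):
        # The only GPU->CPU sync, paid once per serialization, not per step.
        return [s.cpu() for s in self.shadow]
```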
other
Pre-enrichment confidence modulation: the magnitude of the pre-enrichment transformation is used as a per-token confidence signal that scales how strongly the n-gram predictions are trusted.
parameters: null
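A sketch of one way to turn that magnitude into a trust weight; the sigmoid mapping is an assumption, as the PR does not state the exact form.

```python
import torch

def ngram_trust(x_in, x_enriched, scale=1.0):
    """Confidence signal sketch: the per-token norm of the change the
    pre-enrichment block makes to an embedding, squashed to a weight."""
    delta = (x_enriched - x_in).norm(dim=-1)  # magnitude of the transform
    return torch.sigmoid(scale * delta)       # per-token n-gram weight in [0.5, 1)
```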

Novel Contributions

  • EMA state kept on GPU during training to avoid per-step GPU-to-CPU synchronization
  • Multi-order n-gram backoff with entropy-adaptive alpha during evaluation
  • Pre-enrichment confidence modulation to adjust n-gram trust
  • GELU pre-enrichment block (512→768→512)
  • XSA on the last 4 layers