PR #810

open

Record: EMA-GPU + Multi-Order N-gram Backoff + PE Confidence (val_bpb=0.9393)

by Idan3011
val_bpb
0.9393
Architecture
Transformer
Optimizer
Muon + AdamW
Artifact Size
14.94 MB

Training Techniques

Quantization
int6 QAT
bits: 6
scope: all
Architecture
XSA
Exclusive Self Attention on the last 4 layers to remove self-value bias via orthogonal projection.
parameters: {"layers":4}
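One plausible reading of the XSA mechanism, assumed rather than confirmed by the PR: subtract from each token's attention output its component along that token's own value vector, i.e. an orthogonal projection that removes the self-value contribution.

```python
import torch

def remove_self_value(attn_out, v_self, eps=1e-8):
    """XSA sketch (mechanics assumed): orthogonally project each token's
    attention output away from that token's own value vector, removing
    the self-value component."""
    # per-token coefficient of attn_out along v_self
    coef = (attn_out * v_self).sum(-1, keepdim=True) / (
        v_self.pow(2).sum(-1, keepdim=True) + eps)
    return attn_out - coef * v_self
```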
SmearGate
Per-dimension gate blending each token with the previous token's embedding.
parameters: null
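A minimal sketch of the gate described above; the sigmoid parameterization and the zero init are assumptions, not from the PR.

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Per-dimension gate blending each token with the previous token's
    embedding."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))  # sigmoid(0) = 0.5: even blend
    def forward(self, x):  # x: (batch, seq, dim)
        prev = torch.roll(x, shifts=1, dims=1)
        prev[:, 0] = x[:, 0]  # first token has no predecessor
        g = torch.sigmoid(self.gate)
        return g * x + (1 - g) * prev
```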
BigramHash
Hash-table embedding for token bigrams projected into model dimension.
parameters: {"dimensions":"2048x128"}
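The 2048x128 table above can be sketched as follows; the mixing hash and the model-dimension default are illustrative, not the PR's.

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Hashed bigram embedding: map each (previous, current) token pair into
    a 2048-entry table of 128-dim vectors, then project to the model dim."""
    def __init__(self, table_size=2048, embed_dim=128, model_dim=512):
        super().__init__()
        self.table_size = table_size
        self.table = nn.Embedding(table_size, embed_dim)
        self.proj = nn.Linear(embed_dim, model_dim, bias=False)
    def forward(self, tokens):  # tokens: (batch, seq) int64
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # pad the first position
        h = (prev * 1000003 + tokens) % self.table_size  # simple mixing hash
        return self.proj(self.table(h))
```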
MLP3x
Wider MLP with 3x hidden expansion.
parameters: {"hidden_size":1536}
tied embeddings
Input and output embeddings are tied.
parameters: null
U-Net skip connections
Encoder-decoder style skip connections with learnable skip weights.
parameters: null
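A minimal form of a learnable skip connection as described above; a single scalar weight per skip is assumed (the PR does not state the granularity).

```python
import torch
import torch.nn as nn

class LearnableSkip(nn.Module):
    """U-Net-style skip: a saved encoder-half activation is added back into
    the mirrored decoder-half layer, scaled by a learnable weight."""
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(1))  # skip starts closed
    def forward(self, x, skip):
        return x + self.weight * skip
```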
GELU pre-enrichment
Wider nonlinear pre-enrichment block before transformer layers: 512→768→512 with GELU.
parameters: {"input_dim":512,"hidden_dim":768,"output_dim":512}
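The listed dimensions give this block directly; a minimal PyTorch rendering (whether it is wrapped in a residual connection is not stated):

```python
import torch
import torch.nn as nn

# Pre-enrichment block with the listed shape: 512 -> 768 -> 512 with GELU.
pre_enrich = nn.Sequential(
    nn.Linear(512, 768),
    nn.GELU(),
    nn.Linear(768, 512),
)
```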
GQA
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
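The 8-query-head / 4-KV-head configuration above can be sketched by repeating each KV head across its group of query heads; causal masking is omitted here for brevity.

```python
import torch

def grouped_query_attention(q, k, v):
    """GQA sketch: query heads share KV heads group-wise.
    q: (batch, 8, seq, head_dim); k, v: (batch, 4, seq, head_dim)."""
    group = q.shape[1] // k.shape[1]  # 8 / 4 = 2 query heads per KV head
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v
```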
Weight Averaging
EMA
parameters: {"decay":0.997}
Compression
lzma
level: 6
Evaluation
sliding window eval
parameters: {"multi_order_backoff":"2-11","entropy_adaptive_alpha":true}
multi-order n-gram backoff
parameters: {"orders":"2-11","score_first":true,"backward_looking":true}
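The two eval-time ideas above can be sketched as follows. The count structure and the exact alpha formula are assumptions; only the order range (2-11) and the entropy-adaptive blending come from the PR.

```python
import math

def backoff_prob(context, token, counts, max_order=11, min_order=2):
    """Multi-order backoff: try the longest matching context (order 11)
    first and fall back toward bigrams. `counts` maps context tuples to
    {token: count} dicts."""
    for order in range(max_order, min_order - 1, -1):
        dist = counts.get(tuple(context[-(order - 1):]))
        if dist and token in dist:
            return dist[token] / sum(dist.values())
    return None  # no n-gram evidence; caller falls back to the model

def entropy_adaptive_alpha(model_probs, alpha_max=0.5):
    """Entropy-adaptive blend weight (form assumed): trust the n-gram more
    when the model's own distribution is high-entropy, i.e. uncertain."""
    entropy = -sum(p * math.log(p) for p in model_probs if p > 0)
    return alpha_max * entropy / math.log(len(model_probs))
```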
Test-Time Training
score-first TTT-like n-gram cache
parameters: {"cache_updated_after_scoring":true,"per_gpu_independent_cache":true}
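The score-first ordering is the key property: a token is scored against the cache as it stood before that token arrived, so it never contributes to its own score. A minimal sketch, with the order and count structure assumed:

```python
from collections import defaultdict

class ScoreFirstCache:
    """Score-first TTT-like n-gram cache; each GPU would keep its own
    independent instance, per the PR."""
    def __init__(self, order=3):
        self.order = order
        self.counts = defaultdict(lambda: defaultdict(int))
    def score_then_update(self, context, token):
        dist = self.counts[tuple(context[-(self.order - 1):])]
        total = sum(dist.values())
        prob = dist[token] / total if total else None  # score first...
        dist[token] += 1                               # ...then update
        return prob
```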
Initialization
overtone init
Initialization method credited to the modded-nanogpt baseline.
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Regularization
weight decay
parameters: {"weight_decay":0.04}
Other
other
EMA state is kept on the GPU throughout training and moved to CPU only at serialization time, eliminating the per-step GPU-to-CPU synchronization and speeding up training.
parameters: {"reported_speedup":"37%"}
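A sketch of the device-placement idea with the listed decay of 0.997; class and method names are illustrative, not the PR's.

```python
import torch

class GpuEma:
    """EMA shadow weights kept on the model's device: each step is an
    in-place, device-local update with no GPU->CPU transfer."""
    def __init__(self, model, decay=0.997):
        self.decay = decay
        self.shadow = [p.detach().clone() for p in model.parameters()]
    @torch.no_grad()
    def update(self, model):
        for s, p in zip(self.shadow, model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1 - self.decay)
    def cpu_state(self):
        # The only GPU->CPU sync, paid once per serialization, not per step.
        return [s.cpu() for s in self.shadow]
```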
other
Pre-enrichment confidence modulation: the magnitude of the pre-enrichment transformation is used as a per-token confidence signal that scales how strongly the n-gram predictions are trusted.
parameters: null
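A sketch of one way to turn that magnitude into a trust weight; the sigmoid mapping is an assumption, as the PR does not state the exact form.

```python
import torch

def ngram_trust(x_in, x_enriched, scale=1.0):
    """Confidence signal sketch: the per-token norm of the change the
    pre-enrichment block makes to an embedding, squashed to a weight."""
    delta = (x_enriched - x_in).norm(dim=-1)  # magnitude of the transform
    return torch.sigmoid(scale * delta)       # per-token n-gram weight in [0.5, 1)
```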

Novel Contributions

  • EMA state kept on GPU during training to avoid per-step GPU-to-CPU synchronization
  • Multi-order n-gram backoff with entropy-adaptive alpha during evaluation
  • Pre-enrichment confidence modulation to adjust n-gram trust
  • GELU pre-enrichment block (512→768→512)
  • XSA on the last 4 layers