PR #996 (open)

Pre-Enrichment + EMA-GPU + SmearGate + XSA4 (val_bpb=1.1478, …

by Idan3011
val_bpb: 1.1478
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.94 MB

Training Techniques

Weight Averaging
EMA
parameters: {"decay":0.997}
Architecture
SmearGate
Per-dimension gate blending each token with the previous token.
parameters: null
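A minimal sketch of a per-dimension smear gate. The PR only states that each token is blended with the previous token through a per-dimension gate; the convex sigmoid blend and the `gate_logits` parameterization below are assumptions.

```python
import numpy as np

def smear_gate(x, gate_logits):
    """Blend each token with the previous one, per channel (sketch).

    x: (T, D) token activations; gate_logits: (D,) learned parameters.
    Assumed form: out[t] = g * x[t] + (1 - g) * x[t-1], with out[0] = x[0].
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))  # sigmoid gate, shape (D,)
    prev = np.vstack([x[:1], x[:-1]])       # x shifted right by one position
    return g * x + (1.0 - g) * prev
```

With extreme logits the gate saturates, so each dimension independently keeps either the current or the previous token.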
BigramHash
Hash-table embedding for token bigrams.
parameters: {"dimensions":"2048x128"}
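A sketch of a hashed bigram embedding with the PR's stated 2048x128 table. The mixing hash and the use of token 0 as a pad for the first position are assumptions; the PR does not specify them.

```python
import numpy as np

N_BUCKETS, DIM = 2048, 128  # from the PR's "2048x128" parameter

rng = np.random.default_rng(0)
table = rng.standard_normal((N_BUCKETS, DIM)).astype(np.float32)

def bigram_bucket(prev_tok, tok):
    # Hypothetical mixing hash; any cheap pair hash into N_BUCKETS works.
    return (prev_tok * 1000003 + tok) % N_BUCKETS

def bigram_embed(tokens):
    """One hashed-bigram vector per position; position 0 pairs with pad 0."""
    out = np.zeros((len(tokens), DIM), dtype=np.float32)
    prev = 0
    for i, t in enumerate(tokens):
        out[i] = table[bigram_bucket(prev, t)]
        prev = t
    return out
```

Identical bigrams map to identical vectors, and unrelated bigrams collide only with probability ~1/2048 per pair.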
MLP3x
Wider MLP with 3x expansion in the feedforward network.
parameters: null
weight tying
Tied input and output embeddings.
parameters: null
U-Net skip connections
Encoder-decoder style skip connections with learned skip weights.
parameters: null
XSA
Exclusive Self-Attention: removes the self-value bias via an orthogonal projection.
parameters: {"layers":4}
GELU pre-enrichment
Wider nonlinear pre-transformer enrichment block: 512->768->512 with GELU.
parameters: {"input_dim":512,"hidden_dim":768,"output_dim":512}
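A sketch of the 512->768->512 GELU block applied to embeddings before the transformer stack. The tanh GELU approximation and the residual connection are assumptions; the PR specifies only the dimensions and the nonlinearity.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class PreEnrichment:
    """512 -> 768 -> 512 nonlinear enrichment block (sketch)."""
    def __init__(self, d_in=512, d_hidden=768, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((d_in, d_hidden)) * (d_in ** -0.5)
        self.w2 = rng.standard_normal((d_hidden, d_in)) * (d_hidden ** -0.5)

    def __call__(self, x):
        # residual connection is an assumption, not stated in the PR
        return x + gelu(x @ self.w1) @ self.w2
```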
Quantization
QAT
bits: 6
scope: all
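The 6-bit QAT entry can be illustrated with standard symmetric fake quantization of the forward pass. The per-tensor symmetric scheme below is a common default, not something the PR confirms; real QAT also needs a straight-through estimator for gradients.

```python
import numpy as np

def fake_quant(w, bits=6):
    """Symmetric per-tensor fake quantization (sketch): round weights to
    2**bits integer levels, then dequantize back to float."""
    qmax = 2 ** (bits - 1) - 1                       # 31 for 6 bits
    absmax = np.abs(w).max()
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale
```

At 6 bits the round-trip error is bounded by half a quantization step, which is what makes the compact 14.94 MB artifact feasible.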
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
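A sketch of the window arithmetic behind stride-64 sliding evaluation: each token is scored exactly once, but with up to 2048 tokens of left context. The scoring convention (only the last `stride` tokens of each window contribute to the loss) is the usual one for this setup and is assumed here.

```python
def sliding_windows(n_tokens, context_length=2048, stride=64):
    """Yield (start, end, score_from): the model sees tokens [start, end)
    but only tokens in [score_from, end) contribute to val_bpb."""
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - context_length)
        yield start, end, pos
        pos = end
```

Small strides improve val_bpb because late tokens in the stream are never scored with a short context, at the cost of ~context_length/stride times more forward-pass work.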
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"matrix_lr":0.025}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
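A sketch of the warmdown schedule: constant learning rate, then a linear decay to zero over the final `warmdown_steps`. Only `warmdown_steps=3500` comes from the PR; the constant-then-linear shape is the usual meaning of "warmdown" and is assumed.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR until total_steps - warmdown_steps, then linear to 0."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```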
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048

Novel Contributions

  • EMA kept on GPU during training to avoid synchronous GPU-to-CPU copies each step
  • GELU pre-enrichment block before the transformer stack
  • XSA applied to the last 4 layers
  • Sliding window evaluation with stride 64 for improved val_bpb
  • Combination of SmearGate, BigramHash, EMA, and quantization-aware training in a compact artifact
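The first contribution (EMA kept on GPU) amounts to doing the EMA update in place on device-resident buffers, so no per-step device transfer is needed. A sketch of the update rule with the PR's decay of 0.997, using numpy arrays as stand-ins for GPU tensors:

```python
import numpy as np

DECAY = 0.997  # from the PR's EMA parameters

def ema_update_(ema_params, params, decay=DECAY):
    """In-place EMA update: ema = decay * ema + (1 - decay) * param.

    Keeping ema_params on the same device as params (the GPU, in the PR)
    avoids a synchronous GPU-to-CPU copy on every training step.
    """
    for e, p in zip(ema_params, params):
        e *= decay
        e += (1.0 - decay) * p
```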