val_bpb: 1.1478
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.94 MB
Training Techniques
- Weight Averaging: EMA (decay: 0.997)
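The EMA update with decay 0.997 can be sketched framework-agnostically in numpy; the function name and dict layout are illustrative, not from the original. (The actual run keeps the shadow copy on-device, per the Novel Contributions notes.)

```python
import numpy as np

def ema_update(shadow, params, decay=0.997):
    """In-place EMA: shadow <- decay * shadow + (1 - decay) * params."""
    for k in params:
        shadow[k] *= decay
        shadow[k] += (1.0 - decay) * params[k]

# Toy usage: after many steps the shadow converges toward the live parameters.
params = {"w": np.ones(4)}
shadow = {"w": np.zeros(4)}
for _ in range(2000):
    ema_update(shadow, params)
```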
Architecture
- SmearGate: per-dimension gate blending each token with the previous token.
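A minimal numpy reading of SmearGate, assuming a learned per-dimension sigmoid gate that mixes each token with its predecessor (the exact parameterization in the run may differ):

```python
import numpy as np

def smear_gate(x, gate_logits):
    """Blend each token with the previous token, per dimension.

    x: (T, D) activations; gate_logits: (D,) learned parameters.
    out[t] = g * x[t] + (1 - g) * x[t-1], with a zero vector before t=0.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))              # per-dimension gate in (0, 1)
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])  # previous token, zero-padded
    return g * x + (1.0 - g) * prev

x = np.arange(12.0).reshape(4, 3)
y = smear_gate(x, np.full(3, 20.0))  # gate saturated near 1: output ~= input
```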
- BigramHash: hash-table embedding for token bigrams (table size: 2048x128).
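A sketch of a hashed-bigram embedding with the configured 2048x128 table; the hash constants and the zero token used before position 0 are assumptions for illustration:

```python
import numpy as np

VOCAB_ROWS, DIM = 2048, 128  # "2048x128" from the config

def bigram_hash(prev_tok, tok, n_rows=VOCAB_ROWS):
    # Simple multiplicative hash of the (previous, current) token pair; constants are illustrative.
    return (prev_tok * 1000003 + tok * 8191) % n_rows

table = np.random.default_rng(0).normal(0.0, 0.02, (VOCAB_ROWS, DIM))

def bigram_embed(tokens):
    """Look up a hashed-bigram embedding for each position (token 0 assumed before the start)."""
    prev = np.concatenate([[0], tokens[:-1]])
    rows = [bigram_hash(int(p), int(t)) for p, t in zip(prev, tokens)]
    return table[rows]

emb = bigram_embed(np.array([5, 7, 7]))
```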
- MLP3x: wider MLP with a 3x expansion in the feedforward network.
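The 3x feedforward expansion amounts to a D -> 3D -> D block; the ReLU nonlinearity below is an assumption (the source does not name the activation):

```python
import numpy as np

def mlp3x(x, w1, w2):
    """Feedforward block with 3x expansion: D -> 3D -> D. ReLU is assumed here."""
    h = np.maximum(x @ w1, 0.0)
    return h @ w2

D = 8
rng = np.random.default_rng(0)
w1 = rng.normal(size=(D, 3 * D))   # expansion to 3x width
w2 = rng.normal(size=(3 * D, D))   # projection back down
y = mlp3x(rng.normal(size=(4, D)), w1, w2)
```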
- Weight tying: tied input and output embeddings.
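Weight tying reuses one matrix for both the input embedding lookup and the output projection, which also shrinks the artifact. A minimal sketch (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 100, 16
embed = rng.normal(0.0, 0.02, (vocab, dim))  # one matrix serves both ends

def embed_tokens(tokens):
    return embed[tokens]       # input embedding lookup

def output_logits(hidden):
    return hidden @ embed.T    # tied output projection reuses the same weights

logits = output_logits(embed_tokens(np.array([3])))
```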
- U-Net skip connections: encoder-decoder-style skip connections with learned skip weights.
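The U-Net pattern pairs early layers with late layers: the first half of the stack pushes activations, the second half pops them and adds a learned-weighted skip. A sketch, assuming scalar skip weights:

```python
import numpy as np

def unet_stack(x, enc_layers, dec_layers, skip_weights):
    """First half pushes activations; second half pops the matching one and
    adds it back scaled by a learned weight (deepest pairs with earliest decoder)."""
    stack = []
    for f in enc_layers:
        x = f(x)
        stack.append(x)
    for f, w in zip(dec_layers, skip_weights):
        x = f(x + w * stack.pop())
    return x

# Toy usage with identity-ish layers to show the wiring.
enc = [lambda x: x + 1, lambda x: x + 1]
dec = [lambda x: x, lambda x: x]
out = unet_stack(0.0, enc, dec, skip_weights=[1.0, 1.0])
```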
- XSA: Exclusive Self Attention, removing self-value bias via orthogonal projection (layers: 4).
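One plausible reading of "removing self-value bias via orthogonal projection" is to project each token's attention output orthogonal to that token's own value vector; the actual XSA definition may differ, so treat this as a hypothesis:

```python
import numpy as np

def remove_self_value(out, v):
    """Project each row of `out` orthogonal to the same token's value vector.

    out, v: (T, D). The epsilon guards against zero-norm value vectors.
    """
    coef = (out * v).sum(-1, keepdims=True) / ((v * v).sum(-1, keepdims=True) + 1e-8)
    return out - coef * v

rng = np.random.default_rng(0)
o = rng.normal(size=(3, 4))
v = rng.normal(size=(3, 4))
r = remove_self_value(o, v)  # each row of r is now orthogonal to its value vector
```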
- GELU pre-enrichment: wider nonlinear pre-transformer enrichment block, 512 -> 768 -> 512 with GELU.
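The 512 -> 768 -> 512 enrichment block maps directly to two matrix multiplies with a GELU in between; the tanh approximation of GELU below is a common choice, not confirmed by the source:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def pre_enrich(x, w1, w2):
    """512 -> 768 -> 512 nonlinear block applied before the transformer stack."""
    return gelu(x @ w1) @ w2

rng = np.random.default_rng(0)
w1 = rng.normal(0.0, 0.02, (512, 768))
w2 = rng.normal(0.0, 0.02, (768, 512))
y = pre_enrich(rng.normal(size=(4, 512)), w1, w2)
```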
Quantization
- QAT: 6 bits, scope: all
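Quantization-aware training typically runs the forward pass through a "fake quant" step so the weights learn to tolerate the 6-bit grid. A minimal symmetric per-tensor sketch (the run's exact scheme is not specified):

```python
import numpy as np

def fake_quant(w, bits=6):
    """Symmetric per-tensor fake quantization used in the QAT forward pass."""
    qmax = 2 ** (bits - 1) - 1            # 31 for signed 6-bit
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

w = np.linspace(-1.0, 1.0, 100)
wq = fake_quant(w)   # snapped to at most 2**6 distinct levels
```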
Compression
- lzma (level: null)
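With the level left null, the artifact compression presumably falls back to lzma's default preset; the round-trip is just the standard-library calls:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    # level: null in the config is read here as lzma's default preset.
    return lzma.compress(raw)

def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)

data = b"weights" * 1000
blob = compress_artifact(data)
```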
Evaluation
- Sliding window evaluation (stride: 64, context_length: 2048)
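Sliding-window evaluation gives every scored token close to a full 2048-token context by advancing the window 64 tokens at a time and scoring only the fresh positions. A window-scheduling sketch (the scoring convention for the first window is an assumption):

```python
def sliding_windows(n_tokens, context=2048, stride=64):
    """Return (start, end, n_scored) windows: each spans up to `context` tokens;
    after the first window, only the final `stride` new positions are scored."""
    windows = [(0, min(context, n_tokens), min(context, n_tokens))]
    end = min(context, n_tokens)
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        windows.append((new_end - context, new_end, new_end - end))
        end = new_end
    return windows

ws = sliding_windows(5000)
```

Every token is scored exactly once, so the per-token losses still average to a valid val_bpb.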
Optimizer
- Muon (weight_decay: 0.04, momentum: null, matrix_lr: 0.025)
LR Schedule
- Warmdown (warmdown_steps: 3500)
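A warmdown schedule is commonly a constant learning rate followed by a linear decay to zero over the final steps; assuming that shape with the configured 3500-step warmdown:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then linear decay to zero over the final `warmdown_steps` steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```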
Sequence Length
- train_length: 2048, eval_length: 2048
Novel Contributions
- EMA kept on GPU during training to avoid synchronous GPU-to-CPU copies each step
- GELU pre-enrichment block before the transformer stack
- XSA applied to the last 4 layers
- Sliding window evaluation with stride 64 for improved val_bpb
- Combination of SmearGate, BigramHash, EMA, and quantization-aware training in a compact artifact