val_bpb: 1.1478
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.94 MB
Training Techniques
- Weight Averaging: EMA (decay: 0.997)
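The EMA update with decay 0.997 can be sketched framework-agnostically in numpy; the function name and dict layout are illustrative, not from the original. (The actual run keeps the shadow copy on-device, per the Novel Contributions notes.)

```python
import numpy as np

def ema_update(shadow, params, decay=0.997):
    """In-place EMA: shadow <- decay * shadow + (1 - decay) * params."""
    for k in params:
        shadow[k] *= decay
        shadow[k] += (1.0 - decay) * params[k]

# Toy usage: after many steps the shadow converges toward the live parameters.
params = {"w": np.ones(4)}
shadow = {"w": np.zeros(4)}
for _ in range(2000):
    ema_update(shadow, params)
```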
Architecture
- SmearGate: per-dimension gate blending each token with the previous token.
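A minimal numpy reading of SmearGate, assuming a learned per-dimension sigmoid gate that mixes each token with its predecessor (the exact parameterization in the run may differ):

```python
import numpy as np

def smear_gate(x, gate_logits):
    """Blend each token with the previous token, per dimension.

    x: (T, D) activations; gate_logits: (D,) learned parameters.
    out[t] = g * x[t] + (1 - g) * x[t-1], with a zero vector before t=0.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))              # per-dimension gate in (0, 1)
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])  # previous token, zero-padded
    return g * x + (1.0 - g) * prev

x = np.arange(12.0).reshape(4, 3)
y = smear_gate(x, np.full(3, 20.0))  # gate saturated near 1: output ~= input
```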
- BigramHash: hash-table embedding for token bigrams (table size: 2048x128).
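A sketch of a hashed-bigram embedding with the configured 2048x128 table; the hash constants and the zero token used before position 0 are assumptions for illustration:

```python
import numpy as np

VOCAB_ROWS, DIM = 2048, 128  # "2048x128" from the config

def bigram_hash(prev_tok, tok, n_rows=VOCAB_ROWS):
    # Simple multiplicative hash of the (previous, current) token pair; constants are illustrative.
    return (prev_tok * 1000003 + tok * 8191) % n_rows

table = np.random.default_rng(0).normal(0.0, 0.02, (VOCAB_ROWS, DIM))

def bigram_embed(tokens):
    """Look up a hashed-bigram embedding for each position (token 0 assumed before the start)."""
    prev = np.concatenate([[0], tokens[:-1]])
    rows = [bigram_hash(int(p), int(t)) for p, t in zip(prev, tokens)]
    return table[rows]

emb = bigram_embed(np.array([5, 7, 7]))
```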
- MLP3x: wider MLP with a 3x expansion in the feedforward network.
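The 3x feedforward expansion amounts to a D -> 3D -> D block; the ReLU nonlinearity below is an assumption (the source does not name the activation):

```python
import numpy as np

def mlp3x(x, w1, w2):
    """Feedforward block with 3x expansion: D -> 3D -> D. ReLU is assumed here."""
    h = np.maximum(x @ w1, 0.0)
    return h @ w2

D = 8
rng = np.random.default_rng(0)
w1 = rng.normal(size=(D, 3 * D))   # expansion to 3x width
w2 = rng.normal(size=(3 * D, D))   # projection back down
y = mlp3x(rng.normal(size=(4, D)), w1, w2)
```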
- Weight tying: tied input and output embeddings.
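Weight tying reuses one matrix for both the input embedding lookup and the output projection, which also shrinks the artifact. A minimal sketch (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 100, 16
embed = rng.normal(0.0, 0.02, (vocab, dim))  # one matrix serves both ends

def embed_tokens(tokens):
    return embed[tokens]       # input embedding lookup

def output_logits(hidden):
    return hidden @ embed.T    # tied output projection reuses the same weights

logits = output_logits(embed_tokens(np.array([3])))
```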
- U-Net skip connections: encoder-decoder-style skip connections with learned skip weights.
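The U-Net pattern pairs early layers with late layers: the first half of the stack pushes activations, the second half pops them and adds a learned-weighted skip. A sketch, assuming scalar skip weights:

```python
import numpy as np

def unet_stack(x, enc_layers, dec_layers, skip_weights):
    """First half pushes activations; second half pops the matching one and
    adds it back scaled by a learned weight (deepest pairs with earliest decoder)."""
    stack = []
    for f in enc_layers:
        x = f(x)
        stack.append(x)
    for f, w in zip(dec_layers, skip_weights):
        x = f(x + w * stack.pop())
    return x

# Toy usage with identity-ish layers to show the wiring.
enc = [lambda x: x + 1, lambda x: x + 1]
dec = [lambda x: x, lambda x: x]
out = unet_stack(0.0, enc, dec, skip_weights=[1.0, 1.0])
```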
- XSA: Exclusive Self Attention, removing self-value bias via orthogonal projection (layers: 4).
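One plausible reading of "removing self-value bias via orthogonal projection" is to project each token's attention output orthogonal to that token's own value vector; the actual XSA definition may differ, so treat this as a hypothesis:

```python
import numpy as np

def remove_self_value(out, v):
    """Project each row of `out` orthogonal to the same token's value vector.

    out, v: (T, D). The epsilon guards against zero-norm value vectors.
    """
    coef = (out * v).sum(-1, keepdims=True) / ((v * v).sum(-1, keepdims=True) + 1e-8)
    return out - coef * v

rng = np.random.default_rng(0)
o = rng.normal(size=(3, 4))
v = rng.normal(size=(3, 4))
r = remove_self_value(o, v)  # each row of r is now orthogonal to its value vector
```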
- GELU pre-enrichment: wider nonlinear pre-transformer enrichment block, 512 -> 768 -> 512 with GELU.
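The 512 -> 768 -> 512 enrichment block maps directly to two matrix multiplies with a GELU in between; the tanh approximation of GELU below is a common choice, not confirmed by the source:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def pre_enrich(x, w1, w2):
    """512 -> 768 -> 512 nonlinear block applied before the transformer stack."""
    return gelu(x @ w1) @ w2

rng = np.random.default_rng(0)
w1 = rng.normal(0.0, 0.02, (512, 768))
w2 = rng.normal(0.0, 0.02, (768, 512))
y = pre_enrich(rng.normal(size=(4, 512)), w1, w2)
```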
Quantization
- QAT: 6 bits, scope: all
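Quantization-aware training typically runs the forward pass through a "fake quant" step so the weights learn to tolerate the 6-bit grid. A minimal symmetric per-tensor sketch (the run's exact scheme is not specified):

```python
import numpy as np

def fake_quant(w, bits=6):
    """Symmetric per-tensor fake quantization used in the QAT forward pass."""
    qmax = 2 ** (bits - 1) - 1            # 31 for signed 6-bit
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

w = np.linspace(-1.0, 1.0, 100)
wq = fake_quant(w)   # snapped to at most 2**6 distinct levels
```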
Compression
- lzma (level: null)
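With the level left null, the artifact compression presumably falls back to lzma's default preset; the round-trip is just the standard-library calls:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    # level: null in the config is read here as lzma's default preset.
    return lzma.compress(raw)

def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)

data = b"weights" * 1000
blob = compress_artifact(data)
```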
Evaluation
- Sliding window evaluation (stride: 64, context_length: 2048)
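Sliding-window evaluation gives every scored token close to a full 2048-token context by advancing the window 64 tokens at a time and scoring only the fresh positions. A window-scheduling sketch (the scoring convention for the first window is an assumption):

```python
def sliding_windows(n_tokens, context=2048, stride=64):
    """Return (start, end, n_scored) windows: each spans up to `context` tokens;
    after the first window, only the final `stride` new positions are scored."""
    windows = [(0, min(context, n_tokens), min(context, n_tokens))]
    end = min(context, n_tokens)
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        windows.append((new_end - context, new_end, new_end - end))
        end = new_end
    return windows

ws = sliding_windows(5000)
```

Every token is scored exactly once, so the per-token losses still average to a valid val_bpb.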
Optimizer
- Muon (weight_decay: 0.04, momentum: null, matrix_lr: 0.025)
LR Schedule
- Warmdown (warmdown_steps: 3500)
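A warmdown schedule is commonly a constant learning rate followed by a linear decay to zero over the final steps; assuming that shape with the configured 3500-step warmdown:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then linear decay to zero over the final `warmdown_steps` steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```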
Sequence Length
- train_length: 2048, eval_length: 2048
Novel Contributions
- EMA kept on GPU during training to avoid synchronous GPU-to-CPU copies each step
- GELU pre-enrichment block before the transformer stack
- XSA applied to the last 4 layers
- Sliding window evaluation with stride 64 for improved val_bpb
- Combination of SmearGate, BigramHash, EMA, and quantization-aware training in a compact artifact