PR #372
closed
11L + XSA4 + EMA(0.997) + seq2048 + Int5-MLP + MuonWD=0.04 + LateK-FP16 | val_bpb=1.1361
by HyperPotatoNeo
val_bpb
1.1361
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.79MB
Training Techniques
Quantization
STE QAT
bits: 6
scope: attention weights
STE QAT
bits: 5
scope: MLP weights
fp16
bits: 16
scope: token embedding and last layer c_k
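The mixed-precision scheme above (int6 attention weights, int5 MLP weights, embeddings and the last layer's key projection kept in fp16) rests on STE QAT. A minimal sketch of the fake-quantization forward pass, assuming symmetric per-tensor absmax scaling (the actual scaling granularity is not stated in the PR); in training, the backward pass would ignore the rounding and pass gradients straight through:

```python
def fake_quant(weights, bits):
    """Symmetric per-tensor fake quantization: round weights onto a
    (2**bits - 1)-level signed grid and dequantize back to float.
    In STE QAT the backward pass ignores the rounding (straight-through)."""
    qmax = 2 ** (bits - 1) - 1            # e.g. +/-15 levels for 5 bits
    absmax = max(abs(w) for w in weights)
    scale = qmax / absmax                 # map the largest weight to +/-qmax
    return [round(w * scale) / scale for w in weights]
```

For example, `fake_quant([1.0, -0.5, 0.25], bits=5)` snaps every weight onto the 5-bit grid while keeping float storage, which is what makes the weights compress well after training.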
Architecture
XSA
Exclusive self-attention applied to the last 4 transformer layers, subtracting each token's own value vector from its attention output before the output projection.
parameters: {"layers":4}
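A pure-Python sketch of one XSA head under the description above: ordinary causal softmax attention, followed by subtraction of each token's own value vector (the single-head, unbatched form is for illustration only):

```python
import math

def xsa_head(q, k, v):
    """Exclusive self-attention for a single head (causal): compute ordinary
    softmax attention over positions 0..i, then subtract token i's own value
    vector from its attention output before the output projection."""
    d = len(q[0])
    out = []
    for i in range(len(q)):
        scores = [sum(q[i][t] * k[j][t] for t in range(d)) / math.sqrt(d)
                  for j in range(i + 1)]                       # causal mask
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        attn = [sum(exps[j] / z * v[j][t] for j in range(i + 1))
                for t in range(d)]
        out.append([attn[t] - v[i][t] for t in range(d)])      # "exclusive" part
    return out
```

Note the degenerate case: a token attending only to itself produces an exactly zero output, since its full attention mass lands on the value vector that is then subtracted.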
SmearGate
Learned per-dimension sigmoid gate blending each token's embedding with the preceding token's embedding.
parameters: null
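A sketch of the SmearGate blend, assuming the convex form `out[t] = (1 - g) * emb[t] + g * emb[t-1]` (the exact blend equation is not given in the PR; `gate_logits` stands in for the learned parameters):

```python
import math

def smear_gate(emb, gate_logits):
    """Blend each token embedding with the preceding token's embedding using a
    per-dimension sigmoid gate g.  The first token has no predecessor and is
    passed through unchanged.  The convex-blend form is an assumption."""
    g = [1.0 / (1.0 + math.exp(-a)) for a in gate_logits]   # per-dimension gate
    out = [list(emb[0])]
    for t in range(1, len(emb)):
        out.append([(1.0 - g[d]) * emb[t][d] + g[d] * emb[t - 1][d]
                    for d in range(len(g))])
    return out
```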
BigramHash
2048-bucket hashed bigram embedding table for consecutive token pairs.
parameters: {"vocab_size":2048,"dimensions":64}
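The bucket lookup can be sketched as a cheap order-sensitive hash of the token pair into the 2048 buckets; the bucket index then selects a learned 64-dimensional embedding added to the token stream. The mixing constant below is an illustrative choice, not the PR's actual hash:

```python
def bigram_bucket(prev_token, token, num_buckets=2048):
    """Hash a consecutive token pair into one of `num_buckets` buckets.
    Multiplying the previous token by a large prime before adding the current
    token makes the hash order-sensitive (the prime here is illustrative)."""
    return (prev_token * 1000003 + token) % num_buckets
```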
KV head count
Grouped-query attention with 4 key/value heads.
parameters: {"num_kv_heads":4}
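With grouped-query attention, consecutive groups of query heads share a single key/value head, shrinking the KV projection weights. A sketch of the standard head mapping (the query-head count of 8 below is an assumption; the PR only specifies 4 KV heads):

```python
def kv_head_for(q_head, num_q_heads, num_kv_heads=4):
    """Grouped-query attention head mapping: consecutive groups of query heads
    share one key/value head.  With 8 query heads and 4 KV heads, query heads
    (0,1) use KV head 0, (2,3) use KV head 1, and so on."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size
```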
11-layer U-Net Transformer
Transformer with 11 blocks arranged as 5 encoder and 6 decoder layers with skip connections.
parameters: {"layers":11}
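A sketch of the 5-encoder/6-decoder skip wiring: encoder outputs are saved, then added back (last-in, first-out) before the later decoder blocks. Since there is one more decoder than encoder layer, the first decoder block is assumed to take no skip; blocks are stand-in scalar functions rather than real transformer blocks:

```python
def unet_forward(x, enc_layers, dec_layers):
    """U-Net style transformer wiring: save each encoder block's output, then
    add the saved activations back in reverse order before the later decoder
    blocks.  The first decoder block takes no skip because there is one more
    decoder (6) than encoder (5) layer -- an assumed pairing."""
    skips = []
    for block in enc_layers:
        x = block(x)
        skips.append(x)
    for j, block in enumerate(dec_layers):
        if j > 0:
            x = x + skips.pop()      # skip connection, last encoder out first
        x = block(x)
    return x
```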
Optimizer
Muon
weight_decay: 0.04
momentum: 0.95
other_params: {"momentum_warmup_start":0.85,"momentum_warmup_steps":500}
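The momentum warmup parameters above suggest a schedule that ramps Muon's momentum from 0.85 to its final 0.95 over the first 500 steps. A sketch, assuming a linear ramp (the interpolation shape is not stated in the PR):

```python
def muon_momentum(step, start=0.85, end=0.95, warmup_steps=500):
    """Warm Muon's momentum from `start` to `end` over `warmup_steps` steps,
    then hold it at `end`.  The linear ramp is an assumption."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps
```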
Weight Averaging
EMA
parameters: {"decay":0.997}
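The EMA update with decay 0.997 keeps a shadow copy of the weights that tracks roughly the last 1/(1 - 0.997) ≈ 333 steps of training; the averaged weights are what get evaluated. A minimal sketch:

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step: avg <- decay * avg + (1 - decay) * params.  With
    decay=0.997 the average has an effective horizon of ~333 steps."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```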
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":2000,"warmup_steps":20}
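The schedule above is trapezoidal: a 20-step linear warmup, a flat region at peak, then a 2000-step linear warmdown to zero at the end of training. A sketch (the peak LR and total step count below are illustrative, not from the PR):

```python
def lr_at(step, total_steps, peak_lr=1.0, warmup_steps=20, warmdown_steps=2000):
    """Trapezoidal LR schedule: linear warmup, flat at `peak_lr`, then a
    linear warmdown reaching zero at `total_steps`."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step > total_steps - warmdown_steps:
        return peak_lr * (total_steps - step) / warmdown_steps
    return peak_lr
```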
Regularization
weight decay
parameters: {"weight_decay":0.04}
Initialization
OrthoInit
Large weight matrices initialized orthogonally; output projections scaled by 1/sqrt(2*num_layers).
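A small stand-in for the orthogonal init: Gram-Schmidt on Gaussian vectors produces orthonormal rows (a QR decomposition would be used in practice), and output projections are scaled by 1/sqrt(2 * num_layers) to keep residual-branch variance in check:

```python
import math
import random

def ortho_init(n, seed=0):
    """Initialize an n x n matrix with orthonormal rows via Gram-Schmidt on
    Gaussian vectors -- a small stand-in for a QR-based orthogonal init."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for u in rows:                        # remove components along earlier rows
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows

def output_proj_scale(num_layers=11):
    """Scaling applied to output projections: 1 / sqrt(2 * num_layers)."""
    return 1.0 / math.sqrt(2 * num_layers)
```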
Evaluation
sliding window eval
parameters: {"stride":64}
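Sliding-window evaluation with stride 64 slides a full-length context window across the text and scores only the newly exposed tokens in each window, so every token is evaluated exactly once with near-maximal left context. A sketch of which token spans each window scores, assuming the standard first-window-scores-everything convention:

```python
def scored_spans(n_tokens, window=2048, stride=64):
    """Sliding-window evaluation: windows of length `window` advance by
    `stride`; the first window scores all of its tokens, each later window
    scores only the `stride` new tokens, so every token is scored once."""
    spans = []
    scored_to = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        if end > scored_to:
            spans.append((scored_to, end))    # (start, end) of scored tokens
            scored_to = end
        if end == n_tokens:
            break
    return spans
```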
Compression
zstd
level: 22
Novel Contributions
- 11-layer U-Net Transformer with skip connections
- Mixed int6 attention and int5 MLP quantization with STE QAT
- Late-K FP16 for the final layer's key projection
- Exclusive self-attention on the last 4 layers
- EMA weight averaging with decay 0.997
- Sequence length increased to 2048
- SmearGate and BigramHash bigram-context embedding techniques
- Muon optimizer tuned with weight decay 0.04