PR #359
11L MLP3x + Int6 QAT + XSA + EMA + BigramHash + FA3 (val_bpb 1.1345)
Status: closed
by tmustier
val_bpb
1.1345
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.37MB
Training Techniques
Quantization
STE QAT
bits: 6
scope: all
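The PR does not include the QAT code itself; a minimal sketch of symmetric int6 fake quantization, assuming per-tensor scaling (in a training framework, the straight-through estimator would pass gradients through the rounding step unchanged):

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """Symmetric per-tensor fake quantization.

    Forward: snap weights to 2**bits integer levels and dequantize.
    Backward (in an autograd framework): the STE treats round() as
    identity, so gradients flow to the full-precision weights.
    """
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6 bits
    amax = np.abs(w).max()
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)  # int levels in [-32, 31]
    return q * scale, q.astype(np.int8)

w = np.array([0.5, -0.31, 0.02, -1.0])
w_dq, w_q = fake_quant_int6(w)
```

Per-tensor scaling is an assumption here; the summary only states 6 bits applied to all weights.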
Architecture
MLP3x
3x MLP expansion with SwiGLU in an 11-layer Transformer
parameters: {"layers":11,"width":512}
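The MLP3x block pairs a 3x hidden expansion with SwiGLU. A sketch of one such MLP at the PR's width of 512 (weight shapes and init are illustrative, not taken from the PR):

```python
import numpy as np

def swiglu_mlp(x, W_gate, W_up, W_down):
    """SwiGLU MLP: down( SiLU(x @ W_gate) * (x @ W_up) )."""
    gate = x @ W_gate
    silu = gate / (1.0 + np.exp(-gate))   # SiLU(g) = g * sigmoid(g)
    return (silu * (x @ W_up)) @ W_down

d, h = 512, 3 * 512                       # 3x expansion per the PR
rng = np.random.default_rng(0)
W_gate = rng.normal(0, 0.02, (d, h))
W_up = rng.normal(0, 0.02, (d, h))
W_down = rng.normal(0, 0.02, (h, d))
y = swiglu_mlp(rng.normal(size=(4, d)), W_gate, W_up, W_down)
```

Note that with the gate and up projections, the hidden multiplier counts parameters differently than a plain GELU MLP; the summary does not say which convention "3x" uses.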
XSA
Cross-sequence attention applied to the last 4 layers
parameters: {"last_n_layers":4}
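The summary does not define XSA's exact mechanism. One plausible reading, sketched below purely as an illustration, is attention in which queries from every sequence in the batch can attend to keys from all sequences, i.e. batch and time are flattened together in the last 4 layers:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_sequence_attention(q, k, v):
    """Single-head attention across sequence boundaries: all B*T
    positions in the batch form one attention pool (no causal mask
    shown; a real implementation would need one)."""
    B, T, D = q.shape
    qf, kf, vf = (t.reshape(B * T, D) for t in (q, k, v))
    att = softmax(qf @ kf.T / np.sqrt(D))
    return (att @ vf).reshape(B, T, D)

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(2, 3, 4)) for _ in range(3))
out = cross_sequence_attention(q, k, v)
```

This is a guess at the technique's shape, not the PR's implementation.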
BigramHash
BigramHash embedding augmentation
parameters: {"vocab_size":2048,"dim":128}
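A sketch of what a hashed-bigram embedding augmentation can look like with the PR's sizes (vocab_size 2048, dim 128): each (previous, current) token pair is hashed into a small table and the looked-up vector is added to the usual token embedding. The hash multiplier and the padding at position 0 are assumptions for illustration:

```python
import numpy as np

def bigram_hash_embed(tokens, table, mult=1000003):
    """Hash each (prev, cur) token pair into the table and return the
    extra embedding per position; position 0 has no predecessor, so a
    dummy prev token 0 is used."""
    V, D = table.shape                    # (2048, 128) per the PR
    prev = np.concatenate([[0], tokens[:-1]])
    idx = (prev * mult + tokens) % V
    return table[idx]

table = np.random.default_rng(0).normal(0, 0.02, (2048, 128))
emb = bigram_hash_embed(np.array([5, 17, 5, 17]), table)
```

Identical bigrams hash to identical rows, which is the point: the table memorizes frequent local pairs cheaply.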
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"adam_for_non_matrix_params":true}
Adam
weight_decay: 0.04
momentum: null
other_params: {"used_for":"non-matrix params"}
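The Muon/Adam split above routes 2-D (matrix) parameters to Muon and everything else (embeddings treated per Muon's conventions aside, biases, norms) to Adam. A sketch of the partitioning plus Muon's core Newton-Schulz orthogonalization step, using the quintic coefficients from the public Muon reference implementation (the rest of the update loop is omitted):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize a gradient matrix, as Muon does
    before applying it as an update. Coefficients are from the public
    Muon reference; they keep singular values near 1 rather than
    converging them exactly."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)    # normalize by Frobenius norm
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def split_params(named_params):
    """Route matrix params to Muon, everything else to Adam."""
    muon = {n: p for n, p in named_params.items() if p.ndim == 2}
    adam = {n: p for n, p in named_params.items() if p.ndim != 2}
    return muon, adam

params = {"w": np.zeros((4, 4)), "b": np.zeros(4)}
muon_params, adam_params = split_params(params)
```

Both optimizer groups use weight decay 0.04 per the tables above; the exact routing rule (ndim == 2) is an assumption consistent with "adam_for_non_matrix_params".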
Weight Averaging
EMA
parameters: {"decay":0.997}
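EMA with decay 0.997 keeps a shadow copy of the weights that is updated after every step; evaluation then uses the shadow copy. A minimal sketch (dict-of-scalars for brevity; real code would iterate over tensors):

```python
def ema_update(ema, params, decay=0.997):
    """One EMA step: shadow = decay * shadow + (1 - decay) * current."""
    for k in params:
        ema[k] = decay * ema[k] + (1 - decay) * params[k]
    return ema

ema = {"w": 0.0}
ema_update(ema, {"w": 1.0}, decay=0.997)
```

With decay 0.997 the averaging window is roughly 1/(1-0.997) ≈ 333 steps.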
Compression
zstd
level: 22
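For the 15.37MB artifact, the int6 weights must be bit-packed before zstd level 22 is applied (8-bit storage of 6-bit values would waste a quarter of the space and compress worse). A sketch of the packing step, assuming the offset-binary layout below; zstd itself would then run over the packed bytes:

```python
import numpy as np

def pack_int6(q):
    """Pack int6 values (range [-32, 31]) into a dense bitstream:
    4 values -> 3 bytes. zstd -22 would then compress these bytes."""
    u = (q.astype(np.int64) + 32).astype(np.uint8)          # unsigned 0..63
    bits = np.unpackbits(u.reshape(-1, 1), axis=1)[:, 2:]   # low 6 bits, MSB-first
    return np.packbits(bits.ravel()).tobytes()

packed = pack_int6(np.array([-32, 0, 31, 5]))
```

The offset-binary encoding and bit order here are illustrative; the PR does not state its serialization format.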
Evaluation
sliding window eval
parameters: {"stride":64}
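Sliding-window evaluation with stride 64 gives every token close to a full 2048-token left context while scoring each token exactly once: windows advance by the stride, and only the final stride tokens of each window (all tokens, for the first window) contribute to the loss. A sketch of the window bookkeeping, assuming stride-aligned token counts:

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (begin, end, score_from): evaluate tokens [begin, end),
    but only count loss on [score_from, end)."""
    begin = 0
    while begin < n_tokens:
        end = min(begin + window, n_tokens)
        score_from = begin if begin == 0 else end - stride
        yield begin, end, score_from
        if end == n_tokens:
            break
        begin += stride

spans = list(sliding_windows(200, window=100, stride=50))
```

The smaller the stride, the closer the result is to full-context bpb, at the cost of proportionally more forward passes.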
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":20}
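The schedule above is trapezoidal: a 20-step linear warmup, a constant plateau, then a linear warmdown to zero over the final 3000 iterations. A sketch of the multiplier as a function of step (the total iteration count is not given in the summary, so it is a parameter here):

```python
def lr_scale(step, total_iters, warmup_steps=20, warmdown_iters=3000):
    """Trapezoidal LR multiplier: linear warmup, constant plateau,
    linear decay to 0 over the last warmdown_iters steps."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step >= total_iters - warmdown_iters:
        return (total_iters - step) / warmdown_iters
    return 1.0
```

Multiply the base learning rate of each optimizer group by this scale every step.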
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Other
other
FlashAttention 3 (Hopper build) used for competitive throughput
parameters: {"required":true}
Novel Contributions
- 11-layer Transformer with 3x MLP expansion
- Int6 STE QAT with zstd-22 compression
- XSA on the last 4 layers
- EMA weight averaging
- BigramHash(2048) embedding augmentation
- Muon optimizer combined with Adam for non-matrix parameters
- Sliding-window evaluation with stride 64
- FlashAttention 3 Hopper build for throughput