PR #391 (closed)
Add MaxParams6L_120 submission (1.2374 BPB) to track_non_record_16mb
by NishantDahal
val_bpb
1.2374
Architecture
Transformer
Optimizer
Muon
Artifact Size
13.5MB
Training Techniques
Architecture
encoder-decoder depth split
6-layer encoder-decoder model with 3 encoder and 3 decoder layers plus learned skip connections, instead of the usual deeper stack.
parameters: {"layers":6,"encoder_layers":3,"decoder_layers":3}
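The encoder-decoder split with learned skip connections could be wired roughly as below. This is a minimal NumPy sketch of the data flow only: the stand-in `block` (a placeholder for a real attention + MLP layer), the model width, and the 0.5 gate initialization are all hypothetical, not taken from the submission.

```python
import numpy as np

def block(x, seed):
    # Hypothetical stand-in for a full transformer layer (attention + MLP).
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((x.shape[-1], x.shape[-1])) * 0.02
    return x + np.tanh(x @ w)

def forward(x, n_enc=3, n_dec=3):
    # Encoder half: run layers and stash each activation for later reuse.
    skips = []
    for i in range(n_enc):
        x = block(x, seed=i)
        skips.append(x)
    # Decoder half: blend in the matching encoder activation via a
    # learned scalar gate (shown here at an assumed init of 0.5).
    gates = [0.5] * n_dec
    for i in range(n_dec):
        x = gates[i] * skips[n_dec - 1 - i] + (1 - gates[i]) * x
        x = block(x, seed=100 + i)
    return x

x = np.zeros((4, 16))  # (tokens, model dim), toy sizes
y = forward(x)
```

The U-Net-style pairing (decoder layer i reads the activation of encoder layer n_enc − i) is one common way to realize such skips; the submission's exact pairing may differ.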
SwiGLU MLP
Uses SwiGLU feed-forward blocks instead of the baseline's ReLU-squared MLPs.
parameters: {"hidden_size":1280}
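A SwiGLU feed-forward block gates an up-projection with a SiLU-activated second projection before projecting back down. The sketch below uses the submission's hidden size of 1280; the model width of 320 and the initialization are assumptions for illustration.

```python
import numpy as np

def swiglu_mlp(x, w_gate, w_up, w_down):
    # SwiGLU: SiLU(x @ W_gate) elementwise-gates (x @ W_up),
    # then W_down projects back to the model dimension.
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))   # SiLU / swish activation
    return (silu * (x @ w_up)) @ w_down

d_model, hidden = 320, 1280   # hidden_size from the submission; d_model assumed
rng = np.random.default_rng(0)
w_gate = rng.standard_normal((d_model, hidden)) * 0.02
w_up   = rng.standard_normal((d_model, hidden)) * 0.02
w_down = rng.standard_normal((hidden, d_model)) * 0.02
out = swiglu_mlp(rng.standard_normal((8, d_model)), w_gate, w_up, w_down)
```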
weight tying
Weight tying is disabled: the input and output embedding matrices are kept separate.
parameters: null
KV head count
Uses full multi-head attention with one KV head per query head instead of grouped-query attention.
parameters: {"kv_heads":8}
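The difference from grouped-query attention is only in how many distinct KV heads exist. A sketch, with toy sequence length and head dimension (the head dimension is an assumption): under GQA each KV head would be repeated `q_heads // kv_heads` times, but with 8 KV heads for 8 query heads the repeat factor is 1 and every query head gets its own K/V.

```python
import numpy as np

def attention(q, k, v):
    # q, k, v: (heads, tokens, head_dim)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores -= scores.max(-1, keepdims=True)      # numerically stable softmax
    p = np.exp(scores)
    p /= p.sum(-1, keepdims=True)
    return p @ v

n_q_heads, n_kv_heads, T, hd = 8, 8, 16, 32      # full MHA: kv_heads == q_heads
rng = np.random.default_rng(0)
q  = rng.standard_normal((n_q_heads, T, hd))
kv = rng.standard_normal((2, n_kv_heads, T, hd))
# GQA would repeat each KV head across its query group; ratio 1 -> no sharing.
k = np.repeat(kv[0], n_q_heads // n_kv_heads, axis=0)
v = np.repeat(kv[1], n_q_heads // n_kv_heads, axis=0)
out = attention(q, k, v)
```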
learned per-dimension control knobs
Adds learned residual mixing, attention scaling, MLP scaling, and per-head query gain parameters.
parameters: null
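The four kinds of learned knobs could slot into a block as shown below. All shapes, the convex-blend form of residual mixing, and the init values are assumptions made for illustration; the submission does not spell out the exact parameterization.

```python
import numpy as np

d, n_heads, T, head_dim = 32, 4, 8, 8
rng = np.random.default_rng(0)

# Learned parameters, shown at plausible init values (hypothetical).
resid_mix  = np.full(d, 0.5)   # per-dim blend of residual vs. attention branch
attn_scale = np.ones(d)        # per-dim gain on the attention output
mlp_scale  = np.ones(d)        # per-dim gain on the MLP output
q_gain     = np.ones(n_heads)  # per-head multiplier on the query vectors

h        = rng.standard_normal((T, d))
attn_out = rng.standard_normal((T, d))
mlp_out  = rng.standard_normal((T, d))
q        = rng.standard_normal((n_heads, T, head_dim))

# Residual mixing: learned per-dimension blend between the incoming
# residual stream and the attention branch, instead of a plain x + attn(x).
h = resid_mix * h + (1 - resid_mix) * (h + attn_scale * attn_out)
# Scaled MLP branch.
h = h + mlp_scale * mlp_out
# Per-head query gain, applied before the attention dot products.
q = q * q_gain[:, None, None]
```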
Optimizer
Muon
weight_decay: 0.04
momentum: 0.95
other_params: {"matrix_lr":0.045}
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
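The int8 + zlib artifact pipeline can be sketched as symmetric per-tensor quantization followed by zlib over the int8 payload. Per-tensor scaling and the toy weight statistics are assumptions; the submission only states 8-bit quantization over all weights with zlib compression.

```python
import zlib
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor int8: the largest |w| maps to +/-127.
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = (rng.standard_normal((256, 256)) * 0.02).astype(np.float32)
q, scale = quantize_int8(w)

# Dequantization error is bounded by half a quantization step.
err = np.abs(w - q.astype(np.float32) * scale).max()

# zlib over the raw int8 bytes. Stronger weight decay pulls weights
# toward zero, shrinking the int8 range and improving this ratio,
# which is presumably why weight decay does double duty here.
packed = zlib.compress(q.tobytes(), level=9)
ratio = len(packed) / q.nbytes
```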
Evaluation
sliding window eval
parameters: {"stride":256}
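Sliding-window evaluation with stride 256 scores each token once while giving it close to a full window of left context. A minimal sketch, assuming `nll_fn` is some model wrapper returning per-token negative log-likelihoods in nats (the interface is hypothetical; a real harness would also convert per-token bits to bits per byte):

```python
import math

def sliding_window_eval(nll_fn, tokens, window=2048, stride=256):
    # The window advances `stride` tokens at a time; only the newest
    # `stride` tokens of each window are counted, so every token is
    # scored exactly once with up to `window` tokens of context.
    total, count = 0.0, 0
    for start in range(0, len(tokens), stride):
        end = start + stride
        ctx = tokens[max(0, end - window): end]
        new = tokens[start:end]
        total += sum(nll_fn(ctx)[-len(new):])
        count += len(new)
    return total / count / math.log(2)   # nats -> mean bits per token

# Toy model assigning probability 1/2 to every token -> 1 bit per token.
bits = sliding_window_eval(lambda ctx: [math.log(2)] * len(ctx),
                           tokens=list(range(1000)))
```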
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Regularization
weight decay
parameters: {"weight_decay":0.04}
Novel Contributions
- 6-layer encoder-decoder architecture with learned skip connections to maximize learning under fixed wallclock
- Untied embeddings enabling a much higher embedding learning rate
- Full multi-head attention with 8 KV heads instead of grouped-query attention
- Per-dimension learned control parameters for residual mixing and attention/MLP scaling
- SwiGLU MLP replacing the baseline's ReLU-squared MLP
- Training and evaluating at sequence length 2048
- Weight decay used both for optimization and to improve INT8 compressibility
- Sliding-window evaluation with stride 256