PR #391

closed

Add MaxParams6L_120 submission (1.2374 BPB) to track_non_record_16mb

by NishantDahal
val_bpb: 1.2374
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.5MB

Training Techniques

Architecture
encoder-decoder depth split
6-layer encoder-decoder model with 3 encoder and 3 decoder layers plus learned skip connections, instead of the usual deeper stack.
parameters: {"layers":6,"encoder_layers":3,"decoder_layers":3}
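A minimal sketch of how such a depth split with learned skip gates might look. The U-Net-style pairing (decoder layer j mixing in encoder layer 2−j) and the residual-block internals are assumptions for illustration, not the submission's code:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 8, 6
Ws = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_layers)]

def layer(i, x):
    # stand-in for a full transformer block (attention + MLP)
    return x + np.tanh(x @ Ws[i])

def forward(x, skip_gates):
    """3 encoder layers store activations; 3 decoder layers mix them
    back in via learned gates (scalar per layer here for brevity)."""
    skips, h = [], x
    for i in range(3):                            # encoder half
        h = layer(i, h)
        skips.append(h)
    for j in range(3):                            # decoder half
        h = h + skip_gates[j] * skips[2 - j]      # learned skip connection
        h = layer(3 + j, h)
    return h

x = rng.normal(size=(4, d_model))
out = forward(x, skip_gates=np.full(3, 0.5))
```

With all gates at zero this reduces exactly to a plain 6-layer stack, so the learned skips can only add expressivity.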
SwiGLU MLP
Uses SwiGLU feed-forward blocks instead of the usual ReLU-squared MLPs.
parameters: {"hidden_size":1280}
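For reference, a SwiGLU feed-forward block computes a SiLU-gated product before projecting back down. Dimensions below are illustrative; only the hidden size of 1280 is stated in the submission:

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))   # SiLU (swish) activation

def swiglu_mlp(x, W_gate, W_up, W_down):
    # SwiGLU: gate branch through SiLU, elementwise product with the
    # up projection, then project back to the model dimension.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d_model, hidden = 64, 160            # illustrative; submission uses hidden 1280
W_gate = rng.normal(scale=0.05, size=(d_model, hidden))
W_up   = rng.normal(scale=0.05, size=(d_model, hidden))
W_down = rng.normal(scale=0.05, size=(hidden, d_model))

x = rng.normal(size=(4, d_model))
y = swiglu_mlp(x, W_gate, W_up, W_down)
```

Note SwiGLU carries three weight matrices per MLP rather than two, which matters under a fixed parameter budget.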
weight tying
Input and output embeddings are untied (weight tying disabled).
parameters: null
KV head count
Uses full multi-head attention with one KV head per query head instead of grouped-query attention.
parameters: {"kv_heads":8}
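The only difference from grouped-query attention is how many K/V projections exist; with kv_heads equal to the query head count, the GQA head-sharing step becomes a no-op. A sketch (causal, single sequence, no output projection; shapes illustrative):

```python
import numpy as np

def attention(x, Wq, Wk, Wv, n_heads, kv_heads):
    T, d = x.shape
    hd = d // n_heads
    q = (x @ Wq).reshape(T, n_heads, hd)
    k = (x @ Wk).reshape(T, kv_heads, hd)
    v = (x @ Wv).reshape(T, kv_heads, hd)
    # GQA shares each KV head across n_heads // kv_heads query heads;
    # with kv_heads == n_heads this repeat is a no-op (full MHA).
    rep = n_heads // kv_heads
    k, v = np.repeat(k, rep, axis=1), np.repeat(v, rep, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(hd)
    scores = np.where(np.triu(np.ones((T, T), bool), 1), -np.inf, scores)  # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', w, v).reshape(T, d)

rng = np.random.default_rng(0)
d, T = 64, 5
x = rng.normal(size=(T, d))
Wq = rng.normal(scale=0.1, size=(d, d))
Wk = rng.normal(scale=0.1, size=(d, d))   # full MHA: K/V projections as wide as Q
Wv = rng.normal(scale=0.1, size=(d, d))
y = attention(x, Wq, Wk, Wv, n_heads=8, kv_heads=8)
```

The cost of full MHA here is larger K/V projection matrices, traded against the quality loss GQA can incur at small scale.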
learned per-dimension control knobs
Adds learned residual mixing, attention scaling, MLP scaling, and per-head query gain parameters.
parameters: null
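These knobs are typically per-dimension vectors initialized so each block starts as a standard residual layer. A hedged sketch of the idea (exact placement and initialization are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W = rng.normal(scale=0.1, size=(d, d))

def sublayer(x):
    return np.tanh(x @ W)            # stand-in for an attention or MLP sublayer

def block(x, resid_mix, out_scale):
    # resid_mix: learned per-dimension weight on the residual stream
    # out_scale: learned per-dimension weight on the sublayer output
    return resid_mix * x + out_scale * sublayer(x)

x = rng.normal(size=(3, d))
y = block(x, resid_mix=np.ones(d), out_scale=np.ones(d))
```

At initialization (all ones) this is exactly `x + sublayer(x)`; training then adjusts each dimension's mix independently. Per-head query gains would similarly multiply each head's query vectors by a learned scalar.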
Optimizer
Muon
weight_decay: 0.04
momentum: 0.95
other_params: {"matrix_lr":0.045}
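Muon applies momentum SGD per weight matrix and orthogonalizes each update with a Newton-Schulz iteration before applying it. A sketch based on the public reference implementation (the quintic coefficients come from that reference; the exact momentum and weight-decay form used in this submission is an assumption):

```python
import numpy as np

def newton_schulz5(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration: pushes all singular values of G
    # toward 1, i.e. approximately orthogonalizes the update matrix.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)        # normalize so iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                              # keep the Gram matrix small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.045, momentum=0.95, weight_decay=0.04):
    buf = momentum * buf + grad              # heavy-ball momentum (assumed form)
    W = W * (1.0 - lr * weight_decay)        # decoupled weight decay
    return W - lr * newton_schulz5(buf), buf

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(16, 32))
buf = np.zeros_like(W)
W, buf = muon_step(W, rng.normal(size=W.shape), buf)
```

Here lr=0.045 mirrors the submission's matrix_lr; embeddings and scalar parameters would normally go to a separate optimizer.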
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
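The artifact pipeline quantizes all weights to int8 and zlib-compresses the bytes. A minimal sketch (per-tensor symmetric scaling is an assumption; the submission only states bits=8, scope=all, zlib):

```python
import zlib
import numpy as np

def quantize_int8(W):
    # symmetric per-tensor quantization: map max |w| to 127
    scale = np.abs(W).max() / 127.0
    q = np.round(W / scale).astype(np.int8)
    return q, scale

def pack(weights, level=9):
    # concatenate the int8 tensors and zlib-compress the artifact
    blobs = [quantize_int8(W)[0].tobytes() for W in weights]
    return zlib.compress(b"".join(blobs), level)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(W)
blob = pack([W])
```

This is where the weight decay interacts with the size budget: smaller weight magnitudes concentrate the int8 codes near zero, which zlib compresses better.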
Evaluation
sliding window eval
parameters: {"stride":256}
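With a 2048-token window and stride 256, evaluation slides the context 256 tokens at a time and scores only the tokens not covered by the previous window, so most tokens are scored with up to 1792 tokens of left context. A sketch of the span bookkeeping (the model scoring itself is omitted):

```python
def eval_spans(n_tokens, max_len=2048, stride=256):
    """Yield (ctx_start, ctx_end, n_scored): each window scores only
    the tokens the previous window did not."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_len, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = eval_spans(5000)
```

Every token is scored exactly once, so summing per-token losses over all spans gives a well-defined BPB.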
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Regularization
weight decay
parameters: {"weight_decay":0.04}

Novel Contributions

  • 6-layer encoder-decoder architecture with learned skip connections to maximize learning under fixed wallclock
  • Untied embeddings enabling a much higher embedding learning rate
  • Full multi-head attention with 8 KV heads instead of grouped-query attention
  • Per-dimension learned control parameters for residual mixing and attention/MLP scaling
  • SwiGLU MLP replacing the usual ReLU-squared activation
  • Training and evaluating at sequence length 2048
  • Weight decay used both for optimization and to improve INT8 compressibility
  • Sliding-window evaluation with stride 256