PR #395

open

Add MaxParams6L_120 submission (1.2374 BPB) to track_non_record_16mb

by NishantDahal
val_bpb
1.2374
Architecture
Transformer
Optimizer
Muon
Artifact Size
13.5MB

Training Techniques

Architecture
depth reduction / encoder-decoder split
6-layer encoder-decoder architecture with 3 encoder and 3 decoder layers plus learned skip connections, instead of the usual 9-11 layers.
parameters: {"layers":6,"encoder_layers":3,"decoder_layers":3}
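The 3+3 encoder-decoder split with learned skips can be sketched as follows. This is a minimal numpy illustration, not the submission's code: the model width, the U-Net-style pairing of encoder and decoder layers, and the scalar skip-weight initialization are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64   # hypothetical width; the PR does not state it

def block(h):
    # Stand-in for one transformer layer (attention + MLP omitted).
    return np.tanh(h)

x = rng.normal(size=(10, d_model))   # (seq_len, d_model)

# 3 encoder layers: keep each layer's output for the skip connections.
enc_outs, h = [], x
for _ in range(3):
    h = block(h)
    enc_outs.append(h)

# 3 decoder layers: decoder layer i mixes in encoder output (2 - i) via a
# learned scalar skip weight (pairing and init are guesses).
skip_w = np.ones(3)
for i in range(3):
    h = block(h + skip_w[i] * enc_outs[2 - i])

assert h.shape == (10, d_model)
```

The point of the skips is that a 6-layer stack can recover some of the expressivity of the usual 9-11 layer models by letting late layers read early-layer features directly.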
SwiGLU MLP
Replaced the usual ReLU-squared MLP with a SwiGLU feedforward block.
parameters: {"hidden_size":1280}
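For reference, SwiGLU gates an "up" projection with a SiLU-activated "gate" projection, versus the baseline's single ReLU-squared projection. A minimal numpy sketch, using the submission's hidden_size of 1280 (the model width and weight init are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, hidden = 512, 1280   # hidden_size 1280 from the submission; d_model assumed

W_gate = rng.normal(size=(d_model, hidden)) * 0.02
W_up   = rng.normal(size=(d_model, hidden)) * 0.02
W_down = rng.normal(size=(hidden, d_model)) * 0.02

def silu(z):
    return z / (1.0 + np.exp(-z))

def swiglu_mlp(x):
    # SwiGLU: (silu(x W_gate) * (x W_up)) W_down
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

y = swiglu_mlp(rng.normal(size=(4, d_model)))
assert y.shape == (4, d_model)
```

Note the three weight matrices instead of two, so at matched parameter count the hidden size is smaller than a ReLU-squared MLP's would be.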
weight tying
Used untied input and output embeddings instead of tied embeddings.
parameters: null
KV head count
Used full multi-head attention with one KV head per query head instead of grouped-query attention.
parameters: {"heads":8,"kv_heads":8}
per-dimension control parameters
Added learned residual mixing, attention scaling, MLP scaling, and query gain controls.
parameters: null
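One plausible reading of these controls is a learned elementwise vector on each residual stream and branch output. The submission does not describe the exact parameterization, so the identity-style initialization and the placement of the query gain below are guesses:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64    # hypothetical width; not stated in the submission

# Learned per-dimension control vectors (identity-style init is an assumption).
res_mix    = np.ones(d_model)   # residual mixing weights
attn_scale = np.ones(d_model)   # scales the attention branch
mlp_scale  = np.ones(d_model)   # scales the MLP branch
q_gain     = np.ones(d_model)   # per-dimension query gain

def attention(h, q_gain):
    # Stand-in: real attention would apply q_gain to the query projection.
    return 0.1 * (q_gain * h)

def mlp(h):
    return 0.1 * np.tanh(h)

def layer(h):
    h = res_mix * h + attn_scale * attention(h, q_gain)
    h = res_mix * h + mlp_scale * mlp(h)
    return h

h = layer(rng.normal(size=(10, d_model)))
assert h.shape == (10, d_model)
```

These vectors are cheap (a few multiples of d_model parameters per layer) but, per the "Other" section below, must be kept in fp32 through quantization to preserve quality.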
Optimizer
Muon
weight_decay: 0.04
momentum: 0.95
other_params: {"matrix_lr":0.045,"embed_lr":0.6}
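Muon's core step orthogonalizes the momentum-averaged gradient of each weight matrix via a Newton-Schulz iteration before applying it. A numpy sketch follows; the quintic coefficients are those from the public Muon reference implementation, but the exact wiring of lr, momentum, and weight decay here is an assumption about how this submission applies them:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    # Approximately orthogonalize G (quintic Newton-Schulz iteration;
    # coefficients from the public Muon reference implementation).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)     # Frobenius-normalize
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.045, momentum=0.95, wd=0.04):
    # Momentum buffer, orthogonalized update, decoupled weight decay (sketch).
    buf = momentum * buf + grad
    W = W * (1.0 - lr * wd) - lr * newton_schulz_orth(buf)
    return W, buf

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 8))
W2, buf2 = muon_step(W, rng.normal(size=(16, 8)), np.zeros_like(W))
assert W2.shape == W.shape
```

The matrix_lr of 0.045 would apply to 2D weight matrices handled by Muon; embeddings (embed_lr 0.6) are typically handled by a separate optimizer such as Adam.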
Compression
zlib
level: null
Evaluation
sliding window eval
parameters: {"stride":256}
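With a 2048-token window and stride 256, the window advances 256 tokens at a time and only newly covered tokens are scored, so every token after the first window gets at least 1792 tokens of context. A sketch of that bookkeeping, with a stand-in `score` function replacing the model:

```python
def sliding_window_nll(score, tokens, window=2048, stride=256):
    # score(ctx, n) -> summed NLL of the last n tokens of ctx (model stand-in).
    total, count, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        new = end - prev_end          # tokens not scored by a previous window
        total += score(tokens[begin:end], new)
        count += new
        prev_end = end
        if end == len(tokens):
            break
    return total / count              # average NLL per token

# Toy check: a "model" charging 1 nat per token averages exactly 1.0.
toy = sliding_window_nll(lambda ctx, n: float(n), list(range(5000)))
assert abs(toy - 1.0) < 1e-9
```

Dividing the per-byte NLL by ln 2 converts it to bits per byte, the val_bpb metric reported above.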
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Regularization
weight decay
parameters: {"value":0.04}
Quantization
int8
bits: 8
scope: all
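The quantize-then-compress pipeline can be sketched with stdlib zlib. Symmetric per-tensor scaling and zlib level 9 are assumptions here (the submission leaves the compression level unspecified); the key property is that int8 halves the bytes versus fp16 and the quantized values still compress further because their entropy is below 8 bits:

```python
import zlib
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor int8 quantization (scale choice is an assumption).
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(1024, 1024)).astype(np.float32)
q, s = quantize_int8(w)

raw = q.tobytes()
packed = zlib.compress(raw, level=9)   # level not stated in the submission
assert len(packed) < len(raw)          # quantized weights still compress

# Round-trip error is bounded by half a quantization step.
err = np.abs(dequantize(q, s) - w).max()
assert err <= s / 2 + 1e-8
```

Per the "Other" section, the learned control tensors are excluded from this path and stored in fp32, since they are tiny but quality-sensitive.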
Other
other
Learned per-dimension residual mixing, attention/MLP scaling, and query gain control tensors are kept in fp32 through the quantization step.
parameters: {"fp32_control_tensors":true}

Novel Contributions

  • 6-layer encoder-decoder architecture with learned skip connections
  • SwiGLU MLP instead of ReLU-squared
  • Untied embeddings with very high embedding learning rate
  • Full multi-head attention with 8 KV heads instead of GQA
  • Per-dimension learned control parameters for residual, attention, MLP, and query scaling
  • INT8 quantization combined with zlib compression
  • Training at sequence length 2048
  • Weight decay tuned to improve both quality and artifact size