PR #172

open

Add 3xMLP + Mixed Quant + Blockade/Sigma submission (val_bpb: 1.1812)

by GMaN1911 · View on GitHub
val_bpb
1.1812
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.37 MB

Training Techniques

Architecture
MLP3x
Expanded MLP width from 2x baseline to 3x to improve token representation capacity.
parameters: {"mlp_mult":3,"model_dim":512,"layers":9,"heads":8,"kv_heads":4}
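A back-of-envelope check of what the 3x expansion costs in parameters (this sketch assumes bias-free up/down projections, which the PR does not state):

```python
# With model_dim=512 and mlp_mult=3, each layer's MLP holds an up projection
# (512 x 1536) and a down projection (1536 x 512); 9 layers total ~14.2M
# weights, vs ~9.4M at the 2x baseline.

def mlp_params(model_dim=512, mult=3, layers=9):
    hidden = model_dim * mult
    per_layer = model_dim * hidden + hidden * model_dim  # up + down, no bias
    return layers * per_layer

params_3x = mlp_params(mult=3)  # 14,155,776
params_2x = mlp_params(mult=2)  #  9,437,184
```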
tied embeddings
Used tied embeddings with a higher embedding learning rate.
parameters: null
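A minimal PyTorch sketch of tied embeddings with a separate, higher embedding learning rate; the module names and both LR values are illustrative, since the PR does not publish its training code:

```python
import torch
import torch.nn as nn

vocab_size, model_dim = 50304, 512  # vocab size is an assumption

embed = nn.Embedding(vocab_size, model_dim)
lm_head = nn.Linear(model_dim, vocab_size, bias=False)
lm_head.weight = embed.weight  # tie: one shared tensor for input and output

# Separate param group so the tied embedding can train at a higher LR.
optimizer = torch.optim.AdamW(
    [{"params": [embed.weight], "lr": 3e-3}],  # illustrative LR values
    lr=6e-4,
)
```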
Quantization
mixed int8/int6
bits: 8
scope: attention int8, MLP int6
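A sketch of per-tensor symmetric quantization at the two stated bit widths, assuming signed ranges ([-127, 127] for int8, [-31, 31] for int6); the PR's actual rounding and packing scheme is not published:

```python
import numpy as np

def quantize(w, bits):
    qmax = 2 ** (bits - 1) - 1                           # 127 for int8, 31 for int6
    scale = max(float(np.abs(w).max()), 1e-12) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q8, s8 = quantize(w, bits=8)  # attention weights: int8
q6, s6 = quantize(w, bits=6)  # MLP weights: int6 (smaller artifact, coarser)
```

The roundtrip error per weight is bounded by half the scale, so int6 trades ~4x coarser resolution for a 25% smaller payload on the MLP weights.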
Evaluation
sliding window eval
parameters: {"stride":64}
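The bookkeeping for a stride-64 sliding-window evaluation can be sketched as follows; the 1024-token window size is an assumption, as only the stride appears in the PR:

```python
def eval_spans(seq_len, window=1024, stride=64):
    """Yield (start, end, scored_from) triples: after the first window, each
    window re-reads the preceding context and scores only its last `stride`
    new tokens, so every token is scored exactly once."""
    spans = []
    pos = 0  # first position not yet scored
    while pos < seq_len:
        end = min(pos + (window if pos == 0 else stride), seq_len)
        start = max(0, end - window)
        spans.append((start, end, pos))  # score tokens [pos, end)
        pos = end
    return spans
```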
LR Schedule
warmdown
parameters: {"warmdown_iters":2500}
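The warmdown schedule holds the base LR and then decays it linearly to zero over the final warmdown_iters steps; the total iteration count and base LR below are illustrative, as only warmdown_iters=2500 is stated:

```python
def get_lr(it, base_lr=0.02, total_iters=6000, warmdown_iters=2500):
    if it < total_iters - warmdown_iters:
        return base_lr                              # constant phase
    frac = (total_iters - it) / warmdown_iters      # 1 -> 0 over the warmdown
    return base_lr * frac
```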
Regularization
weight decay
parameters: {"muon_wd":0.02}
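A sketch of decoupled weight decay as typically applied alongside Muon-style updates: weights are shrunk by lr * wd independently of the gradient step. The 0.02 coefficient is from the PR; the lr here is illustrative:

```python
def apply_weight_decay(weights, lr=0.02, wd=0.02):
    # Decoupled decay: multiplicative shrink toward zero, separate from the
    # gradient-based update.
    return [w * (1.0 - lr * wd) for w in weights]
```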
Other
other
Blockade attention diversity: suppresses overlap between heads to encourage diverse attention patterns.
parameters: {"strength":0.15}
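The PR does not define the "Blockade" mechanism; one plausible reading, sketched here with illustrative names, is an auxiliary penalty on pairwise similarity between per-head attention maps, scaled by strength=0.15:

```python
import numpy as np

def diversity_penalty(attn, strength=0.15):
    """attn: (heads, q, k) attention probabilities for one example.
    Penalize the mean pairwise cosine similarity between heads' flattened
    maps, pushing heads toward distinct attention patterns."""
    h = attn.shape[0]
    flat = attn.reshape(h, -1)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = flat @ flat.T                       # (h, h) cosine similarities
    off_diag = sim[~np.eye(h, dtype=bool)]    # ignore self-similarity
    return strength * off_diag.mean()
```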
other
Sigma residuals: uncertainty-gated residual connections that dampen noisy head contributions.
parameters: {"strength":0.3}
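"Sigma residuals" is likewise the submitter's own term; a hedged sketch of one uncertainty-gated reading, where a residual branch is damped in proportion to its per-token dispersion with strength=0.3:

```python
import numpy as np

def sigma_residual(x, branch, strength=0.3):
    """Gate a residual branch by its per-token standard deviation: noisy
    (high-variance) contributions are damped before being added back."""
    sigma = branch.std(axis=-1, keepdims=True)   # per-token std
    gate = 1.0 / (1.0 + strength * sigma)        # in (0, 1]
    return x + gate * branch
```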

Novel Contributions

  • 3x MLP expansion to increase model capacity within the training budget
  • Mixed quantization using INT8 for attention and INT6 for MLP to fit under the 16MB cap
  • Blockade attention diversity mechanism to suppress overlapping heads
  • Uncertainty-gated sigma residuals to stabilize training
  • 10-minute 8xH100 training run with reported mixed-roundtrip val_bpb of 1.1812