PR #172

open

Add 3xMLP + Mixed Quant + Blockade/Sigma submission (val_bpb: 1.1812)

by GMaN1911 · View on GitHub
val_bpb
1.1812
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.37 MB

Training Techniques

Architecture
MLP3x
Expanded MLP width from 2x baseline to 3x to improve token representation capacity.
parameters: {"mlp_mult":3,"model_dim":512,"layers":9,"heads":8,"kv_heads":4}
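A back-of-envelope check of what the 3x expansion costs in parameters (this sketch assumes bias-free up/down projections, which the PR does not state):

```python
# With model_dim=512 and mlp_mult=3, each layer's MLP holds an up projection
# (512 x 1536) and a down projection (1536 x 512); 9 layers total ~14.2M
# weights, vs ~9.4M at the 2x baseline.

def mlp_params(model_dim=512, mult=3, layers=9):
    hidden = model_dim * mult
    per_layer = model_dim * hidden + hidden * model_dim  # up + down, no bias
    return layers * per_layer

params_3x = mlp_params(mult=3)  # 14,155,776
params_2x = mlp_params(mult=2)  #  9,437,184
```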
tied embeddings
Used tied embeddings with a higher embedding learning rate.
parameters: null
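A minimal PyTorch sketch of tied embeddings with a separate, higher embedding learning rate; the module names and both LR values are illustrative, since the PR does not publish its training code:

```python
import torch
import torch.nn as nn

vocab_size, model_dim = 50304, 512  # vocab size is an assumption

embed = nn.Embedding(vocab_size, model_dim)
lm_head = nn.Linear(model_dim, vocab_size, bias=False)
lm_head.weight = embed.weight  # tie: one shared tensor for input and output

# Separate param group so the tied embedding can train at a higher LR.
optimizer = torch.optim.AdamW(
    [{"params": [embed.weight], "lr": 3e-3}],  # illustrative LR values
    lr=6e-4,
)
```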
Quantization
mixed int8/int6
bits: 8
scope: attention int8, MLP int6
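A sketch of per-tensor symmetric quantization at the two stated bit widths, assuming signed ranges ([-127, 127] for int8, [-31, 31] for int6); the PR's actual rounding and packing scheme is not published:

```python
import numpy as np

def quantize(w, bits):
    qmax = 2 ** (bits - 1) - 1                           # 127 for int8, 31 for int6
    scale = max(float(np.abs(w).max()), 1e-12) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q8, s8 = quantize(w, bits=8)  # attention weights: int8
q6, s6 = quantize(w, bits=6)  # MLP weights: int6 (smaller artifact, coarser)
```

The roundtrip error per weight is bounded by half the scale, so int6 trades ~4x coarser resolution for a 25% smaller payload on the MLP weights.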
Evaluation
sliding window eval
parameters: {"stride":64}
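The bookkeeping for a stride-64 sliding-window evaluation can be sketched as follows; the 1024-token window size is an assumption, as only the stride appears in the PR:

```python
def eval_spans(seq_len, window=1024, stride=64):
    """Yield (start, end, scored_from) triples: after the first window, each
    window re-reads the preceding context and scores only its last `stride`
    new tokens, so every token is scored exactly once."""
    spans = []
    pos = 0  # first position not yet scored
    while pos < seq_len:
        end = min(pos + (window if pos == 0 else stride), seq_len)
        start = max(0, end - window)
        spans.append((start, end, pos))  # score tokens [pos, end)
        pos = end
    return spans
```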
LR Schedule
warmdown
parameters: {"warmdown_iters":2500}
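The warmdown schedule holds the base LR and then decays it linearly to zero over the final warmdown_iters steps; the total iteration count and base LR below are illustrative, as only warmdown_iters=2500 is stated:

```python
def get_lr(it, base_lr=0.02, total_iters=6000, warmdown_iters=2500):
    if it < total_iters - warmdown_iters:
        return base_lr                              # constant phase
    frac = (total_iters - it) / warmdown_iters      # 1 -> 0 over the warmdown
    return base_lr * frac
```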
Regularization
weight decay
parameters: {"muon_wd":0.02}
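A sketch of decoupled weight decay as typically applied alongside Muon-style updates: weights are shrunk by lr * wd independently of the gradient step. The 0.02 coefficient is from the PR; the lr here is illustrative:

```python
def apply_weight_decay(weights, lr=0.02, wd=0.02):
    # Decoupled decay: multiplicative shrink toward zero, separate from the
    # gradient-based update.
    return [w * (1.0 - lr * wd) for w in weights]
```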
Other
other
Blockade attention diversity: suppresses overlap between heads to encourage diverse attention patterns.
parameters: {"strength":0.15}
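The PR does not define the "Blockade" mechanism; one plausible reading, sketched here with illustrative names, is an auxiliary penalty on pairwise similarity between per-head attention maps, scaled by strength=0.15:

```python
import numpy as np

def diversity_penalty(attn, strength=0.15):
    """attn: (heads, q, k) attention probabilities for one example.
    Penalize the mean pairwise cosine similarity between heads' flattened
    maps, pushing heads toward distinct attention patterns."""
    h = attn.shape[0]
    flat = attn.reshape(h, -1)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = flat @ flat.T                       # (h, h) cosine similarities
    off_diag = sim[~np.eye(h, dtype=bool)]    # ignore self-similarity
    return strength * off_diag.mean()
```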
other
Sigma residuals: uncertainty-gated residual connections that dampen noisy head contributions.
parameters: {"strength":0.3}
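"Sigma residuals" is likewise the submitter's own term; a hedged sketch of one uncertainty-gated reading, where a residual branch is damped in proportion to its per-token dispersion with strength=0.3:

```python
import numpy as np

def sigma_residual(x, branch, strength=0.3):
    """Gate a residual branch by its per-token standard deviation: noisy
    (high-variance) contributions are damped before being added back."""
    sigma = branch.std(axis=-1, keepdims=True)   # per-token std
    gate = 1.0 / (1.0 + strength * sigma)        # in (0, 1]
    return x + gate * branch
```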

Novel Contributions

  • 3x MLP expansion to increase model capacity within the training budget
  • Mixed quantization using INT8 for attention and INT6 for MLP to fit under the 16MB cap
  • Blockade attention diversity mechanism to suppress overlapping heads
  • Uncertainty-gated sigma residuals to stabilize training
  • 10-minute 8xH100 training run with reported mixed-roundtrip val_bpb of 1.1812