PR #391 (closed)
Add MaxParams6L_120 submission (1.2374 BPB) to track_non_record_16mb
by NishantDahal
val_bpb
1.2374
Architecture
Transformer
Optimizer
Muon
Artifact Size
13.5MB
Training Techniques
Architecture
encoder-decoder depth split
6-layer encoder-decoder model with 3 encoder and 3 decoder layers plus learned skip connections, instead of the usual deeper stack.
parameters: {"layers":6,"encoder_layers":3,"decoder_layers":3}
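The encoder-decoder split with learned skip connections could be wired roughly as below. This is a minimal NumPy sketch of the data flow only: the stand-in `block` (a placeholder for a real attention + MLP layer), the model width, and the 0.5 gate initialization are all hypothetical, not taken from the submission.

```python
import numpy as np

def block(x, seed):
    # Hypothetical stand-in for a full transformer layer (attention + MLP).
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((x.shape[-1], x.shape[-1])) * 0.02
    return x + np.tanh(x @ w)

def forward(x, n_enc=3, n_dec=3):
    # Encoder half: run layers and stash each activation for later reuse.
    skips = []
    for i in range(n_enc):
        x = block(x, seed=i)
        skips.append(x)
    # Decoder half: blend in the matching encoder activation via a
    # learned scalar gate (shown here at an assumed init of 0.5).
    gates = [0.5] * n_dec
    for i in range(n_dec):
        x = gates[i] * skips[n_dec - 1 - i] + (1 - gates[i]) * x
        x = block(x, seed=100 + i)
    return x

x = np.zeros((4, 16))  # (tokens, model dim), toy sizes
y = forward(x)
```

The U-Net-style pairing (decoder layer i reads the activation of encoder layer n_enc − i) is one common way to realize such skips; the submission's exact pairing may differ.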
SwiGLU MLP
Uses SwiGLU feed-forward blocks instead of the baseline's ReLU-squared MLPs.
parameters: {"hidden_size":1280}
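A SwiGLU feed-forward block gates an up-projection with a SiLU-activated second projection before projecting back down. The sketch below uses the submission's hidden size of 1280; the model width of 320 and the initialization are assumptions for illustration.

```python
import numpy as np

def swiglu_mlp(x, w_gate, w_up, w_down):
    # SwiGLU: SiLU(x @ W_gate) elementwise-gates (x @ W_up),
    # then W_down projects back to the model dimension.
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))   # SiLU / swish activation
    return (silu * (x @ w_up)) @ w_down

d_model, hidden = 320, 1280   # hidden_size from the submission; d_model assumed
rng = np.random.default_rng(0)
w_gate = rng.standard_normal((d_model, hidden)) * 0.02
w_up   = rng.standard_normal((d_model, hidden)) * 0.02
w_down = rng.standard_normal((hidden, d_model)) * 0.02
out = swiglu_mlp(rng.standard_normal((8, d_model)), w_gate, w_up, w_down)
```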
weight tying
Weight tying is disabled: the input and output embedding matrices are kept separate.
parameters: null
KV head count
Uses full multi-head attention with one KV head per query head instead of grouped-query attention.
parameters: {"kv_heads":8}
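The difference from grouped-query attention is only in how many distinct KV heads exist. A sketch, with toy sequence length and head dimension (the head dimension is an assumption): under GQA each KV head would be repeated `q_heads // kv_heads` times, but with 8 KV heads for 8 query heads the repeat factor is 1 and every query head gets its own K/V.

```python
import numpy as np

def attention(q, k, v):
    # q, k, v: (heads, tokens, head_dim)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores -= scores.max(-1, keepdims=True)      # numerically stable softmax
    p = np.exp(scores)
    p /= p.sum(-1, keepdims=True)
    return p @ v

n_q_heads, n_kv_heads, T, hd = 8, 8, 16, 32      # full MHA: kv_heads == q_heads
rng = np.random.default_rng(0)
q  = rng.standard_normal((n_q_heads, T, hd))
kv = rng.standard_normal((2, n_kv_heads, T, hd))
# GQA would repeat each KV head across its query group; ratio 1 -> no sharing.
k = np.repeat(kv[0], n_q_heads // n_kv_heads, axis=0)
v = np.repeat(kv[1], n_q_heads // n_kv_heads, axis=0)
out = attention(q, k, v)
```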
learned per-dimension control knobs
Adds learned residual mixing, attention scaling, MLP scaling, and per-head query gain parameters.
parameters: null
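The four kinds of learned knobs could slot into a block as shown below. All shapes, the convex-blend form of residual mixing, and the init values are assumptions made for illustration; the submission does not spell out the exact parameterization.

```python
import numpy as np

d, n_heads, T, head_dim = 32, 4, 8, 8
rng = np.random.default_rng(0)

# Learned parameters, shown at plausible init values (hypothetical).
resid_mix  = np.full(d, 0.5)   # per-dim blend of residual vs. attention branch
attn_scale = np.ones(d)        # per-dim gain on the attention output
mlp_scale  = np.ones(d)        # per-dim gain on the MLP output
q_gain     = np.ones(n_heads)  # per-head multiplier on the query vectors

h        = rng.standard_normal((T, d))
attn_out = rng.standard_normal((T, d))
mlp_out  = rng.standard_normal((T, d))
q        = rng.standard_normal((n_heads, T, head_dim))

# Residual mixing: learned per-dimension blend between the incoming
# residual stream and the attention branch, instead of a plain x + attn(x).
h = resid_mix * h + (1 - resid_mix) * (h + attn_scale * attn_out)
# Scaled MLP branch.
h = h + mlp_scale * mlp_out
# Per-head query gain, applied before the attention dot products.
q = q * q_gain[:, None, None]
```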
Optimizer
Muon
weight_decay: 0.04
momentum: 0.95
other_params: {"matrix_lr":0.045}
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
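The int8 + zlib artifact pipeline can be sketched as symmetric per-tensor quantization followed by zlib over the int8 payload. Per-tensor scaling and the toy weight statistics are assumptions; the submission only states 8-bit quantization over all weights with zlib compression.

```python
import zlib
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor int8: the largest |w| maps to +/-127.
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = (rng.standard_normal((256, 256)) * 0.02).astype(np.float32)
q, scale = quantize_int8(w)

# Dequantization error is bounded by half a quantization step.
err = np.abs(w - q.astype(np.float32) * scale).max()

# zlib over the raw int8 bytes. Stronger weight decay pulls weights
# toward zero, shrinking the int8 range and improving this ratio,
# which is presumably why weight decay does double duty here.
packed = zlib.compress(q.tobytes(), level=9)
ratio = len(packed) / q.nbytes
```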
Evaluation
sliding window eval
parameters: {"stride":256}
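Sliding-window evaluation with stride 256 scores each token once while giving it close to a full window of left context. A minimal sketch, assuming `nll_fn` is some model wrapper returning per-token negative log-likelihoods in nats (the interface is hypothetical; a real harness would also convert per-token bits to bits per byte):

```python
import math

def sliding_window_eval(nll_fn, tokens, window=2048, stride=256):
    # The window advances `stride` tokens at a time; only the newest
    # `stride` tokens of each window are counted, so every token is
    # scored exactly once with up to `window` tokens of context.
    total, count = 0.0, 0
    for start in range(0, len(tokens), stride):
        end = start + stride
        ctx = tokens[max(0, end - window): end]
        new = tokens[start:end]
        total += sum(nll_fn(ctx)[-len(new):])
        count += len(new)
    return total / count / math.log(2)   # nats -> mean bits per token

# Toy model assigning probability 1/2 to every token -> 1 bit per token.
bits = sliding_window_eval(lambda ctx: [math.log(2)] * len(ctx),
                           tokens=list(range(1000)))
```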
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Regularization
weight decay
parameters: {"weight_decay":0.04}
Novel Contributions
- 6-layer encoder-decoder architecture with learned skip connections to maximize learning under fixed wallclock
- Untied embeddings enabling a much higher embedding learning rate
- Full multi-head attention with 8 KV heads instead of grouped-query attention
- Per-dimension learned control parameters for residual mixing and attention/MLP scaling
- SwiGLU MLP replacing the baseline's ReLU-squared MLP
- Training and evaluating at sequence length 2048
- Weight decay used both for optimization and to improve INT8 compressibility
- Sliding-window evaluation with stride 256