PR #395

open

Add MaxParams6L_120 submission (1.2374 BPB) to track_non_record_16mb

by NishantDahal
val_bpb
1.2374
Architecture
Transformer
Optimizer
Muon
Artifact Size
13.5MB

Training Techniques

Architecture
depth reduction / encoder-decoder split
6-layer encoder-decoder architecture with 3 encoder and 3 decoder layers plus learned skip connections, instead of the usual 9-11 layers.
parameters: {"layers":6,"encoder_layers":3,"decoder_layers":3}
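The 3+3 encoder-decoder split with learned skips can be sketched as follows. This is a minimal numpy illustration, not the submission's code: the model width, the U-Net-style pairing of encoder and decoder layers, and the scalar skip-weight initialization are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64   # hypothetical width; the PR does not state it

def block(h):
    # Stand-in for one transformer layer (attention + MLP omitted).
    return np.tanh(h)

x = rng.normal(size=(10, d_model))   # (seq_len, d_model)

# 3 encoder layers: keep each layer's output for the skip connections.
enc_outs, h = [], x
for _ in range(3):
    h = block(h)
    enc_outs.append(h)

# 3 decoder layers: decoder layer i mixes in encoder output (2 - i) via a
# learned scalar skip weight (pairing and init are guesses).
skip_w = np.ones(3)
for i in range(3):
    h = block(h + skip_w[i] * enc_outs[2 - i])

assert h.shape == (10, d_model)
```

The point of the skips is that a 6-layer stack can recover some of the expressivity of the usual 9-11 layer models by letting late layers read early-layer features directly.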
SwiGLU MLP
Replaced the usual ReLU-squared MLP with a SwiGLU feedforward block.
parameters: {"hidden_size":1280}
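For reference, SwiGLU gates an "up" projection with a SiLU-activated "gate" projection, versus the baseline's single ReLU-squared projection. A minimal numpy sketch, using the submission's hidden_size of 1280 (the model width and weight init are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, hidden = 512, 1280   # hidden_size 1280 from the submission; d_model assumed

W_gate = rng.normal(size=(d_model, hidden)) * 0.02
W_up   = rng.normal(size=(d_model, hidden)) * 0.02
W_down = rng.normal(size=(hidden, d_model)) * 0.02

def silu(z):
    return z / (1.0 + np.exp(-z))

def swiglu_mlp(x):
    # SwiGLU: (silu(x W_gate) * (x W_up)) W_down
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

y = swiglu_mlp(rng.normal(size=(4, d_model)))
assert y.shape == (4, d_model)
```

Note the three weight matrices instead of two, so at matched parameter count the hidden size is smaller than a ReLU-squared MLP's would be.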
weight tying
Used untied input and output embeddings instead of tied embeddings.
parameters: null
KV head count
Used full multi-head attention with one KV head per query head instead of grouped-query attention.
parameters: {"heads":8,"kv_heads":8}
per-dimension control parameters
Added learned residual mixing, attention scaling, MLP scaling, and query gain controls.
parameters: null
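One plausible reading of these controls is a learned elementwise vector on each residual stream and branch output. The submission does not describe the exact parameterization, so the identity-style initialization and the placement of the query gain below are guesses:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64    # hypothetical width; not stated in the submission

# Learned per-dimension control vectors (identity-style init is an assumption).
res_mix    = np.ones(d_model)   # residual mixing weights
attn_scale = np.ones(d_model)   # scales the attention branch
mlp_scale  = np.ones(d_model)   # scales the MLP branch
q_gain     = np.ones(d_model)   # per-dimension query gain

def attention(h, q_gain):
    # Stand-in: real attention would apply q_gain to the query projection.
    return 0.1 * (q_gain * h)

def mlp(h):
    return 0.1 * np.tanh(h)

def layer(h):
    h = res_mix * h + attn_scale * attention(h, q_gain)
    h = res_mix * h + mlp_scale * mlp(h)
    return h

h = layer(rng.normal(size=(10, d_model)))
assert h.shape == (10, d_model)
```

These vectors are cheap (a few multiples of d_model parameters per layer) but, per the "Other" section below, must be kept in fp32 through quantization to preserve quality.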
Optimizer
Muon
weight_decay: 0.04
momentum: 0.95
other_params: {"matrix_lr":0.045,"embed_lr":0.6}
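Muon's core step orthogonalizes the momentum-averaged gradient of each weight matrix via a Newton-Schulz iteration before applying it. A numpy sketch follows; the quintic coefficients are those from the public Muon reference implementation, but the exact wiring of lr, momentum, and weight decay here is an assumption about how this submission applies them:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    # Approximately orthogonalize G (quintic Newton-Schulz iteration;
    # coefficients from the public Muon reference implementation).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)     # Frobenius-normalize
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.045, momentum=0.95, wd=0.04):
    # Momentum buffer, orthogonalized update, decoupled weight decay (sketch).
    buf = momentum * buf + grad
    W = W * (1.0 - lr * wd) - lr * newton_schulz_orth(buf)
    return W, buf

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 8))
W2, buf2 = muon_step(W, rng.normal(size=(16, 8)), np.zeros_like(W))
assert W2.shape == W.shape
```

The matrix_lr of 0.045 would apply to 2D weight matrices handled by Muon; embeddings (embed_lr 0.6) are typically handled by a separate optimizer such as Adam.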
Compression
zlib
level: null
Evaluation
sliding window eval
parameters: {"stride":256}
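With a 2048-token window and stride 256, the window advances 256 tokens at a time and only newly covered tokens are scored, so every token after the first window gets at least 1792 tokens of context. A sketch of that bookkeeping, with a stand-in `score` function replacing the model:

```python
def sliding_window_nll(score, tokens, window=2048, stride=256):
    # score(ctx, n) -> summed NLL of the last n tokens of ctx (model stand-in).
    total, count, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        new = end - prev_end          # tokens not scored by a previous window
        total += score(tokens[begin:end], new)
        count += new
        prev_end = end
        if end == len(tokens):
            break
    return total / count              # average NLL per token

# Toy check: a "model" charging 1 nat per token averages exactly 1.0.
toy = sliding_window_nll(lambda ctx, n: float(n), list(range(5000)))
assert abs(toy - 1.0) < 1e-9
```

Dividing the per-byte NLL by ln 2 converts it to bits per byte, the val_bpb metric reported above.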
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Regularization
weight decay
parameters: {"value":0.04}
Quantization
int8
bits: 8
scope: all
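The quantize-then-compress pipeline can be sketched with stdlib zlib. Symmetric per-tensor scaling and zlib level 9 are assumptions here (the submission leaves the compression level unspecified); the key property is that int8 halves the bytes versus fp16 and the quantized values still compress further because their entropy is below 8 bits:

```python
import zlib
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor int8 quantization (scale choice is an assumption).
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(1024, 1024)).astype(np.float32)
q, s = quantize_int8(w)

raw = q.tobytes()
packed = zlib.compress(raw, level=9)   # level not stated in the submission
assert len(packed) < len(raw)          # quantized weights still compress

# Round-trip error is bounded by half a quantization step.
err = np.abs(dequantize(q, s) - w).max()
assert err <= s / 2 + 1e-8
```

Per the "Other" section, the learned control tensors are excluded from this path and stored in fp32, since they are tiny but quality-sensitive.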
Other
other
Learned per-dimension residual mixing, attention/MLP scaling, and query gain control tensors are kept in fp32 through the quantization step.
parameters: {"fp32_control_tensors":true}

Novel Contributions

  • 6-layer encoder-decoder architecture with learned skip connections
  • SwiGLU MLP instead of ReLU-squared
  • Untied embeddings with very high embedding learning rate
  • Full multi-head attention with 8 KV heads instead of GQA
  • Per-dimension learned control parameters for residual, attention, MLP, and query scaling
  • INT8 quantization combined with zlib compression
  • Training at sequence length 2048
  • Weight decay tuned to improve both quality and artifact size