PR #1287 (open)

Record: Vocab4096 + MLP4.0x + WD0.085 - val_bpb 1.1048 (3-seed mean)

by dentity007

val_bpb: 1.1048
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.95 MB

Training Techniques

Architecture

  • BigramHash: removed BigramHash embeddings from the prior SOTA setup.
  • SmearGate: removed SmearGate from the prior SOTA setup.
  • Value Residual: removed value residual connections from the prior SOTA setup.
  • Gated Attention: removed gated attention from the prior SOTA setup.
  • XSA: cross-sequence attention used in all layers (parameters: {"layers": 11}).
  • U-Net skip connections: sigmoid-gated U-Net skip connections.
  • GQA: grouped-query attention with 8 attention heads and 4 KV heads (parameters: {"heads": 8, "kv_heads": 4}).
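A minimal sketch of what the 8-head / 4-KV-head split implies: in grouped-query attention, query heads are routed onto a smaller set of shared K/V projections. The routing function below is an illustration of the standard GQA grouping rule, not code from this submission.

```python
def kv_head_for(query_head: int, n_heads: int = 8, n_kv_heads: int = 4) -> int:
    """Map a query head to its shared KV head under grouped-query attention.

    With 8 query heads over 4 KV heads, each pair of adjacent query heads
    reads the same K/V projection, halving KV parameters and cache vs. MHA.
    """
    assert n_heads % n_kv_heads == 0, "query heads must split evenly into KV groups"
    group_size = n_heads // n_kv_heads  # 2 query heads per KV head here
    return query_head // group_size

# Query heads 0..7 pair up onto KV heads 0..3: [0, 0, 1, 1, 2, 2, 3, 3]
mapping = [kv_head_for(h) for h in range(8)]
```

Halving the KV heads matters here mainly as a parameter saving, which helps under the artifact-size cap.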
  • MLP4x: expanded MLP width to 4.0x (parameters: {"multiplier": 4}).
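To see what the 4.0x multiplier costs, here is a back-of-the-envelope parameter count for a transformer MLP block. The `d_model` value and the gated variant are illustrative assumptions; the PR only reports the multiplier.

```python
def mlp_params(d_model: int, multiplier: float = 4.0, gated: bool = False) -> int:
    """Approximate parameter count of a transformer MLP block (biases ignored).

    A plain up-projection + down-projection costs 2 * multiplier * d_model^2;
    a gated variant (e.g. SwiGLU) adds a third matrix of the same hidden width.
    """
    hidden = int(multiplier * d_model)
    n_mats = 3 if gated else 2
    return n_mats * d_model * hidden

# For an assumed d_model of 512, growing the multiplier from 3.0x to 4.0x
# adds 2 * 512 * 512 = 524,288 parameters per block.
delta = mlp_params(512, 4.0) - mlp_params(512, 3.0)
```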
  • weight tying: not mentioned.
Optimizer

  • Muon: weight_decay 0.085 (momentum, other params not reported)
  • Adam: weight_decay 0.02 (momentum, other params not reported)
Weight Averaging

  • EMA (parameters: {"decay": 0.997})
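The EMA update with decay 0.997 is the standard one-line rule; a dependency-free sketch, with the toy values being illustrative only:

```python
def ema_update(avg, params, decay=0.997):
    """One exponential-moving-average step per parameter:
    avg <- decay * avg + (1 - decay) * params."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

# The averaged weights track a slow-moving copy of the training weights;
# decay 0.997 gives an effective horizon of roughly 1 / (1 - 0.997) ≈ 333 steps.
shadow = [0.0, 1.0]
shadow = ema_update(shadow, [1.0, 1.0])
```

Evaluation and export would use the averaged copy rather than the raw weights.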
Quantization

  • GPTQ: bits 6, scope all
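For intuition about what 6-bit weights mean, here is plain uniform round-to-nearest quantization to 64 levels. This is deliberately *not* GPTQ, which additionally compensates rounding error using second-order (Hessian) statistics; it only shows the bit budget and reconstruction-error bound involved.

```python
def quantize_6bit(weights):
    """Uniform round-to-nearest 6-bit quantization (64 levels) over [min, max].

    Illustration only: real GPTQ also redistributes each weight's rounding
    error into not-yet-quantized weights using Hessian information.
    """
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 63  # 2**6 - 1 quantization steps
    q = [round((w - lo) / scale) for w in weights]          # integer codes 0..63
    deq = [lo + qi * scale for qi in q]                      # reconstruction
    return q, deq

weights = [-1.0, -0.5, 0.0, 0.5, 1.0]
q, deq = quantize_6bit(weights)
# Per-weight reconstruction error is bounded by scale / 2.
```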
Compression

  • brotli (level 11)
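The byte-shuffle transform mentioned under Novel Contributions can be sketched directly: it transposes the byte matrix of fixed-width items so that same-significance bytes sit next to each other, which a general-purpose codec then compresses better. The artifact uses brotli at quality 11; the sketch below uses stdlib `zlib` as a stand-in codec so it runs without the third-party `brotli` package, and the item width of 2 bytes is an assumption.

```python
import struct
import zlib

def byte_shuffle(raw: bytes, itemsize: int) -> bytes:
    """Group the i-th byte of every item together (byte-matrix transpose).

    For quantized weights this clusters the low-entropy high bytes,
    which codecs such as brotli exploit well."""
    return bytes(raw[j] for i in range(itemsize) for j in range(i, len(raw), itemsize))

def byte_unshuffle(shuffled: bytes, itemsize: int) -> bytes:
    """Inverse transform: scatter bytes back into interleaved item order."""
    n = len(shuffled) // itemsize
    out = bytearray(len(shuffled))
    k = 0
    for i in range(itemsize):
        for j in range(n):
            out[j * itemsize + i] = shuffled[k]
            k += 1
    return bytes(out)

# Pack 8 similar 16-bit values and round-trip the shuffle.
raw = struct.pack("<8H", *range(1000, 1008))
compressed = zlib.compress(byte_shuffle(raw, 2), 9)  # artifact uses brotli -q 11
```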
Evaluation

  • sliding window eval (parameters: {"stride": 64})
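Sliding-window evaluation with stride 64 means each forward pass scores only its last 64 tokens, with the rest of the window serving as context, so every token is scored exactly once. A sketch of the span bookkeeping (toy window/stride sizes for readability; only the stride of 64 comes from the PR):

```python
def sliding_eval_spans(n_tokens: int, window: int, stride: int = 64):
    """Spans for sliding-window evaluation.

    Each span (begin, score_from, end) feeds tokens [begin, end) to the
    model but scores only [score_from, end), so no token is counted
    twice and each scored token gets up to (window - stride) extra
    left context."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, prev_end, end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_eval_spans(n_tokens=10, window=4, stride=2)
```

This costs roughly window/stride forward passes per token but removes the penalty of scoring early tokens with truncated context.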
LR Schedule

  • warmdown (parameters: {"warmdown_fraction": 0.667})
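A warmdown schedule with fraction 0.667 holds the learning rate flat and then decays it over the final two-thirds of training. A minimal sketch, assuming linear decay to zero (the PR does not state the decay shape):

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_fraction: float = 0.667) -> float:
    """Hold base_lr constant, then decay linearly to zero over the
    final warmdown_fraction of training (~67% of steps here)."""
    decay_start = total_steps * (1.0 - warmdown_fraction)
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - decay_start)

# With 1000 total steps, decay begins around step 333 and reaches 0 at the end.
```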
Regularization

  • weight decay (parameters: {"muon": 0.085, "embeddings": 0.085, "adam": 0.02})
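The three decay values imply per-group weight decay. The PR reports only the values (0.085 / 0.085 / 0.02); the parameter names and the name/shape-based routing rule below are hypothetical, shown only to make the grouping concrete.

```python
def weight_decay_groups(named_params):
    """Split (name, shape) pairs into the three decay groups reported here.

    Routing rule is an assumed illustration: embeddings by name,
    2-D matrices to Muon, remaining 1-D params to Adam."""
    groups = {"muon":       {"weight_decay": 0.085, "params": []},
              "embeddings": {"weight_decay": 0.085, "params": []},
              "adam":       {"weight_decay": 0.02,  "params": []}}
    for name, shape in named_params:
        if "embed" in name:
            groups["embeddings"]["params"].append(name)
        elif len(shape) >= 2:
            groups["muon"]["params"].append(name)
        else:
            groups["adam"]["params"].append(name)
    return groups

# Hypothetical parameter names, for illustration only:
g = weight_decay_groups([
    ("embed.weight", (4096, 256)),
    ("blocks.0.attn.qkv.weight", (768, 256)),
    ("blocks.0.norm.scale", (256,)),
])
```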
Sequence Length

  • train_length: not reported
  • eval_length: not reported

Novel Contributions

  • Increased the vocabulary size to 4096 using the sp4096 tokenizer
  • Expanded MLP width to 4.0x
  • Applied high weight decay to improve compressibility
  • Used byte shuffle plus brotli-11 compression to fit under the size cap
  • Removed several prior architectural components to simplify the model
  • Achieved improved 3-seed mean val_bpb of 1.1048