PR #1508

open

Record: SP4096 + Compressibility Regularization — val_bpb 1.11349 (6-seed mean)

by jpfeiffe

val_bpb

1.1135

Architecture

Transformer

Optimizer

Muon

Artifact Size

~15.68 MB

Training Techniques

Quantization

GPTQ

bits: 6

scope: all

Architecture

weight tying

Tied embeddings with the SP4096 tokenizer

parameters: {"vocab_size":4096}

GQA

Grouped query attention with 8 attention heads and 4 KV heads

parameters: {"heads":8,"kv_heads":4}
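As a rough illustration of the 8-head / 4-KV-head layout above, the sketch below shows how query heads could map onto shared KV heads. The consecutive grouping and the model width of 512 are assumptions made for the example; the PR summary does not state either.

```python
def kv_head_for(query_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Return the KV head that a given query head reads from.

    Assumes consecutive query heads share one KV head, which is the
    common convention but is not confirmed by the PR summary.
    """
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # 2 query heads per KV head here
    return query_head // group_size


def kv_projection_width(d_model: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Output width of the K (or V) projection under GQA."""
    head_dim = d_model // heads
    return kv_heads * head_dim


mapping = [kv_head_for(h) for h in range(8)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]

# Hypothetical model width of 512, chosen only for illustration.
print(kv_projection_width(512))  # 256, half the width of the query projection
```

With 4 KV heads for 8 query heads, the K and V projections are half the size they would be under standard multi-head attention, which is relevant when the artifact has to fit a fixed size budget.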

XSA

XSA attention applied to all 11 layers

parameters: {"layers":11}

BigramHash

BigramHash component (size 3072, 112 dimensions)

parameters: {"dimensions":112,"size":3072}

MLP3x

3x MLP with LeakyReLU squared activation

parameters: {"mlp_multiplier":3,"activation":"LeakyReLU"}
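A minimal sketch of what a "LeakyReLU squared" activation and a 3x MLP width could look like. The negative slope of 0.5 and the model width of 512 are hypothetical values picked for the example, and reading `mlp_multiplier` as a hidden-width multiplier is an assumption; none of these is given in the summary.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """Apply LeakyReLU, then square the result.

    negative_slope is a hypothetical value. Squaring makes the output
    non-negative, so negative inputs contribute a small positive value
    instead of being zeroed as in a squared ReLU.
    """
    y = x if x >= 0 else negative_slope * x
    return y * y


def mlp_hidden_width(d_model: int, mlp_multiplier: int = 3) -> int:
    """Hidden width of the MLP, assuming the multiplier scales d_model."""
    return mlp_multiplier * d_model


print(leaky_relu_squared(2.0))   # 4.0
print(leaky_relu_squared(-2.0))  # 1.0  (0.5 * -2.0 = -1.0, squared)
print(mlp_hidden_width(512))     # 1536
```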

Optimizer

Muon

weight_decay: null

momentum: null

other_params: null

LR Schedule

warmdown

parameters: {"iters":4000}

Regularization

weight decay

parameters: {"warmdown_multiplier":2}
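One way the warmdown schedule and the weight decay multiplier could interact is sketched below. It assumes a linear learning-rate decay over the final 4000 iterations and a step change in weight decay at the start of warmdown; the total step count and base weight decay are hypothetical, and the PR's actual schedule shape may differ.

```python
def lr_scale(step: int, total_steps: int, warmdown_iters: int = 4000) -> float:
    """Learning-rate multiplier: flat, then linear decay to zero (assumed shape)."""
    warmdown_start = total_steps - warmdown_iters
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_iters)


def weight_decay_at(step: int, total_steps: int, base_wd: float,
                    warmdown_iters: int = 4000,
                    warmdown_multiplier: float = 2.0) -> float:
    """Weight decay, scaled by warmdown_multiplier once warmdown begins."""
    in_warmdown = step >= total_steps - warmdown_iters
    return base_wd * (warmdown_multiplier if in_warmdown else 1.0)


# Hypothetical run length and base weight decay, for illustration only.
TOTAL_STEPS = 10_000
BASE_WD = 0.1

for step in (0, 5_999, 6_000, 8_000, 10_000):
    print(step, lr_scale(step, TOTAL_STEPS),
          weight_decay_at(step, TOTAL_STEPS, BASE_WD))
```

Stronger decay late in training pushes weights toward smaller magnitudes, which is consistent with the summary's stated goal of making the quantized artifact more compressible.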

Compression

brotli

level: 11

lzma

level: 9

Novel Contributions

SP4096 tokenizer replacing SP1024
Warmdown weight decay multiplier of 2.0 to increase compressibility
Selecting the smaller of brotli-11 and lzma-9 for artifact compression
Achieved a 6-seed mean val_bpb of 1.11349 with 0% pruning, with all artifacts under 16 MB
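
The "smaller of brotli-11 and lzma-9" selection listed above can be sketched as follows. This is an illustrative sketch, not the PR's code: `compress_smallest` is a hypothetical helper name, the payload is a stand-in for the serialized model, and brotli is treated as optional because it comes from the third-party Brotli package rather than the standard library.

```python
import lzma

try:
    import brotli  # third-party "Brotli" package; optional in this sketch
except ImportError:
    brotli = None


def compress_smallest(payload: bytes) -> tuple[str, bytes]:
    """Compress with each available codec and keep the smallest output."""
    candidates = {"lzma-9": lzma.compress(payload, preset=9)}
    if brotli is not None:
        candidates["brotli-11"] = brotli.compress(payload, quality=11)
    best = min(candidates, key=lambda name: len(candidates[name]))
    return best, candidates[best]


# Stand-in payload; a real artifact would be the quantized model bytes.
payload = bytes(range(256)) * 64
codec, blob = compress_smallest(payload)
print(codec, len(payload), "->", len(blob))
```

The loader has to know which codec won. Returning the codec name alongside the bytes is one option; another is to check for the xz magic bytes that `lzma.compress` writes at the start of its default output format.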