PR #1287 (open)

Record: Vocab4096 + MLP4.0x + WD0.085 - val_bpb 1.1048 (3-seed mean)

by dentity007

val_bpb: 1.1048
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.95 MB

Training Techniques

Architecture

  • BigramHash: removed BigramHash embeddings from the prior SOTA setup.
  • SmearGate: removed SmearGate from the prior SOTA setup.
  • Value Residual: removed value residual connections from the prior SOTA setup.
  • Gated Attention: removed gated attention from the prior SOTA setup.
  • XSA: cross-sequence attention used in all layers (parameters: {"layers": 11}).
  • U-Net skip connections: sigmoid-gated U-Net skip connections.
  • GQA: grouped-query attention with 8 attention heads and 4 KV heads (parameters: {"heads": 8, "kv_heads": 4}).
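A minimal sketch of what the 8-head / 4-KV-head split implies: in grouped-query attention, query heads are routed onto a smaller set of shared K/V projections. The routing function below is an illustration of the standard GQA grouping rule, not code from this submission.

```python
def kv_head_for(query_head: int, n_heads: int = 8, n_kv_heads: int = 4) -> int:
    """Map a query head to its shared KV head under grouped-query attention.

    With 8 query heads over 4 KV heads, each pair of adjacent query heads
    reads the same K/V projection, halving KV parameters and cache vs. MHA.
    """
    assert n_heads % n_kv_heads == 0, "query heads must split evenly into KV groups"
    group_size = n_heads // n_kv_heads  # 2 query heads per KV head here
    return query_head // group_size

# Query heads 0..7 pair up onto KV heads 0..3: [0, 0, 1, 1, 2, 2, 3, 3]
mapping = [kv_head_for(h) for h in range(8)]
```

Halving the KV heads matters here mainly as a parameter saving, which helps under the artifact-size cap.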
  • MLP4x: expanded MLP width to 4.0x (parameters: {"multiplier": 4}).
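To see what the 4.0x multiplier costs, here is a back-of-the-envelope parameter count for a transformer MLP block. The `d_model` value and the gated variant are illustrative assumptions; the PR only reports the multiplier.

```python
def mlp_params(d_model: int, multiplier: float = 4.0, gated: bool = False) -> int:
    """Approximate parameter count of a transformer MLP block (biases ignored).

    A plain up-projection + down-projection costs 2 * multiplier * d_model^2;
    a gated variant (e.g. SwiGLU) adds a third matrix of the same hidden width.
    """
    hidden = int(multiplier * d_model)
    n_mats = 3 if gated else 2
    return n_mats * d_model * hidden

# For an assumed d_model of 512, growing the multiplier from 3.0x to 4.0x
# adds 2 * 512 * 512 = 524,288 parameters per block.
delta = mlp_params(512, 4.0) - mlp_params(512, 3.0)
```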
  • weight tying: not mentioned.
Optimizer

  • Muon: weight_decay 0.085 (momentum, other params not reported)
  • Adam: weight_decay 0.02 (momentum, other params not reported)
Weight Averaging

  • EMA (parameters: {"decay": 0.997})
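The EMA update with decay 0.997 is the standard one-line rule; a dependency-free sketch, with the toy values being illustrative only:

```python
def ema_update(avg, params, decay=0.997):
    """One exponential-moving-average step per parameter:
    avg <- decay * avg + (1 - decay) * params."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

# The averaged weights track a slow-moving copy of the training weights;
# decay 0.997 gives an effective horizon of roughly 1 / (1 - 0.997) ≈ 333 steps.
shadow = [0.0, 1.0]
shadow = ema_update(shadow, [1.0, 1.0])
```

Evaluation and export would use the averaged copy rather than the raw weights.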
Quantization

  • GPTQ: bits 6, scope all
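For intuition about what 6-bit weights mean, here is plain uniform round-to-nearest quantization to 64 levels. This is deliberately *not* GPTQ, which additionally compensates rounding error using second-order (Hessian) statistics; it only shows the bit budget and reconstruction-error bound involved.

```python
def quantize_6bit(weights):
    """Uniform round-to-nearest 6-bit quantization (64 levels) over [min, max].

    Illustration only: real GPTQ also redistributes each weight's rounding
    error into not-yet-quantized weights using Hessian information.
    """
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 63  # 2**6 - 1 quantization steps
    q = [round((w - lo) / scale) for w in weights]          # integer codes 0..63
    deq = [lo + qi * scale for qi in q]                      # reconstruction
    return q, deq

weights = [-1.0, -0.5, 0.0, 0.5, 1.0]
q, deq = quantize_6bit(weights)
# Per-weight reconstruction error is bounded by scale / 2.
```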
Compression

  • brotli (level 11)
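The byte-shuffle transform mentioned under Novel Contributions can be sketched directly: it transposes the byte matrix of fixed-width items so that same-significance bytes sit next to each other, which a general-purpose codec then compresses better. The artifact uses brotli at quality 11; the sketch below uses stdlib `zlib` as a stand-in codec so it runs without the third-party `brotli` package, and the item width of 2 bytes is an assumption.

```python
import struct
import zlib

def byte_shuffle(raw: bytes, itemsize: int) -> bytes:
    """Group the i-th byte of every item together (byte-matrix transpose).

    For quantized weights this clusters the low-entropy high bytes,
    which codecs such as brotli exploit well."""
    return bytes(raw[j] for i in range(itemsize) for j in range(i, len(raw), itemsize))

def byte_unshuffle(shuffled: bytes, itemsize: int) -> bytes:
    """Inverse transform: scatter bytes back into interleaved item order."""
    n = len(shuffled) // itemsize
    out = bytearray(len(shuffled))
    k = 0
    for i in range(itemsize):
        for j in range(n):
            out[j * itemsize + i] = shuffled[k]
            k += 1
    return bytes(out)

# Pack 8 similar 16-bit values and round-trip the shuffle.
raw = struct.pack("<8H", *range(1000, 1008))
compressed = zlib.compress(byte_shuffle(raw, 2), 9)  # artifact uses brotli -q 11
```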
Evaluation

  • sliding window eval (parameters: {"stride": 64})
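Sliding-window evaluation with stride 64 means each forward pass scores only its last 64 tokens, with the rest of the window serving as context, so every token is scored exactly once. A sketch of the span bookkeeping (toy window/stride sizes for readability; only the stride of 64 comes from the PR):

```python
def sliding_eval_spans(n_tokens: int, window: int, stride: int = 64):
    """Spans for sliding-window evaluation.

    Each span (begin, score_from, end) feeds tokens [begin, end) to the
    model but scores only [score_from, end), so no token is counted
    twice and each scored token gets up to (window - stride) extra
    left context."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, prev_end, end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_eval_spans(n_tokens=10, window=4, stride=2)
```

This costs roughly window/stride forward passes per token but removes the penalty of scoring early tokens with truncated context.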
LR Schedule

  • warmdown (parameters: {"warmdown_fraction": 0.667})
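A warmdown schedule with fraction 0.667 holds the learning rate flat and then decays it over the final two-thirds of training. A minimal sketch, assuming linear decay to zero (the PR does not state the decay shape):

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_fraction: float = 0.667) -> float:
    """Hold base_lr constant, then decay linearly to zero over the
    final warmdown_fraction of training (~67% of steps here)."""
    decay_start = total_steps * (1.0 - warmdown_fraction)
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - decay_start)

# With 1000 total steps, decay begins around step 333 and reaches 0 at the end.
```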
Regularization

  • weight decay (parameters: {"muon": 0.085, "embeddings": 0.085, "adam": 0.02})
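The three decay values imply per-group weight decay. The PR reports only the values (0.085 / 0.085 / 0.02); the parameter names and the name/shape-based routing rule below are hypothetical, shown only to make the grouping concrete.

```python
def weight_decay_groups(named_params):
    """Split (name, shape) pairs into the three decay groups reported here.

    Routing rule is an assumed illustration: embeddings by name,
    2-D matrices to Muon, remaining 1-D params to Adam."""
    groups = {"muon":       {"weight_decay": 0.085, "params": []},
              "embeddings": {"weight_decay": 0.085, "params": []},
              "adam":       {"weight_decay": 0.02,  "params": []}}
    for name, shape in named_params:
        if "embed" in name:
            groups["embeddings"]["params"].append(name)
        elif len(shape) >= 2:
            groups["muon"]["params"].append(name)
        else:
            groups["adam"]["params"].append(name)
    return groups

# Hypothetical parameter names, for illustration only:
g = weight_decay_groups([
    ("embed.weight", (4096, 256)),
    ("blocks.0.attn.qkv.weight", (768, 256)),
    ("blocks.0.norm.scale", (256,)),
])
```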
Sequence Length

  • train_length: not reported
  • eval_length: not reported

Novel Contributions

  • Increased the vocabulary size to 4096 using the sp4096 tokenizer
  • Expanded MLP width to 4.0x
  • Applied high weight decay to improve compressibility
  • Used byte shuffle plus brotli-11 compression to fit under the size cap
  • Removed several prior architectural components to simplify the model
  • Achieved improved 3-seed mean val_bpb of 1.1048