PR #1218

RECORD (open)

Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean)

by clarkkev
val_bpb
1.0978
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,916,170

Training Techniques

Evaluation
sliding window eval
parameters: null
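The PR does not show its evaluation code, so the following is only a minimal sketch of how sliding window evaluation can score every validation token exactly once, including the clamped final window. The function name `scoring_spans` and the parameters `seq_len` and `stride` are hypothetical and not taken from the PR.

```python
def scoring_spans(n_tokens, seq_len, stride):
    """Return (window_start, window_end, score_start) triples.

    Each window covers tokens [window_start, window_end). Only tokens in
    [score_start, window_end) contribute to the loss; earlier tokens in the
    window serve purely as context.
    """
    if not 0 < stride <= seq_len:
        raise ValueError("stride must be in (0, seq_len]")
    spans = []
    scored_upto = 0  # index of the first token that has not been scored yet
    start = 0
    while scored_upto < n_tokens:
        end = min(start + seq_len, n_tokens)  # clamp the last window
        # Score only what earlier windows have not already scored.
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans


def score_counts(n_tokens, seq_len, stride):
    """Count how many times each token is scored (should always be 1)."""
    counts = [0] * n_tokens
    for _, end, score_start in scoring_spans(n_tokens, seq_len, stride):
        for t in range(score_start, end):
            counts[t] += 1
    return counts
```

Tracking `scored_upto` explicitly, rather than assuming every window contributes a fixed number of new tokens, is what keeps the shortened last window from re-scoring tokens that an earlier window already covered.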
Architecture
XSA
Use XSA in all layers instead of only the last 4.
parameters: null
MLP3x
Widened MLPs from a 3x to a 4x multiplier.
parameters: {"mlp_mult":4}
U-Net skip connections
Added sigmoid-gated skip connections to the U-Net.
parameters: null
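As an illustration of a sigmoid-gated skip connection, here is a small sketch. It assumes the gate is a single learnable scalar logit and that the gated encoder activation is added to the decoder activation; the PR does not state the exact form, and the names `gated_skip` and `gate_logit` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_skip(decoder_x, encoder_x, gate_logit):
    """Add the encoder activation to the decoder activation, scaled by a gate in (0, 1).

    Assumed form: out = decoder_x + sigmoid(gate_logit) * encoder_x.
    """
    g = sigmoid(gate_logit)
    return [d + g * e for d, e in zip(decoder_x, encoder_x)]
```

With `gate_logit = 0.0` the gate is 0.5, so half of the encoder activation is mixed in; a strongly negative logit effectively switches the skip path off.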
Gated Attention
Removed gated attention from the model.
parameters: null
Value Residual
Removed value residuals from the model.
parameters: null
BigramHash
Removed hash embeddings.
parameters: null
SmearGate
Removed the smear gate.
parameters: null
Optimizer
Muon
weight_decay: 0.085
momentum: null
other_params: {"embeddings_weight_decay":0.085}
AdamW
weight_decay: 0.02
momentum: null
other_params: null
Regularization
weight decay
parameters: {"muon_weight_decay":0.085,"embeddings_weight_decay":0.085,"adam_weight_decay":0.02}
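To make the three weight decay values concrete, the sketch below applies one decay step per parameter group. It assumes decoupled (AdamW-style) decay scaled by the learning rate, which the PR does not confirm for its Muon implementation; the group names and the `decay_step` helper are hypothetical.

```python
# Weight decay per parameter group, as listed in the record.
WEIGHT_DECAY = {"muon": 0.085, "embeddings": 0.085, "adam": 0.02}


def decay_step(weight, lr, group):
    """Shrink a weight toward zero: w <- w * (1 - lr * wd). Assumed decoupled form."""
    return weight * (1.0 - lr * WEIGHT_DECAY[group])
```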
LR Schedule
cosine decay
parameters: null
Quantization
GPTQ
bits: null
scope: all
Compression
brotli
level: null
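A byte shuffle regroups the bytes of fixed-width values so that all first bytes come first, then all second bytes, and so on, which tends to give the compressor longer runs of similar bytes. The sketch below is a stdlib-only illustration: it uses `zlib` as a stand-in for brotli (with the third-party `brotli` package, `brotli.compress` and `brotli.decompress` would take its place), and the helper names and `itemsize` parameter are hypothetical rather than taken from the PR.

```python
import zlib


def byte_shuffle(data: bytes, itemsize: int) -> bytes:
    """Group byte 0 of every item, then byte 1 of every item, and so on."""
    if len(data) % itemsize:
        raise ValueError("data length must be a multiple of itemsize")
    return b"".join(data[i::itemsize] for i in range(itemsize))


def byte_unshuffle(data: bytes, itemsize: int) -> bytes:
    """Invert byte_shuffle."""
    if len(data) % itemsize:
        raise ValueError("data length must be a multiple of itemsize")
    n = len(data) // itemsize  # number of items
    out = bytearray(len(data))
    for i in range(itemsize):
        out[i::itemsize] = data[i * n:(i + 1) * n]
    return bytes(out)


def pack(data: bytes, itemsize: int) -> bytes:
    # zlib is a stand-in here; the record itself uses brotli.
    return zlib.compress(byte_shuffle(data, itemsize), 9)


def unpack(blob: bytes, itemsize: int) -> bytes:
    return byte_unshuffle(zlib.decompress(blob), itemsize)
```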
Sequence Length
sequence_length
train_length: null
eval_length: null

Novel Contributions

  • Fixed a bug in sliding window evaluation that double-counted tokens near the end of validation.
  • Increased vocabulary size from 1024 to 4096 using a new SentencePiece tokenizer.
  • Used a larger, more strongly regularized model with higher weight decay and wider MLPs.
  • Added coprime-stride data loading to reduce nearby repeated document exposure.
  • Added Hessian-aware GPTQ quantization.
  • Used more efficient compression: byte shuffle followed by brotli.
  • Added sigmoid-gated skip connections to the U-Net.
  • Increased qk_gain_init from 1.5 to 4.
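The coprime-stride data loading item above can be sketched as follows. This is one possible reading, not the PR's implementation: sequence indices are visited in steps of a stride that shares no common factor with the number of sequences, so every sequence is seen exactly once per pass while consecutive batches draw from distant parts of the data. The names `coprime_stride` and `visit_order` are hypothetical.

```python
from math import gcd


def coprime_stride(n, target):
    """Return the smallest stride >= target that is coprime with n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    stride = max(1, target)
    while gcd(stride, n) != 1:
        stride += 1
    return stride


def visit_order(n, target_stride, offset=0):
    """Order in which n sequence indices are visited in one pass.

    Because the stride is coprime with n, (offset + i * stride) % n hits
    every index in range(n) exactly once.
    """
    stride = coprime_stride(n, target_stride)
    return [(offset + i * stride) % n for i in range(n)]
```

For example, with 10 sequences and a target stride of 4, the stride becomes 7 (4, 5 and 6 each share a factor with 10), and the pass starts 0, 7, 4, 1.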