PR #1218

RECORD (open)

Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean)

by clarkkev

val_bpb

1.0978

Architecture

Transformer

Optimizer

Muon

Artifact Size

15,916,170

Training Techniques

Evaluation

sliding window eval

parameters: null
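
To illustrate what counting each token once involves, the sketch below plans sliding-window spans so that every validation position is scored exactly one time, including the shortened final step. The function name, the `window` and `stride` parameters, and the span layout are assumptions made for illustration; this summary does not show the PR's evaluation code.

```python
def sliding_windows(n_tokens, window, stride):
    """Plan evaluation spans as (start, end, score_from) tuples.

    The model sees tokens[start:end] as context, but only positions
    [score_from, end) are added to the loss, so no position is counted twice.
    """
    if not 0 < stride <= window:
        raise ValueError("need 0 < stride <= window")
    spans = []
    scored_until = 0
    end = min(window, n_tokens)
    while scored_until < n_tokens:
        start = max(0, end - window)
        spans.append((start, end, scored_until))
        scored_until = end
        # Clamp the last step so the tail is covered without overlap.
        end = min(end + stride, n_tokens)
    return spans
```

Tracking `scored_until` explicitly, rather than always scoring the last `stride` positions of a window, is what keeps the final clamped window from re-scoring tokens that an earlier window already covered.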

Architecture

XSA

Use XSA in all layers instead of only the last 4.

parameters: null

MLP3x

Widened MLPs from 3x to 4x multiplier.

parameters: {"mlp_mult":4}
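
As a rough sense of what the wider multiplier costs, the sketch below counts the weights in one MLP block. It assumes a plain two-matrix MLP (up-projection and down-projection, no biases) and a hypothetical model width of 512; neither detail is stated in this summary.

```python
def mlp_params(d_model, mlp_mult):
    """Weights in one two-matrix MLP (up- and down-projection, no biases)."""
    hidden = mlp_mult * d_model
    return 2 * d_model * hidden

d_model = 512  # hypothetical width, not taken from the PR
print(mlp_params(d_model, 3))  # 1572864
print(mlp_params(d_model, 4))  # 2097152
```

Under these assumptions, moving from 3x to 4x grows each MLP by a factor of 4/3 regardless of the model width.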

U-Net skip connections

Added sigmoid-gated skip connections to the U-Net.

parameters: null
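
A minimal sketch of a sigmoid-gated skip connection follows. The additive form `x + sigmoid(gate) * skip`, the per-channel gate, and the mirrored layer pairing are assumptions for illustration; the summary only states that the skips are sigmoid-gated.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_skip(x, skip, gate):
    """Blend an encoder activation into the decoder stream, per channel.

    x:    decoder-side activations
    skip: matching encoder-side activations saved earlier
    gate: learnable logits; sigmoid(gate) in (0, 1) scales the skip path
    """
    return [xi + sigmoid(g) * si for xi, si, g in zip(x, skip, gate)]

def unet_pairs(n_layers):
    """Pair encoder layer i with decoder layer n_layers - 1 - i."""
    return [(i, n_layers - 1 - i) for i in range(n_layers // 2)]
```

With a gate logit of 0 the skip path is scaled by 0.5, and strongly negative logits let a layer effectively switch its skip off.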

Gated Attention

Removed gated attention from the model.

parameters: null

Value Residual

Removed value residuals from the model.

parameters: null

BigramHash

Removed hash embeddings.

parameters: null

SmearGate

Removed the smear gate.

parameters: null

Optimizer

Muon

weight_decay: 0.085

momentum: null

other_params: {"embeddings_weight_decay":0.085}

AdamW

weight_decay: 0.02

momentum: null

other_params: null

Regularization

weight decay

parameters: {"muon_weight_decay":0.085,"embeddings_weight_decay":0.085,"adam_weight_decay":0.02}
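
The three decay values above can be read as per-group settings. The sketch below applies them in the decoupled form popularized by AdamW, where the weight is shrunk directly instead of adding a penalty to the gradient. Whether the PR applies decay to the Muon-updated parameters in exactly this form is an assumption, and `decoupled_step`, the group names, and the learning rate are hypothetical.

```python
WEIGHT_DECAY = {
    "muon": 0.085,        # parameters updated by Muon
    "embeddings": 0.085,  # embedding tables
    "adam": 0.02,         # parameters updated by AdamW
}

def decoupled_step(param, update, lr, group):
    """One decoupled update: shrink the weight, then apply the optimizer step."""
    wd = WEIGHT_DECAY[group]
    return param * (1.0 - lr * wd) - lr * update
```

For example, with a hypothetical `lr` of 0.1 and a zero update, a Muon-group weight of 1.0 shrinks to about 0.9915 in one step, while an AdamW-group weight shrinks only to about 0.998.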

LR Schedule

cosine decay

parameters: null
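
For reference, a standard cosine decay schedule can be sketched as below. The summary gives no parameters, so the absence of warmup, the floor `lr_min`, and the argument names are all assumptions.

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at `step`, decaying from lr_max to lr_min over total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

The schedule starts at `lr_max`, passes through the midpoint of the two rates halfway through training, and ends at `lr_min`.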

Quantization

GPTQ

bits: null

scope: all
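
The idea behind Hessian-aware quantization can be sketched on a single weight row. This is a simplified column-by-column variant that updates the inverse Hessian explicitly after each column; production GPTQ implementations reach the same updates through a Cholesky factorization and add blocking and damping. The uniform grid, the `step` parameter, and both function names are assumptions, and the bit width used in the PR is not given in this summary.

```python
def quantize_row(weights, h_inv, step):
    """Quantize one weight row column by column, compensating later columns.

    weights: list of floats (one output row of a weight matrix)
    h_inv:   inverse of the layer Hessian H (e.g. H = 2 * X @ X.T), nested lists
    step:    spacing of the uniform quantization grid
    """
    w = list(weights)
    hinv = [list(r) for r in h_inv]
    n = len(w)
    out = []
    for i in range(n):
        q = step * round(w[i] / step)
        out.append(q)
        err = (w[i] - q) / hinv[i][i]
        for j in range(i + 1, n):
            # Shift not-yet-quantized weights to absorb the rounding error.
            w[j] -= err * hinv[i][j]
        for j in range(i + 1, n):
            # Eliminate column i from the inverse Hessian of the remaining set.
            for k in range(i + 1, n):
                hinv[j][k] -= hinv[j][i] * hinv[i][k] / hinv[i][i]
    return out

def layer_error(weights, quantized, h):
    """Quadratic proxy for output error: e^T H e with e = weights - quantized."""
    e = [a - b for a, b in zip(weights, quantized)]
    return sum(e[i] * h[i][j] * e[j] for i in range(len(e)) for j in range(len(e)))
```

On a toy case with `H = [[2, 1], [1, 2]]`, weights `[0.4, 0.4]`, and a grid step of 1, plain rounding yields `[0, 0]` while the compensated procedure yields `[0, 1]`, which has the lower quadratic error under `H`.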

Compression

brotli

level: null
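
Byte shuffling regroups multi-byte values so that bytes of equal significance sit next to each other, which tends to give a general-purpose compressor longer runs of similar bytes. The sketch below shows one common form of the transform and its inverse. The function names and the element `width` are assumptions, and `zlib` stands in for brotli only so that the example runs on the standard library; the PR itself uses brotli.

```python
import zlib

def byte_shuffle(data, width):
    """Group byte 0 of every element, then byte 1, and so on."""
    if len(data) % width:
        raise ValueError("length must be a multiple of the element width")
    return b"".join(data[k::width] for k in range(width))

def byte_unshuffle(data, width):
    """Invert byte_shuffle."""
    n = len(data) // width
    out = bytearray(len(data))
    for k in range(width):
        out[k::width] = data[k * n:(k + 1) * n]
    return bytes(out)

def pack(data, width):
    # zlib is a stand-in here; the PR uses brotli.
    return zlib.compress(byte_shuffle(data, width), 9)

def unpack(blob, width):
    return byte_unshuffle(zlib.decompress(blob), width)
```

The shuffle is lossless and costs no extra bytes, so any gain shows up purely as a smaller compressed artifact.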

Sequence Length

sequence_length

train_length: null

eval_length: null

Novel Contributions

Fixed a bug in sliding window evaluation that double-counted tokens near the end of validation.
Increased vocabulary size from 1024 to 4096 using a new SentencePiece tokenizer.
Used a larger, more strongly regularized model with higher weight decay and wider MLPs.
Added coprime-stride data loading to reduce repeated exposure to the same documents in nearby training steps.
Added GPTQ Hessian-aware quantization.
Used more efficient byte shuffle + brotli compression.
Added sigmoid-gated skip connections to the U-Net.
Increased qk_gain_init from 1.5 to 4.
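
The coprime-stride item in the list above relies on a simple number-theoretic fact: stepping through `n` items with a stride that shares no factor with `n` visits every item exactly once before repeating. A minimal sketch follows; the function names and the way the stride is chosen are assumptions, since this summary does not describe the PR's loader.

```python
import math

def coprime_stride(n, target):
    """Smallest stride >= target that shares no factor with n."""
    stride = max(1, target)
    while math.gcd(stride, n) != 1:
        stride += 1
    return stride

def visit_order(n, stride):
    """Order in which n items are read; a permutation when gcd(stride, n) == 1."""
    return [(i * stride) % n for i in range(n)]
```

For instance, `visit_order(10, 3)` reads positions 0, 3, 6, 9, 2, 5, 8, 1, 4, 7, so consecutive reads are spread apart while the full pass still covers every position once.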