PR #122

open

Record: Sliding Window Eval, 2048 Vocab Size, fp16 embeddings, SWA, NorMuon, FA3; mean_val_bpb:1.160

by mtybadger
val_bpb: 1.1603
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 15,353,270 bytes
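For reference, the val_bpb metric converts mean cross-entropy loss (nats per token) into bits per byte of raw text. A minimal sketch of that conversion; the function name and the token/byte counts in the test are illustrative, not taken from this run:

```python
import math

def bits_per_byte(mean_nll_nats, n_tokens, n_bytes):
    # bits/byte = (nats/token) * (tokens) / (bytes) / ln 2
    return mean_nll_nats * n_tokens / (n_bytes * math.log(2))
```

A tokenizer that packs more bytes into each token lowers bpb for the same per-token loss, which is why the vocabulary change below interacts with this metric.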

Training Techniques

Quantization
STE QAT int6
bits: 6
scope: row-wise weights; embeddings kept in fp16
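The forward pass of this scheme can be sketched as row-wise symmetric fake quantization; the backward pass of STE QAT (not shown) treats the op as identity. This is a hedged NumPy sketch of the general technique, not the record's actual implementation:

```python
import numpy as np

def fake_quant_int6_rowwise(w):
    """Row-wise symmetric fake quantization to signed int6 codes in [-31, 31].

    STE QAT forward: weights are quantized then dequantized here, while the
    backward pass passes gradients straight through as if this were identity.
    Per the record, embeddings would bypass this and stay fp16.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0.0, 1.0, scale)    # avoid div-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -31, 31)     # integer codes
    return q * scale                              # dequantized weights for the forward pass
```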
Architecture
MLP3x
Increased MLP hidden dimension from 1024 to 1536
parameters: {"hidden_dim":1536,"multiplier":3}
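The listed hidden_dim of 1536 with multiplier 3 implies a model width of 512 (an inference, not stated in the record). A minimal sketch of the widened MLP block; the activation is a placeholder, since the record does not specify one:

```python
import numpy as np

D_MODEL = 512           # inferred from hidden_dim 1536 / multiplier 3 (assumption)
D_HIDDEN = 3 * D_MODEL  # 1536, up from the previous 1024 (2x)

def mlp3x(x, w_in, w_out):
    # Two-layer MLP block at 3x width; ReLU stands in for whatever
    # activation the actual run used.
    return np.maximum(x @ w_in, 0.0) @ w_out
```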
Optimizer
NorMuon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
SWA
parameters: {"checkpoint_interval_steps":200,"num_checkpoints":7}
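SWA with these parameters amounts to a uniform average of the last 7 checkpoints, saved every 200 steps. A sketch with parameter dicts; the function name is hypothetical:

```python
import numpy as np

def swa_average(checkpoints):
    # Uniform average of parameter dicts. Per the record, 7 checkpoints
    # taken at 200-step intervals would be passed in.
    keys = checkpoints[0].keys()
    return {k: np.mean([ckpt[k] for ckpt in checkpoints], axis=0) for k in keys}
```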
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":1024}
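With stride 64 and context 1024, each window gives the model up to 1024 tokens of context but scores only the tokens not covered by the previous window, so every token is scored exactly once. A sketch of the window layout (the helper name is hypothetical):

```python
def sliding_windows(n_tokens, context_length=1024, stride=64):
    """Yield (start, stop, n_scored) triples for sliding-window eval.

    Each window spans up to `context_length` tokens; only the tokens past
    the previous window's end are scored, so scored counts sum to n_tokens.
    """
    windows, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_length, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```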
Sequence Length
sequence_length
train_length: 4096
eval_length: 1024
LR Schedule
warmdown
parameters: null
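A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final portion of training. Since the record lists no parameters, the 30% warmdown fraction below is purely an assumption for illustration:

```python
def lr_scale(step, total_steps, warmdown_frac=0.3):
    # Constant LR, then linear "warmdown" to zero over the last
    # warmdown_frac of training. warmdown_frac=0.3 is assumed; the
    # record gives no parameters for this schedule.
    warmdown_start = total_steps * (1.0 - warmdown_frac)
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))
```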
Other
other
FlashAttention 3 used to reduce step time
parameters: null
other
Expanded vocabulary size using a new 2048-token tokenizer trained on FineWeb data
parameters: {"vocab_size":2048}

Novel Contributions

  • Increased vocabulary size from 1024 to 2048 using a newly trained tokenizer
  • Replaced Muon with NorMuon
  • Used row-wise int6 quantization with fp16 embeddings and quantization-aware training via straight-through estimation
  • Applied FlashAttention 3 for faster training
  • Used sliding-window evaluation with stride 64 and context length 1024
  • Increased MLP width to 3x hidden dimension
  • Applied stochastic weight averaging over final checkpoints
  • Reported the record val_bpb as the mean over 3 runs with different seeds