PR #122

open

Record: Sliding Window Eval, 2048 Vocab Size, fp16 embeddings, SWA, NorMuon, FA3; mean_val_bpb:1.160

by mtybadger
val_bpb: 1.1603
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 15,353,270 bytes
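For reference, the val_bpb metric converts mean cross-entropy loss (nats per token) into bits per byte of raw text. A minimal sketch of that conversion; the function name and the token/byte counts in the test are illustrative, not taken from this run:

```python
import math

def bits_per_byte(mean_nll_nats, n_tokens, n_bytes):
    # bits/byte = (nats/token) * (tokens) / (bytes) / ln 2
    return mean_nll_nats * n_tokens / (n_bytes * math.log(2))
```

A tokenizer that packs more bytes into each token lowers bpb for the same per-token loss, which is why the vocabulary change below interacts with this metric.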

Training Techniques

Quantization
STE QAT int6
bits: 6
scope: row-wise weights; embeddings kept in fp16
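The forward pass of this scheme can be sketched as row-wise symmetric fake quantization; the backward pass of STE QAT (not shown) treats the op as identity. This is a hedged NumPy sketch of the general technique, not the record's actual implementation:

```python
import numpy as np

def fake_quant_int6_rowwise(w):
    """Row-wise symmetric fake quantization to signed int6 codes in [-31, 31].

    STE QAT forward: weights are quantized then dequantized here, while the
    backward pass passes gradients straight through as if this were identity.
    Per the record, embeddings would bypass this and stay fp16.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0.0, 1.0, scale)    # avoid div-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -31, 31)     # integer codes
    return q * scale                              # dequantized weights for the forward pass
```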
Architecture
MLP3x
Increased MLP hidden dimension from 1024 to 1536
parameters: {"hidden_dim":1536,"multiplier":3}
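The listed hidden_dim of 1536 with multiplier 3 implies a model width of 512 (an inference, not stated in the record). A minimal sketch of the widened MLP block; the activation is a placeholder, since the record does not specify one:

```python
import numpy as np

D_MODEL = 512           # inferred from hidden_dim 1536 / multiplier 3 (assumption)
D_HIDDEN = 3 * D_MODEL  # 1536, up from the previous 1024 (2x)

def mlp3x(x, w_in, w_out):
    # Two-layer MLP block at 3x width; ReLU stands in for whatever
    # activation the actual run used.
    return np.maximum(x @ w_in, 0.0) @ w_out
```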
Optimizer
NorMuon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
SWA
parameters: {"checkpoint_interval_steps":200,"num_checkpoints":7}
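SWA with these parameters amounts to a uniform average of the last 7 checkpoints, saved every 200 steps. A sketch with parameter dicts; the function name is hypothetical:

```python
import numpy as np

def swa_average(checkpoints):
    # Uniform average of parameter dicts. Per the record, 7 checkpoints
    # taken at 200-step intervals would be passed in.
    keys = checkpoints[0].keys()
    return {k: np.mean([ckpt[k] for ckpt in checkpoints], axis=0) for k in keys}
```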
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":1024}
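With stride 64 and context 1024, each window gives the model up to 1024 tokens of context but scores only the tokens not covered by the previous window, so every token is scored exactly once. A sketch of the window layout (the helper name is hypothetical):

```python
def sliding_windows(n_tokens, context_length=1024, stride=64):
    """Yield (start, stop, n_scored) triples for sliding-window eval.

    Each window spans up to `context_length` tokens; only the tokens past
    the previous window's end are scored, so scored counts sum to n_tokens.
    """
    windows, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_length, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```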
Sequence Length
sequence_length
train_length: 4096
eval_length: 1024
LR Schedule
warmdown
parameters: null
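A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final portion of training. Since the record lists no parameters, the 30% warmdown fraction below is purely an assumption for illustration:

```python
def lr_scale(step, total_steps, warmdown_frac=0.3):
    # Constant LR, then linear "warmdown" to zero over the last
    # warmdown_frac of training. warmdown_frac=0.3 is assumed; the
    # record gives no parameters for this schedule.
    warmdown_start = total_steps * (1.0 - warmdown_frac)
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))
```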
Other
other
FlashAttention 3 used to reduce step time
parameters: null
other
Expanded vocabulary size using a new 2048-token tokenizer trained on FineWeb data
parameters: {"vocab_size":2048}

Novel Contributions

  • Increased vocabulary size from 1024 to 2048 using a newly trained tokenizer
  • Replaced Muon with NorMuon
  • Used row-wise int6 quantization with fp16 embeddings and quantization-aware training via straight-through estimation
  • Applied FlashAttention 3 for faster training
  • Used sliding-window evaluation with stride 64 and context length 1024
  • Increased MLP width to 3x hidden dimension
  • Applied stochastic weight averaging over final checkpoints
  • Reported the record val_bpb as the mean over 3 runs with different seeds