PR #123
openRecord: Vocab 4096 + MLP 3x + Sliding Window Eval (mean val_bpb=1.1642, 3 seeds)
by saikrishnarallabandi
val_bpb: 1.1642
Architecture: GPT
Optimizer: Muon
Artifact Size: ~15.85 MB
Training Techniques
Architecture
MLP3x
Expands the MLP hidden size to 3x the baseline, using the parameter-memory savings from int6 weight quantization.
parameters: {"multiplier":3,"hidden_size":1536}
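A minimal NumPy sketch of the widened MLP. The hidden width follows the recorded parameters (multiplier 3, hidden_size 1536), which implies d_model = 512; that model width, the GELU nonlinearity, and the init scale are assumptions, not taken from the PR.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def init_mlp(d_model=512, multiplier=3, seed=0):
    # hidden = multiplier * d_model -> 3 * 512 = 1536, the PR's hidden_size
    rng = np.random.default_rng(seed)
    hidden = multiplier * d_model
    w_in = rng.normal(0.0, 0.02, (d_model, hidden))
    w_out = rng.normal(0.0, 0.02, (hidden, d_model))
    return w_in, w_out

def mlp_forward(x, w_in, w_out):
    # (batch, 512) -> (batch, 1536) -> (batch, 512)
    return gelu(x @ w_in) @ w_out

w_in, w_out = init_mlp()
y = mlp_forward(np.ones((4, 512)), w_in, w_out)
print(y.shape)  # (4, 512)
```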
tied embeddings
Uses tied input/output embeddings.
parameters: null
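Weight tying reuses the token-embedding matrix as the output projection, saving one vocab_size x d_model matrix. A sketch using the PR's vocab size of 4096 (d_model = 512 is an assumption):

```python
import numpy as np

V, d = 4096, 512                         # vocab from the PR; d_model assumed
rng = np.random.default_rng(0)
W_embed = rng.normal(0.0, 0.02, (V, d))  # the single shared matrix

def embed(token_ids):
    return W_embed[token_ids]            # input side: row lookup

def logits(hidden):
    return hidden @ W_embed.T            # output side: reuse the same matrix

h = embed(np.array([1, 2, 3]))
print(logits(h).shape)  # (3, 4096)
```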
Quantization
STE QAT
bits: 6
scope: weights
int8
bits: 8
scope: embeddings
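The two quantization entries (int6 STE for weights, int8 for embeddings) can be illustrated with the usual symmetric per-tensor fake-quantize step; the per-tensor scaling choice is an assumption, and since NumPy has no autograd, the STE part is only described in the comment:

```python
import numpy as np

def fake_quantize(w, bits):
    """Symmetric fake quantization: quantize, then immediately dequantize.

    In STE QAT the forward pass uses these snapped values while the
    backward pass treats round() as identity, so gradients still flow
    to the underlying fp32 weights.
    """
    qmax = 2 ** (bits - 1) - 1                   # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.linspace(-1.0, 1.0, 7)
w6 = fake_quantize(w, bits=6)   # weights path in the PR
w8 = fake_quantize(w, bits=8)   # embeddings path in the PR
```

The worst-case rounding error is half a quantization step (scale / 2), which is the "small quantization gap" the contributions list refers to.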
Weight Averaging
SWA
parameters: {"checkpoints":7}
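SWA here averages the weights of 7 saved checkpoints into one model. A toy sketch with dict-of-array checkpoints standing in for real state dicts (uniform averaging assumed):

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Uniform parameter-wise average of a list of state dicts (SWA)."""
    keys = checkpoints[0].keys()
    return {k: np.mean([ckpt[k] for ckpt in checkpoints], axis=0) for k in keys}

# Toy stand-ins for the 7 checkpoints saved near the end of training.
ckpts = [{"w": np.full((2, 2), float(i))} for i in range(7)]
avg = average_checkpoints(ckpts)
print(avg["w"][0, 0])  # 3.0 — the mean of 0..6
```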
Evaluation
sliding window eval
parameters: {"stride":256,"context_length":4096}
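Sliding-window evaluation runs overlapping 4096-token windows advanced by the 256-token stride; in the standard form of this scheme (assumed here), loss is accumulated only on the tokens not covered by a previous window, so each token is scored once with nearly full left context:

```python
def sliding_windows(n_tokens, context_length=4096, stride=256):
    """Plan eval windows: each (begin, end, score_from) triple means run the
    model on tokens [begin, end) but accumulate loss only on [score_from, end).
    After the first window, every token is scored exactly once with up to
    context_length - stride tokens of left context."""
    windows, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_length, n_tokens)
        windows.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows

wins = sliding_windows(5000)
print(wins[0], wins[1])  # (0, 4096, 0) (256, 4352, 4096)
```

The small stride trades compute (each window recomputes most of its context) for a tighter bpb estimate than chopping the stream into disjoint 4096-token blocks.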
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
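A warmdown schedule holds the learning rate flat, then decays it over the final warmdown_steps. The linear-to-zero shape below is an assumption (it is the common choice in nanoGPT-style runs; the PR only records warmdown_steps = 3000):

```python
def lr_multiplier(step, total_steps, warmdown_steps=3000):
    """1.0 until the warmdown begins, then linear decay to 0 at total_steps."""
    if step < total_steps - warmdown_steps:
        return 1.0
    return (total_steps - step) / warmdown_steps

print(lr_multiplier(0, 10_000))      # 1.0
print(lr_multiplier(8_500, 10_000))  # 0.5
print(lr_multiplier(10_000, 10_000)) # 0.0
```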
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03}
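The momentum warmup parameters (start 0.92, final 0.99, 1500 steps) suggest a per-step ramp for Muon's momentum; linear interpolation is assumed here, since the record does not state the interpolation shape:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Ramp momentum from `start` to `end` over warmup_steps, then hold."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)

print(muon_momentum(0))     # 0.92
print(muon_momentum(750))   # ~0.955, halfway through the warmup
print(muon_momentum(1500))  # ~0.99, held for the rest of training
```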
Other
other
Custom SentencePiece BPE tokenizer with vocab size 4096 trained on FineWeb.
parameters: {"vocab_size":4096}
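With a custom 4096-entry vocabulary, per-token loss is not comparable to runs using other tokenizers, which is why the headline metric is bits per byte. The standard conversion normalizes summed cross-entropy by raw bytes (the toy numbers below are illustrative, not the PR's):

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Summed cross-entropy in nats -> bits per byte of raw validation text.
    Normalizing by bytes keeps scores comparable across tokenizers, since a
    smaller vocab simply spreads the same byte budget over more tokens."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Illustrative only: 1,000,000 validation bytes, 807,000 nats of total loss.
print(round(bits_per_byte(807_000, 1_000_000), 2))  # ~1.16
```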
Novel Contributions
- Custom SentencePiece BPE tokenizer with vocab size 4096
- 3x MLP expansion enabled by int6 quantization savings
- Int6 STE fake quantization with small quantization gap
- Training with 4096-token sequences
- Stochastic Weight Averaging over 7 checkpoints
- Sliding window evaluation with stride 256