PR #321
Add record: Optimizer Tuning + Sliding Window Eval (val_bpb=1.1864)
by andreanjos
val_bpb: 1.1864
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,861,337 bytes
Training Techniques
Optimizer: Muon
- weight_decay: null
- momentum: null
- other_params: {"backend_steps": 10, "grad_clip_norm": 1, "beta2": 0.99, "scalar_lr": 0.02}
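The `backend_steps` parameter presumably counts Newton-Schulz iterations, the "backend" of the Muon update that approximately orthogonalizes each gradient matrix. A minimal numpy sketch of just that step, using the coefficients from the public Muon reference implementation (the surrounding momentum/LR logic is omitted, and this is not the submission's actual code):

```python
import numpy as np

def newton_schulz(G, steps=10):
    """Approximately orthogonalize the gradient matrix G with a quintic
    Newton-Schulz iteration; `steps` corresponds to backend_steps above."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the Muon reference impl
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius-normalize so spectral norm <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

After enough iterations the singular values of the result cluster near 1, which is what makes the update direction roughly orthogonal.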
Sequence Length
- train_length: 2048
- eval_length: 2048
LR Schedule: warmdown
- parameters: {"warmdown_iters": 10000}
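A "warmdown" schedule typically holds the learning rate constant and then decays it linearly to zero over the final iterations. A minimal sketch with `warmdown_iters=10000` as above (the total iteration count here is illustrative, not taken from the submission):

```python
def lr_multiplier(step, total_iters, warmdown_iters=10000):
    """Constant LR until the final warmdown_iters steps, then linear
    decay to zero over those last steps."""
    start = total_iters - warmdown_iters
    if step < start:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)
```

Multiply the base learning rate by this factor each step; a longer warmdown spends more of training at a decaying rate.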
Regularization: gradient clipping
- parameters: {"norm": 1}
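Gradient clipping with `norm: 1` usually means rescaling all gradients so their global L2 norm is at most 1. A self-contained numpy sketch of that operation (in PyTorch this would be `torch.nn.utils.clip_grad_norm_`):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their combined L2 norm
    is at most max_norm; returns the clipped grads and the pre-clip norm."""
    total = float(np.sqrt(sum(np.sum(g * g) for g in grads)))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total
```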
Evaluation: sliding window eval
- parameters: {"stride": 64}
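In a sliding window evaluation, the model advances by `stride` tokens per window and scores only the tokens not covered by an earlier window, so each scored token (after the first window) conditions on up to a full window of context. A sketch of the window plan, assuming a 2048-token window and stride 64 as listed above (hypothetical helper, not the submission's code):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Plan eval windows: each covers tokens [begin, end) but only the
    new positions [score_from, end) contribute to the loss, so every
    token is scored exactly once with maximal preceding context."""
    plan, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        plan.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return plan
```

A smaller stride means more forward passes but more context per scored token, which is why it lowers measured val_bpb without changing the model.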
Architecture: weight tying
Tied input and output embeddings in the baseline architecture.
- parameters: null
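Weight tying means the input embedding table and the output projection share one matrix, roughly halving the embedding parameter count. A minimal numpy sketch of the idea (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 64
W_embed = rng.normal(size=(vocab_size, d_model)) * 0.02  # the single shared matrix

def embed(token_ids):
    """Input embedding: look up rows of the shared matrix."""
    return W_embed[token_ids]

def lm_logits(hidden):
    """Output head: project with the transpose of the *same* matrix."""
    return hidden @ W_embed.T
```

Because the artifact is the compressed weights, tying directly shrinks the bytes that count against the size cap.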
Quantization: int8
- bits: 8
- scope: all weights
Compression: zlib
- level: null
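The int8 + zlib pipeline can be sketched as symmetric per-tensor quantization followed by lossless compression. A minimal round-trip example (the per-tensor scaling scheme is an assumption; the submission may scale differently):

```python
import zlib
import numpy as np

def compress_weights(w, level=9):
    """Symmetric per-tensor int8 quantization, then zlib on the raw bytes."""
    scale = float(np.max(np.abs(w))) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale round-trips exactly
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes(), level), scale, w.shape

def decompress_weights(blob, scale, shape):
    """Invert: zlib-decompress, reinterpret as int8, rescale to float."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale
```

The quantization error per weight is bounded by half the scale, while zlib squeezes out the remaining redundancy in the int8 bytes.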
Novel Contributions
- Optimizer tuning with longer warmdown, more Muon backend steps, gradient clipping, higher beta2, and lower scalar learning rate
- Training with longer sequence length (2048 tokens)
- Sliding window validation evaluation with stride 64 to give tokens more context during scoring
- Post-training int8 quantization plus zlib compression to fit the artifact under the 16MB submission cap
- Reproducible multi-seed validation showing a new record-level val_bpb