PR #321

open

Add record: Optimizer Tuning + Sliding Window Eval (val_bpb=1.1864)

by andreanjos
val_bpb: 1.1864
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,861,337 bytes

Training Techniques

Optimizer: Muon
  weight_decay: null
  momentum: null
  other_params: {"backend_steps":10,"grad_clip_norm":1,"beta2":0.99,"scalar_lr":0.02}
Sequence Length
  train_length: 2048
  eval_length: 2048
LR Schedule: warmdown
  parameters: {"warmdown_iters":10000}
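A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final warmdown_iters steps. A minimal sketch, assuming a linear decay shape; the total iteration count and the 0.02 base rate (borrowed from scalar_lr above) are illustrative, not confirmed by the record:

```python
def warmdown_lr(step, total_iters, warmdown_iters=10000, base_lr=0.02):
    # Constant LR until the final `warmdown_iters` steps remain,
    # then decay linearly to zero at the last step.
    steps_left = total_iters - step
    if steps_left >= warmdown_iters:
        return base_lr
    return base_lr * steps_left / warmdown_iters
```

For example, with total_iters=20000 the rate stays at 0.02 through step 10000 and reaches 0.0 exactly at step 20000.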
Regularization: gradient clipping
  parameters: {"norm":1}
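Gradient clipping with norm 1 is typically global-norm clipping: if the combined L2 norm of all gradients exceeds the threshold, rescale them so it equals the threshold. A framework-agnostic sketch in plain Python (real training code would use a library routine such as PyTorch's torch.nn.utils.clip_grad_norm_):

```python
def clip_grad_norm(grads, max_norm=1.0):
    # Compute the global L2 norm across all gradient values; if it
    # exceeds max_norm, scale every gradient by max_norm / total_norm.
    total_norm = sum(g * g for g in grads) ** 0.5
    if total_norm <= max_norm:
        return grads
    scale = max_norm / total_norm
    return [g * scale for g in grads]
```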
Evaluation: sliding window eval
  parameters: {"stride":64}
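With stride 64 and eval_length 2048, each window advances 64 tokens and only the newest 64 tokens are scored, so every scored token (after the first window) sees close to a full 2048-token context. A sketch of the window bookkeeping, with the model's scoring call itself omitted:

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    # Return (begin, end, score_from) triples: each window covers
    # tokens[begin:end], but only tokens[score_from:end] are scored.
    # The first window scores everything; later windows score only
    # the `stride` fresh tokens, each with ~`window` tokens of context.
    spans = []
    end = min(window, n_tokens)
    spans.append((0, end, 0))
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        begin = max(0, new_end - window)
        spans.append((begin, new_end, end))
        end = new_end
    return spans
```

Every token is scored exactly once, at the cost of roughly window/stride forward passes per window-length of text, which is why a stride of 64 is far more expensive than non-overlapping chunked evaluation.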
Architecture: weight tying
  Tied input and output embeddings in the baseline architecture.
  parameters: null
Quantization: int8
  bits: 8
  scope: all weights
Compression: zlib
  level: null
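A minimal sketch of the artifact packing step, assuming symmetric per-tensor int8 quantization with a stored fp32 scale (the record specifies only "int8, all weights" plus zlib, so the exact scheme and compression level are assumptions):

```python
import struct
import zlib

def quantize_int8(weights):
    # Symmetric per-tensor quantization: map the largest magnitude
    # to 127, then round every weight to the nearest int8 step.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return scale, q

def pack(scale, q):
    # Serialize the fp32 scale followed by the int8 payload,
    # then zlib-compress the whole blob (level 9 is an assumption).
    raw = struct.pack("f", scale) + bytes((v & 0xFF) for v in q)
    return zlib.compress(raw, level=9)
```

Dequantization multiplies each int8 value by the stored scale; the zlib pass then squeezes redundancy in the int8 payload to fit under the submission cap.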

Novel Contributions

  • Optimizer tuning with longer warmdown, more Muon backend steps, gradient clipping, higher beta2, and lower scalar learning rate
  • Training with longer sequence length (2048 tokens)
  • Sliding window validation evaluation with stride 64 to give tokens more context during scoring
  • int8 post-training quantization plus zlib compression to fit the artifact under the 16MB submission cap
  • Reproducible multi-seed validation showing a new record-level val_bpb