PR #169

open

Sliding Window Eval + Muon6 (val_bpb 1.1973)

by beee003
val_bpb
1.1973
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.9 MB

Training Techniques

Evaluation
sliding window eval
parameters: {"stride":256}
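A minimal sketch of how a stride-256 sliding window can be laid out over a token stream, assuming the 1024-token eval length listed in this summary. The helper name `sliding_windows` and the rule that each window scores only the tokens not yet scored by the previous window are illustrative assumptions, not code taken from the PR.

```python
def sliding_windows(n_tokens, seq_len=1024, stride=256):
    """Return (begin, end, score_from) triples covering n_tokens.

    Each window feeds tokens[begin:end] to the model, but only the
    positions in [score_from, end) contribute to the loss, so every
    token is scored exactly once.
    """
    windows = []
    prev_end = 0  # first position that has not been scored yet
    begin = 0
    while prev_end < n_tokens:
        end = min(begin + seq_len, n_tokens)
        windows.append((begin, end, prev_end))
        prev_end = end
        begin += stride
    return windows


# Hypothetical 2000-token validation stream.
windows = sliding_windows(2000)
```

Under this layout, every token scored after the first window has at least seq_len - stride = 768 tokens of prior context, whereas evaluating non-overlapping 1024-token chunks would score the first tokens of each chunk with almost none.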
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"backend_steps":6,"momentum_warmup_steps":1000}
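The two Muon settings above can be sketched in plain Python. This is an illustration under stated assumptions, not the PR's code: the quintic coefficients are those of the public Muon reference implementation, the helper names are hypothetical, and the momentum endpoints 0.85 and 0.95 are assumed because this summary lists momentum as null.

```python
import math

# Quintic coefficients from the public Muon reference implementation
# (assumed here; the PR summary does not state them).
NS_A, NS_B, NS_C = 3.4445, -4.7750, 2.0315


def transpose(x):
    return [list(col) for col in zip(*x)]


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def newton_schulz(grad, steps=6, eps=1e-7):
    """Approximately orthogonalize grad with `steps` quintic iterations."""
    tall = len(grad) > len(grad[0])
    # Work on the wide orientation so the Gram matrix is the smaller one.
    x = transpose(grad) if tall else [row[:] for row in grad]
    norm = math.sqrt(sum(v * v for row in x for v in row))
    x = [[v / (norm + eps) for v in row] for row in x]
    for _ in range(steps):
        gram = matmul(x, transpose(x))  # X X^T
        gram2 = matmul(gram, gram)
        poly = [[NS_B * g + NS_C * g2 for g, g2 in zip(r1, r2)]
                for r1, r2 in zip(gram, gram2)]
        px = matmul(poly, x)
        x = [[NS_A * v + p for v, p in zip(r1, r2)] for r1, r2 in zip(x, px)]
    return transpose(x) if tall else x


def muon_momentum(step, warmup_steps=1000, start=0.85, end=0.95):
    """Linearly ramp momentum from start to end over warmup_steps."""
    frac = min(step / warmup_steps, 1.0)
    return (1.0 - frac) * start + frac * end


# Singular values 3, 1 and 0.5 are all pushed toward 1.
ortho = newton_schulz([[3.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.5]])
```

With these coefficients the iteration does not converge to singular values of exactly 1; it drives them into a band around 1, which is why the step count is a tunable setting.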
LR Schedule
warmdown
parameters: {"warmdown_iters":1500}
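One common reading of a warmdown schedule is a constant learning rate followed by a linear decay to zero over the final warmdown_iters steps. The sketch below assumes that shape; the function name `lr_scale` and the `total_steps` argument are hypothetical, since this summary gives only warmdown_iters.

```python
def lr_scale(step, total_steps, warmdown_iters=1500):
    """Multiplier applied to the base learning rate at a given step.

    Assumed shape: 1.0 until the last warmdown_iters steps, then a
    linear ramp down to 0.0 at total_steps.
    """
    if warmdown_iters <= 0:
        return 1.0
    remaining = max(total_steps - step, 0)
    return min(remaining / warmdown_iters, 1.0)


# Hypothetical 10,000-step run: decay begins at step 8,500.
scales = [lr_scale(s, 10_000) for s in (0, 8_500, 9_250, 10_000)]
```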
Architecture
tied embeddings
Uses tied input and output embeddings in the baseline architecture.
parameters: null
KV head count
Uses grouped-query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
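With 8 query heads and 4 KV heads, each key/value head is shared by two query heads. The sketch below shows the conventional contiguous grouping; the helper name `kv_head_for` and the grouping order are assumptions, as the summary does not say how the heads are paired.

```python
def kv_head_for(query_head, heads=8, kv_heads=4):
    """Index of the KV head that a given query head attends with."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # query heads per KV head
    return query_head // group_size


mapping = [kv_head_for(h) for h in range(8)]
```

Compared with full multi-head attention, this halves the size of the key and value projections, which matters when the artifact size is constrained.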
Sequence Length
sequence_length
train_length: null
eval_length: 1024
Other
other
Added a forward_logits() method for efficient single-sequence inference during evaluation.
parameters: null

Novel Contributions

  • Sliding window evaluation with stride 256 to score tokens with more prior context
  • Muon with 6 Newton-Schulz iterations (backend_steps = 6) for a closer approximation to an orthogonalized update
  • Momentum warmup extended to 1000 steps (momentum_warmup_steps) to stabilize early training
  • Longer warmdown schedule (warmdown_iters = 1500) for smoother learning rate decay
  • Added forward_logits() for efficient evaluation inference