PR #169

open

Sliding Window Eval + Muon6 (val_bpb 1.1973)

val_bpb

1.1973

Architecture

Transformer

Optimizer

Muon

Artifact Size

~15.9 MB

Training Techniques

Evaluation

sliding window eval

parameters: {"stride":256}
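
A minimal sketch of how the evaluation windows might be laid out, assuming a window equal to eval_length (1024) and a stride of 256. The function name and the (start, end, score_from) tuple are hypothetical, and the one-token shift between inputs and targets is omitted. The first window scores all of its tokens; each later window scores only the tokens no earlier window has covered.

```python
def sliding_windows(n_tokens, window=1024, stride=256):
    """Yield (start, end, score_from) spans over a token stream.

    The model reads tokens[start:end] but only tokens[score_from:end]
    contribute to the loss, so every token is scored exactly once.
    """
    if n_tokens <= 0:
        return
    prev_end = 0   # first position not yet scored
    start = 0
    while True:
        end = min(start + window, n_tokens)
        yield start, end, prev_end
        prev_end = end
        if end == n_tokens:
            break
        start += stride


spans = list(sliding_windows(2000))
```

With these settings, every token scored after the first window has at least 1024 - 256 = 768 tokens of preceding context, at the cost of roughly four forward passes per 1024 tokens.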

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"backend_steps":6,"momentum_warmup_steps":1000}
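
To illustrate what backend_steps controls, here is a dependency-free sketch of a quintic Newton-Schulz iteration of the kind Muon uses to approximately orthogonalize an update matrix. The coefficients are the ones commonly seen in public Muon implementations and are an assumption here, not taken from this PR. The helper names are hypothetical, and real code would use tensor operations rather than nested lists.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=6, eps=1e-7):
    """Push the singular values of g toward 1 with `steps` quintic iterations."""
    a, b, c = 3.4445, -4.7750, 2.0315  # assumed coefficients, see lead-in
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / (norm + eps) for v in row] for row in g]  # singular values <= 1
    transposed = len(x) > len(x[0])
    if transposed:  # iterate on the wide orientation so x @ x.T is small
        x = transpose(x)
    for _ in range(steps):
        gram = matmul(x, transpose(x))
        gram2 = matmul(gram, gram)
        poly = [[b * p + c * q for p, q in zip(r1, r2)] for r1, r2 in zip(gram, gram2)]
        px = matmul(poly, x)
        x = [[a * p + q for p, q in zip(rx, rp)] for rx, rp in zip(x, px)]
    return transpose(x) if transposed else x


result = newton_schulz([[3.0, 0.0], [0.0, 4.0]], steps=6)
```

With these coefficients the singular values do not converge exactly to 1. They settle into a band around it, so the result is only approximately orthogonal.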

LR Schedule

warmdown

parameters: {"warmdown_iters":1500}
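
A sketch of what a warmdown of 1500 iterations could look like as a learning-rate multiplier. The linear shape and the total_iters value are assumptions for illustration; the PR only states warmdown_iters.

```python
def lr_scale(step, total_iters, warmdown_iters=1500):
    """Return the learning-rate multiplier for a given step.

    Holds at 1.0, then decays linearly to 0.0 over the final
    `warmdown_iters` steps (assumed shape).
    """
    warmdown_start = total_iters - warmdown_iters
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)


# total_iters=10000 is a hypothetical run length.
schedule = [lr_scale(s, total_iters=10000) for s in (0, 8500, 9250, 10000)]
```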

Architecture

tied embeddings

Uses tied input and output embeddings in the baseline architecture.

parameters: null
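
As a small illustration of weight tying, the sketch below reuses a single table both to look up input embeddings and to project a hidden state to vocabulary logits. The class and method names are hypothetical, and plain lists stand in for tensors.

```python
class TiedEmbeddingLM:
    """Toy model head in which input and output embeddings share one table."""

    def __init__(self, table):
        self.table = table  # shape [vocab][dim]; the only embedding parameter

    def embed(self, token_id):
        # Input side: look up the row for this token.
        return self.table[token_id]

    def logits(self, hidden):
        # Output side: dot the hidden state with every row of the same table.
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.table]


model = TiedEmbeddingLM([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores = model.logits([2.0, 3.0])
```

Sharing the table removes a second vocab-by-dim matrix from the model, which helps when the artifact size is constrained.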

KV head count

Uses grouped-query attention with 8 heads and 4 KV heads.

parameters: {"heads":8,"kv_heads":4}
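
With 8 query heads and 4 KV heads, each key/value head serves a group of 2 query heads. The sketch below shows the usual contiguous grouping; the function name is hypothetical and the PR's actual head layout is assumed to follow this convention.

```python
def kv_head_for(q_head, heads=8, kv_heads=4):
    """Map a query head index to the KV head it reads from."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # query heads per KV head
    return q_head // group_size


mapping = [kv_head_for(h) for h in range(8)]
```

Under this layout the key and value projections are half the size they would be with 8 KV heads.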

Sequence Length

sequence_length

train_length: null

eval_length: 1024

Other

other

Added a forward_logits() method for efficient single-sequence inference during evaluation.

parameters: null
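
How forward_logits() is implemented is not shown here. Assuming it returns one row of logits per position, the sketch below shows how such logits could be turned into per-token losses and then into a bits-per-byte figure like val_bpb. The function names and the byte-count argument are hypothetical.

```python
import math


def token_nll(logits, target):
    """Negative log-likelihood in nats of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def bits_per_byte(logit_rows, targets, n_bytes):
    """Total loss in bits divided by the number of bytes the tokens cover."""
    total_nats = sum(token_nll(row, t) for row, t in zip(logit_rows, targets))
    return total_nats / (math.log(2) * n_bytes)


# Two tokens with uniform logits over 4 classes, covering 4 bytes of text.
bpb = bits_per_byte([[0.0] * 4, [0.0] * 4], targets=[1, 3], n_bytes=4)
```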

Sliding window evaluation with stride 256, so each scored token sees more prior context.
Muon with 6 Newton-Schulz iterations (backend_steps=6) for a more accurate orthogonalization of the update.
Momentum warmup extended to 1000 steps to stabilize early training.
Warmdown lengthened to 1500 iterations for a smoother learning rate decay.
Added forward_logits() for efficient inference during evaluation.
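
The momentum warmup listed above can be pictured as a linear ramp over the first 1000 steps. The start and end values (0.85 and 0.95) are illustrative placeholders, since the PR does not state them, and the function name is hypothetical.

```python
def muon_momentum(step, warmup_steps=1000, start=0.85, end=0.95):
    """Linearly interpolate momentum from `start` to `end` during warmup."""
    frac = min(step / warmup_steps, 1.0)
    return (1 - frac) * start + frac * end


ramp = [muon_momentum(s) for s in (0, 500, 1000, 5000)]
```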