PR #381

Status: open

Non-record: 10L FP16-Embed + Warmdown20k

by codestrongestx
val_bpb: 1.1739
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14,178,772 bytes (≈14.2 MB)

Training Techniques

Architecture
tied embeddings
Uses tied input/output embeddings in a 10-layer sliding-window Transformer setup.
parameters: {"layers":10,"model_dim":512,"num_heads":8,"num_kv_heads":4}
Quantization
fp16
bits: 16
scope: embeddings
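
With scope limited to embeddings, the natural reading is that only the embedding table is stored in float16 while other parameters keep their original dtype; because the embeddings are tied, one cast covers both the input table and the output head. A hedged sketch, reusing the hypothetical TiedDecoder above:

```python
import torch

def cast_embeddings_fp16(model):
    # Cast only the embedding table to float16 (scope: embeddings).
    # The tied lm_head shares this tensor, so it is covered by the same
    # cast; the forward pass may need an explicit cast back to the
    # compute dtype before the first Transformer block.
    model.embed.weight.data = model.embed.weight.data.half()
    return model
```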
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"muon_backend_steps":5}
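
If muon_backend_steps refers, as in the public Muon implementation, to the number of Newton-Schulz iterations used to approximately orthogonalize each update matrix, the backend step looks roughly like this (coefficients from the published Muon code; steps=5 matches the parameter above):

```python
import torch

def zeropower_via_newtonschulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration: drives the singular values of G
    # toward 1, yielding an approximately orthogonal update direction.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X /= (X.norm() + 1e-7)          # Frobenius norm bounds the spectral norm
    if G.size(0) > G.size(1):
        X = X.T                      # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X
```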
Evaluation
sliding window eval
parameters: {"stride":64}
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_iters":20000,"warmup_steps":20}
Initialization
OvertoneInit
Baseline uses OvertoneInit; this submission is built on that merged baseline.

Novel Contributions

  • Increased WARMDOWN_ITERS from 2500 to 20000
  • Built on the merged 2026-03-19 SlidingWindow FP16-Embed 10L MuonWD OvertoneInit baseline
  • Verified non-record submission with slightly improved val_bpb over the merged seed-42 baseline
  • Included a PyTorch 2.4 SDPA GQA compatibility fallback in the training script (see the sketch below)
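
The PR's actual fallback code is not shown, but the likely issue is that F.scaled_dot_product_attention only accepts the enable_gqa flag from PyTorch 2.5 onward; on 2.4, grouped-query attention needs the KV heads expanded to match the query heads. A minimal shim under that assumption:

```python
import torch
import torch.nn.functional as F

def sdpa_gqa(q, k, v, is_causal=True):
    # q: (B, Hq, T, D); k, v: (B, Hkv, T, D) with Hq a multiple of Hkv
    # (here 8 query heads over 4 KV heads).
    try:
        # PyTorch >= 2.5 handles GQA natively.
        return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal,
                                              enable_gqa=True)
    except TypeError:
        # PyTorch 2.4: no enable_gqa kwarg; repeat KV heads instead.
        rep = q.size(1) // k.size(1)
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)
```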