PR #60

RECORD · closed

Record: Sliding Window + FP16 Embed + 10L + Muon WD + Overtone Init (val_bpb=1.1748)

by notapplica
val_bpb: 1.1748
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~14.7 MB

Training Techniques

Evaluation: sliding window eval
  parameters: {"stride":64,"seq_len":1024}
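A minimal sketch of this evaluation scheme (the `logprob_fn` interface is an assumption for illustration, not the PR's API): slide a 1024-token window forward 64 tokens at a time and score only the tokens new to each window, so every scored token after the first window conditions on at least 1024 − 64 = 960 tokens of context. Dividing the returned bits by the byte count of the evaluated text would give val_bpb.

```python
import math

def sliding_window_nll(logprob_fn, tokens, seq_len=1024, stride=64):
    """Sliding-window scoring: advance a seq_len window by `stride`
    tokens at a time and only score the tokens that are new to each
    window, so scored tokens (after the first window) see long context.

    logprob_fn(context, token) -> log p(token | context), natural log.
    Returns total negative log-likelihood in bits.
    """
    total_bits = 0.0
    scored_to = 1  # token 0 has no context and is not scored
    begin = 0
    while scored_to < len(tokens):
        end = min(begin + seq_len, len(tokens))
        # Score only positions not already covered by a previous window.
        for pos in range(max(begin + 1, scored_to), end):
            context = tokens[begin:pos]
            total_bits -= logprob_fn(context, tokens[pos]) / math.log(2)
        scored_to = end
        begin += stride
    return total_bits
```

With stride much smaller than seq_len this costs roughly seq_len/stride forward passes per token's worth of text, which is the price paid for the longer per-token context.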
Quantization: fp16
  bits: 16
  scope: tied embeddings
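A hedged sketch of a mixed-precision export along these lines (key names and the per-tensor int8 scheme are assumptions, not the PR's format): most tensors are stored as int8 with a per-tensor scale, but the tied embedding matrix, which serves both the input lookup and the output logits, is kept in fp16 so rounding error there does not distort every token's logits.

```python
import numpy as np

def export_weights(state, embed_key="embed.weight"):
    """Mixed-precision artifact export (illustrative sketch).

    state: dict of name -> float32 ndarray.
    The tied embedding stays fp16; everything else is quantized to
    int8 with a symmetric per-tensor scale.
    """
    artifact = {}
    for name, w in state.items():
        if name == embed_key:
            # Input/output path: keep full fp16 precision.
            artifact[name] = ("fp16", w.astype(np.float16))
        else:
            scale = float(np.abs(w).max()) / 127.0
            scale = scale if scale > 0 else 1.0
            q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
            artifact[name] = ("int8", q, np.float32(scale))
    return artifact
```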
Architecture: Transformer depth
  Increased model depth from 9 to 10 transformer layers.
  parameters: {"layers":10}
Optimizer: Muon
  weight_decay: 0.02
  momentum: null
  other_params: {"decoupled_weight_decay":true}
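The decoupled variant can be sketched as follows (a simplified single-matrix step, not the PR's implementation; the Newton–Schulz coefficients are the published Muon ones, the rest of the hyperparameters are placeholders): the decay multiplies the weights directly, scaled by the learning rate AdamW-style, rather than being added to the gradient before the momentum/orthogonalization path.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Quintic Newton-Schulz iteration (Muon's coefficients) that
    approximately orthogonalizes the matrix G."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(w, grad, buf, lr=0.02, momentum=0.95, wd=0.02):
    """One Muon-style update with *decoupled* weight decay."""
    buf = momentum * buf + grad
    update = newton_schulz(buf)
    w = w * (1.0 - lr * wd)  # decoupled decay: acts on weights directly
    w = w - lr * update      # orthogonalized momentum path
    return w, buf
```

Because the decay never enters the momentum buffer, it is not distorted by the orthogonalization step, which is the usual motivation for decoupling.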
Initialization: spectral init
  Overtone spectral embedding initialization using SVD power-law spectrum shaping.
Initialization: resid mix
  Phase-transition residual mixing with sigmoid-scheduled resid_mix initialization.
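A sketch of what SVD power-law spectrum shaping could look like ("Overtone" is the PR's name for the technique; `alpha`, `scale`, and the exact spectrum are assumed knobs): draw a random Gaussian matrix, SVD it to get orthonormal factors, then reimpose a power-law singular-value spectrum s_k ∝ k^(−alpha) before recomposing the embedding matrix.

```python
import numpy as np

def powerlaw_spectral_init(vocab, dim, alpha=1.0, scale=1.0, seed=0):
    """Spectral embedding init sketch: random orthonormal factors
    from an SVD, with singular values replaced by a power-law decay."""
    rng = np.random.default_rng(seed)
    G = rng.normal(size=(vocab, dim))
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    s = scale * (np.arange(1, dim + 1, dtype=float) ** -alpha)
    return (U * s) @ Vt  # columns of U scaled by the shaped spectrum
```

The resulting matrix has exactly the prescribed spectrum, so the initialization controls how quickly embedding "energy" falls off across directions.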
Regularization: weight decay
  parameters: {"weight_decay":0.02,"decoupled":true}
Sequence Length: sequence_length
  train_length: null
  eval_length: 1024

Novel Contributions

  • Sliding window evaluation with stride 64, so each token is scored with 960+ tokens of context
  • FP16 tied embedding export to avoid int8 quantization errors in input/output paths
  • Increasing the model from 9 to 10 transformer layers
  • Decoupled weight decay for the Muon optimizer
  • Overtone spectral embedding initialization with power-law SVD spectrum shaping
  • Phase-transition residual mixing initialization
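The sigmoid-scheduled resid_mix initialization above can be sketched as follows (a hypothetical reading: the PR does not spell out the schedule, and `midpoint`/`temperature` are assumed knobs). Each layer gets an initial mixing coefficient m_l for a residual update of the form h = (1 − m_l)·x + m_l·block(x), transitioning sharply (the "phase transition") around a chosen depth.

```python
import math

def resid_mix_schedule(n_layers, midpoint=0.5, temperature=0.1):
    """Sigmoid schedule over relative depth: early layers start close
    to identity (m near 0), late layers close to the block output
    (m near 1), with a sharp transition around `midpoint`."""
    mix = []
    for layer in range(n_layers):
        t = layer / max(n_layers - 1, 1)  # relative depth in [0, 1]
        mix.append(1.0 / (1.0 + math.exp(-(t - midpoint) / temperature)))
    return mix
```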