PR #163

open

SwiGLU dim=576 + Sliding Window + Muon WD (1.2091 BPB)

by Focus2321
val_bpb: 1.2091
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.2 MB

Training Techniques

Architecture
tied embeddings
Keeps tok_emb.weight in fp16 instead of int8 to avoid quantization degradation in tied input/output embeddings.
parameters: null
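The embedding passthrough can be sketched as follows; a minimal illustration assuming symmetric per-tensor int8 quantization for everything except the tied embedding (`quantize_int8` and `quantize_model` are hypothetical helpers, not from the PR):

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor int8 quantization: scale so the max magnitude maps to 127.
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def quantize_model(params, keep_fp16=("tok_emb.weight",)):
    # Quantize every tensor to int8 except the tied embedding, which
    # stays fp16 to avoid degrading both input and output projections.
    out = {}
    for name, w in params.items():
        if name in keep_fp16:
            out[name] = w.astype(np.float16)
        else:
            out[name] = quantize_int8(w)
    return out
```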
RoPE
Raises the RoPE base from the conventional 10,000 to 50,000.
parameters: {"base":50000}
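For reference, rotary embeddings with the larger base look like this; a minimal NumPy sketch (the head dimension and sequence length below are illustrative, not from the PR):

```python
import numpy as np

def rope_freqs(head_dim, seq_len, base=50000.0):
    # One inverse frequency per rotated channel pair; larger base -> slower rotation.
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    angles = np.outer(np.arange(seq_len), inv_freq)  # (seq_len, head_dim // 2)
    return np.cos(angles), np.sin(angles)

def apply_rope(x, cos, sin):
    # x: (seq_len, head_dim); rotate each even/odd channel pair by its angle.
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```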
KV head count
Grouped-query attention: 8 query heads share 4 KV heads, halving KV-cache size.
parameters: {"num_heads":8,"num_kv_heads":4}
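With 8 query heads over 4 KV heads, each K/V head serves a group of 2 query heads; a minimal sketch of the expansion:

```python
import numpy as np

def expand_kv(kv, num_heads, num_kv_heads):
    # Repeat each KV head so consecutive query heads share it:
    # (num_kv_heads, seq_len, head_dim) -> (num_heads, seq_len, head_dim).
    assert num_heads % num_kv_heads == 0
    return np.repeat(kv, num_heads // num_kv_heads, axis=0)
```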
SwiGLU MLP
Uses a wider SwiGLU feed-forward block with multiplier 2.
parameters: {"layers":7,"dim":576,"mlp_mult":2}
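The SwiGLU block with mlp_mult=2 (hidden width 2 × 576 = 1152) can be sketched as follows; weight initialization here is purely illustrative:

```python
import numpy as np

def swiglu_mlp(x, w_gate, w_up, w_down):
    # SwiGLU: down-project the SiLU-gated elementwise product of two up-projections.
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))  # SiLU(x) = x * sigmoid(x)
    return (silu * (x @ w_up)) @ w_down
```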
Optimizer
Muon
weight_decay: 0.02
momentum: null
other_params: {"beta2":0.99}
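A minimal sketch of a Muon-style update with decoupled weight decay. The Newton-Schulz coefficients are those of the public Muon reference; momentum=0.95 is an assumed value, since the PR leaves it unspecified:

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    # Approximately orthogonalize G via a quintic Newton-Schulz iteration.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

def muon_step(w, grad, buf, lr=0.03, momentum=0.95, weight_decay=0.02):
    # Momentum buffer -> orthogonalized update -> decoupled weight decay.
    buf = momentum * buf + grad
    update = newton_schulz(buf)
    # Decoupled: decay shrinks the weights directly, not via the gradient.
    w = w * (1.0 - lr * weight_decay) - lr * update
    return w, buf
```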
Regularization
weight decay
parameters: {"value":0.02}
Evaluation
sliding window eval
parameters: {"stride":64}
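Sliding-window evaluation with stride 64 scores each token with near-maximal left context instead of chopping the text into disjoint blocks; a sketch of the span bookkeeping (the window length of 512 is illustrative — only the stride appears in the PR):

```python
def sliding_window_spans(n_tokens, window=512, stride=64):
    # Each window scores only its last `stride` tokens (the first window
    # scores everything), so every token is scored exactly once.
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        trg_len = end - prev_end  # tokens scored in this window
        spans.append((begin, end, end - trg_len))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```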
LR Schedule
warmdown
parameters: {"warmdown_frac":0.6}
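A sketch of the schedule, assuming warmdown_frac=0.6 means the LR decays linearly to zero over the final 60% of the run (`progress` here is the elapsed wallclock fraction):

```python
def warmdown_lr(progress, base_lr=0.03, warmdown_frac=0.6):
    # Constant LR for the first (1 - warmdown_frac) of training,
    # then linear decay to 0 over the final warmdown_frac.
    start = 1.0 - warmdown_frac
    if progress < start:
        return base_lr
    return base_lr * (1.0 - progress) / warmdown_frac
```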
Quantization
fp16
bits: 16
scope: embeddings
Other
other
Uses a wallclock-based warmdown spanning 60% of training, together with a larger batch size (262,144 tokens) and higher learning rates.
parameters: {"train_batch_tokens":262144,"matrix_lr":0.03,"scalar_lr":0.03,"tied_embed_lr":0.04}

Novel Contributions

  • Wider Transformer model with dim=576 and 7 layers using SwiGLU MLPs
  • Muon optimizer with decoupled weight decay 0.02
  • FP16 embedding passthrough to reduce tied-embedding quantization degradation
  • Sliding window evaluation with stride 64 for improved validation BPB
  • Wallclock-based warmdown spanning 60% of training
  • RoPE base 50K, beta2=0.99, and tuned batch/LR settings