PR #52

RECORD (closed)

New SOTA attempt (val_bpb=1.2014)

by spokane-way
val_bpb: 1.2014
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15868326 bytes

Training Techniques

Architecture: tied embeddings (input and output embeddings are tied)
  parameters: null
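Weight tying means the output (unembedding) projection reuses the input embedding matrix, so those parameters are stored once. A minimal NumPy sketch; the variable names and sizes are illustrative, not taken from the submission:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 50, 8

# One shared matrix serves both roles.
embed = rng.standard_normal((vocab, d_model)) * 0.02

def embed_tokens(token_ids):
    # Input side: look up rows of the shared matrix.
    return embed[token_ids]

def unembed(hidden):
    # Output side: project back to vocab logits with the same weights.
    return hidden @ embed.T

h = embed_tokens(np.array([3, 7]))   # shape (2, d_model)
logits = unembed(h)                  # shape (2, vocab)
```

Besides halving the embedding parameter count, tying couples the input and output representations, which is why the record gives the tied matrix its own learning rate (tied_embed_lr).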
Sequence Length
  train_length: 4096
  eval_length: null
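A train_length of 4096 means the tokenized corpus is cut into fixed 4096-token training sequences. A sketch of that chunking, assuming the common drop-the-ragged-tail convention (the run's actual data loader is not shown in the record):

```python
# Split a flat token stream into fixed-length training sequences.
TRAIN_LENGTH = 4096  # matches the record's train_length

def make_sequences(tokens, seq_len=TRAIN_LENGTH):
    # Keep only full windows; the ragged tail is dropped.
    n_full = len(tokens) // seq_len
    return [tokens[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

stream = list(range(10_000))   # stand-in for a tokenized corpus
seqs = make_sequences(stream)  # two full 4096-token sequences
```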
Optimizer: Muon
  weight_decay: null
  momentum: 0.99
  other_params: {"momentum_warmup_steps": 1500, "momentum_warmup_start": 0.92, "tied_embed_lr": 0.03, "matrix_lr": 0.02, "scalar_lr": 0.02, "train_batch_tokens": 393216}
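The other_params indicate that momentum is warmed up from 0.92 to its final 0.99 over the first 1500 steps. A sketch assuming linear interpolation (the exact warmup curve used by the run is not stated in the record):

```python
def momentum_at(step, warmup_steps=1500, start=0.92, final=0.99):
    """Linearly interpolate momentum during warmup, then hold the final value."""
    if step >= warmup_steps:
        return final
    frac = step / warmup_steps
    return start + frac * (final - start)
```

Starting momentum lower and ramping it up avoids amplifying the noisy early-training gradients with a near-1.0 momentum buffer.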
LR Schedule: warmdown
  parameters: {"warmdown_steps": 3000}
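A warmdown schedule holds the learning rate constant and then decays it over the final steps. A sketch assuming linear decay to zero over the last 3000 steps (the record only gives warmdown_steps; the decay shape and the 0.02 base rate, taken from matrix_lr above, are assumptions):

```python
def lr_at(step, total_steps, base_lr=0.02, warmdown_steps=3000):
    """Constant LR, then linear warmdown to 0 over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps
```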
Quantization: int8
  bits: 8
  scope: all
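Int8 quantization with scope "all" maps every weight tensor to 8-bit integers plus a scale. A minimal NumPy sketch of a symmetric per-tensor quantize-dequantize roundtrip; the scale choice here is an assumption, since the submission does not specify its exact scheme:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor scale so the largest weight maps to +/-127.
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.01], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)  # roundtripped weights, within half a quant step of w
```

The reported val_bpb is measured on the model after this roundtrip, so the quantization error is included in the score.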
Compression: zlib
  level: null
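The artifact size is measured on the zlib-compressed quantized weights. A sketch using Python's stdlib zlib; since the record lists level: null, the default compression level is used here:

```python
import zlib
import numpy as np

q = np.zeros(10_000, dtype=np.int8)  # stand-in for quantized weights
raw = q.tobytes()

compressed = zlib.compress(raw)      # default level, matching level: null
restored = np.frombuffer(zlib.decompress(compressed), dtype=np.int8)
```

Quantizing to int8 before compressing helps twice: the payload is a quarter the size of float32, and the coarser value grid is more repetitive, so zlib compresses it further.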

Novel Contributions

  • Longer training context with sequence length 4096
  • Aggressively tuned Muon optimizer momentum and learning rates
  • Reduced training batch tokens to increase update frequency
  • Extended momentum warmup to stabilize high-momentum training
  • Longer warmdown schedule suited to the shorter, wallclock-limited run
  • Int8 quantized roundtrip submission with improved post-quantization BPB
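The headline metric, val_bpb, is bits of model cross-entropy per byte of validation text, so lower is better. A hedged sketch of the conversion from mean per-token loss in nats; the numbers below are illustrative, not the run's actual eval statistics:

```python
import math

def bits_per_byte(mean_nats_per_token, tokens, text_bytes):
    """Convert mean cross-entropy (nats/token) to bits per byte of raw text."""
    total_bits = mean_nats_per_token * tokens / math.log(2)
    return total_bits / text_bytes

# Illustrative values only: ~1.665 nats/token at ~2 bytes/token
bpb = bits_per_byte(1.665, tokens=1_000, text_bytes=2_000)
```

Normalizing by raw bytes rather than tokens makes scores comparable across submissions with different tokenizers.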