PR #1822 (open)

Record: SP8192 + 9L Breadcrumb + EMA + StochDepth — val_bpb 1.17845772 (legal)

by UnwindologyView on GitHub

val_bpb: 1.1785
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,897,032 bytes

Training Techniques

Architecture
  • weight tying: input and output embeddings share one weight matrix (parameters: null)
  • Partial RoPE: rotary positional embeddings applied to only part of each head's dimensions (parameters: null)
  • GQA: grouped query attention with fewer KV heads than query heads (parameters: {"layers":9,"dimensions":512,"heads":8,"kv_heads":4})
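For context, the grouped-query attention in this config (8 query heads sharing 4 KV heads at dim 512) can be sketched as follows; the function and weight names are illustrative, not taken from the submission:

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, heads=8, kv_heads=4):
    """Grouped-query attention sketch: 8 query heads share 4 KV heads,
    so each KV head serves heads // kv_heads = 2 query heads.
    Shapes follow the PR's config: dim 512, 8 heads, 4 KV heads."""
    T, d = x.shape
    hd = d // heads                       # per-head dim (512 / 8 = 64)
    q = (x @ wq).reshape(T, heads, hd)
    k = (x @ wk).reshape(T, kv_heads, hd)
    v = (x @ wv).reshape(T, kv_heads, hd)
    group = heads // kv_heads
    k = np.repeat(k, group, axis=1)       # expand KV heads to match query heads
    v = np.repeat(v, group, axis=1)
    att = np.einsum('thd,shd->hts', q, k) / np.sqrt(hd)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # causal mask
    att = np.where(mask[None], -1e9, att)
    att = np.exp(att - att.max(-1, keepdims=True))
    att /= att.sum(-1, keepdims=True)
    return np.einsum('hts,shd->thd', att, v).reshape(T, d)
```

The KV projections are half the size of the query projection (256 vs 512 output dims), which is where GQA saves parameters and KV-cache memory.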
Regularization
  • logit softcap: output logits smoothly clipped via tanh (parameters: {"value":30})
  • stochastic depth: residual blocks randomly dropped during training, with expected-value scaling (parameters: {"type":"stochastic depth","expected_value_scaling":true})
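A minimal sketch of the two regularizers, assuming the standard tanh form of logit soft-capping and inverted scaling for stochastic depth; the keep probability is a made-up value, since the PR only records that expected-value scaling is enabled:

```python
import math, random

def softcap(logit, cap=30.0):
    """Logit soft-capping: squashes a logit smoothly into (-cap, cap)."""
    return cap * math.tanh(logit / cap)

def stochastic_depth_residual(x, block, keep_prob=0.9, training=True):
    """Stochastic depth on a residual branch. During training the block
    is skipped with probability 1 - keep_prob; a surviving branch is
    scaled by 1 / keep_prob so the expected value matches inference
    (the 'expected_value_scaling' flag above). keep_prob is assumed."""
    if training and random.random() >= keep_prob:
        return x                          # drop the block entirely
    scale = 1.0 / keep_prob if training else 1.0
    return x + scale * block(x)
```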
Weight Averaging
  • EMA: exponential moving average of weights (parameters: {"decay":0.997})
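The EMA update itself is one line per step; decay 0.997 averages over roughly the last 1/(1 - 0.997) ≈ 333 steps, which suits a short, noisy run:

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step per parameter: ema <- decay * ema + (1 - decay) * p.
    Evaluation uses the EMA weights rather than the raw weights."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```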
Optimizer
  • Muon: momentum 0.95, weight_decay null (other_params: {"newton_schulz_steps":5,"warmup_momentum":0.85,"warmup_steps":500,"scope":"matrix weights only"})
  • AdamW: momentum null, weight_decay null (other_params: {"scope":"embeddings and scalars"})
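Muon orthogonalizes each matrix-shaped gradient with a few Newton-Schulz iterations before applying it. A sketch, assuming the commonly used quintic coefficients (the PR does not list them) and an illustrative learning rate:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a gradient matrix via Newton-Schulz
    iteration, as Muon does for matrix weights. The quintic coefficients
    are the widely used Muon defaults; treat them as an assumption about
    this PR's exact implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)    # normalize so singular values <= 1
    transpose = x.shape[0] > x.shape[1]
    if transpose:
        x = x.T                            # iterate on the smaller Gram matrix
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transpose else x

def muon_step(w, grad, buf, lr=0.02, momentum=0.95):
    """One Muon update: momentum accumulation, then an orthogonalized
    step. lr is illustrative, not from the PR."""
    buf = momentum * buf + grad
    return w - lr * newton_schulz_orthogonalize(buf), buf
```

Embeddings and scalars stay on AdamW because orthogonalization only makes sense for 2D weight matrices.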
Quantization
  • int6: bits 6, scope all
Compression
  • zlib (level: null)
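The two artifact-shrinking steps compose naturally: quantize weights to 6-bit integers, then zlib-compress the integer stream. A simplified sketch that stores one 6-bit value per byte; real 6-bit packing, and the PR's exact scheme, may differ:

```python
import struct, zlib

def quantize_int6(weights):
    """Symmetric per-tensor int6 quantization sketch: map floats to
    integers in [-31, 31], offset into [0, 62], store one value per
    byte, then zlib-compress. Prepends the float32 scale."""
    scale = max(abs(w) for w in weights) / 31.0 or 1.0
    q = bytes((round(w / scale) + 31) & 0x3F for w in weights)
    return struct.pack('f', scale) + zlib.compress(q, 9)

def dequantize_int6(blob, n):
    """Invert the sketch above: decompress, undo the offset, rescale."""
    scale = struct.unpack('f', blob[:4])[0]
    q = zlib.decompress(blob[4:])[:n]
    return [(b - 31) * scale for b in q]
```

Quantizing first helps zlib: the 6-bit codes have far lower entropy than raw float bytes, which is what squeezes the checkpoint under the 16 MB cap.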
Evaluation
  • sliding window eval (parameters: {"stride":64})
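Sliding-window eval with stride 64 scores every token exactly once while giving each scored token up to a full window of left context. A sketch of the span enumeration, assuming the eval window matches train_length 1024:

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    """Enumerate (start, end, score_from) spans for sliding-window eval:
    the model sees tokens [start, end) but only tokens [score_from, end)
    are scored, so each token is scored once with up to `window` tokens
    of left context. The stride of 64 is from the PR; window=1024 is an
    assumption matching train_length."""
    spans, pos = [], 0
    while pos < n_tokens:
        end = min(window, n_tokens) if not spans else min(pos + stride, n_tokens)
        start = max(0, end - window)
        spans.append((start, end, pos))
        pos = end
    return spans
```

A smaller stride means more forward passes but more context per scored token, which typically lowers measured bpb.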
Sequence Length
  • train_length 1024, eval_length null
LR Schedule
  • warmdown (parameters: {"warmup_steps":20,"warmdown_steps":1200})
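The schedule is trapezoidal: a short linear warmup, a flat plateau, then a linear warmdown to zero. A sketch with the PR's step counts and an illustrative peak LR:

```python
def lr_at(step, total_steps, max_lr=1.0, warmup_steps=20, warmdown_steps=1200):
    """Trapezoidal warmup/constant/warmdown schedule with the PR's
    parameters: linear warmup over 20 steps, constant, then linear
    decay to 0 over the final 1200 steps. max_lr is illustrative."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_steps:
        return max_lr * (total_steps - step) / warmdown_steps
    return max_lr
```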

Novel Contributions

  • SP8192 tokenizer with byte fallback and expanded 8192 vocabulary
  • Breadcrumb gating on MLP residual contributions
  • EMA combined with stochastic depth for a 600-second wallclock regime
  • Muon optimizer for matrix weights with AdamW for embeddings and scalars
  • Int6 quantization plus zlib compression to fit under the 16 MB artifact cap
  • Single-seed record run reaching val_bpb 1.17845772
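The PR does not spell out how breadcrumb gating works; one plausible reading is a learned per-layer scalar gate on each MLP block's residual contribution. The sketch below is purely a hypothetical illustration of that reading, with invented names and initialization:

```python
import math

class GatedMLPResidual:
    """Hypothetical sketch of 'breadcrumb gating': each layer's MLP
    output is scaled by a learned sigmoid gate before being added to
    the residual stream. The name, init, and mechanism are guesses,
    not the PR's stated design."""
    def __init__(self, gate_init=0.0):
        self.gate_logit = gate_init       # learned scalar, one per layer

    def forward(self, x, mlp_out):
        gate = 1.0 / (1.0 + math.exp(-self.gate_logit))   # sigmoid gate
        return [xi + gate * mi for xi, mi in zip(x, mlp_out)]
```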