PR #1822
openRecord: SP8192 + 9L Breadcrumb + EMA + StochDepth — val_bpb 1.17845772 (legal)
by UnwindologyView on GitHub
val_bpb: 1.1785
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,897,032 bytes
Training Techniques
Architecture
weight tying
Input and output embeddings share a single weight matrix.
parameters: null
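A minimal sketch of weight tying, assuming the PR's model dim of 512 and vocab of 8192 (the shared-matrix mechanics, not the PR's actual code):

```python
import numpy as np

# Illustrative sizes taken from elsewhere in this PR: vocab 8192, dim 512.
vocab, dim = 8192, 512
rng = np.random.default_rng(0)

# One shared matrix serves as both the input embedding and the output head.
embed = rng.standard_normal((vocab, dim)).astype(np.float32) * 0.02

def embed_tokens(token_ids):
    """Input embedding: row lookup into the shared matrix."""
    return embed[token_ids]

def logits(hidden):
    """Output projection reuses the same matrix transposed (weight tying)."""
    return hidden @ embed.T

h = embed_tokens(np.array([1, 2, 3]))   # (3, dim)
out = logits(h)                         # (3, vocab)
```

Tying halves the embedding parameter count, which matters under a 16 MB artifact cap.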
Partial RoPE
Uses partial rotary positional embeddings.
parameters: null
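The PR does not state the rotary fraction, so the sketch below assumes a hypothetical rotary_dim of 32 out of a 64-dim head (512 dims / 8 heads); only the mechanism of rotating a prefix of each head is from the source:

```python
import numpy as np

def partial_rope(x, pos, rotary_dim):
    """Apply rotary position embeddings to the first `rotary_dim` dims of
    each head; the remaining dims pass through unrotated (partial RoPE).
    x: (seq, head_dim); pos: (seq,) integer positions."""
    half = rotary_dim // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))   # (half,)
    ang = np.outer(pos, freqs)                          # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rotary_dim]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # Tail dims beyond rotary_dim are left untouched.
    return np.concatenate([rotated, x[:, rotary_dim:]], axis=-1)
```

At position 0 the rotation is the identity, and the un-rotated tail carries position-independent content.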
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"layers":9,"dimensions":512,"heads":8,"kv_heads":4}
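A sketch of the GQA sharing pattern with the PR's listed shapes (8 query heads, 4 KV heads, so each KV head serves 2 query heads); a loop over heads for clarity, not an efficient implementation:

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: each KV head is shared by
    n_heads // n_kv_heads query heads (here 8 // 4 = 2).
    q: (n_heads, seq, hd); k, v: (n_kv_heads, seq, hd)."""
    group = n_heads // n_kv_heads
    outs = []
    for h in range(n_heads):
        kv = h // group  # which KV head this query head reads
        scores = q[h] @ k[kv].T / np.sqrt(q.shape[-1])
        # causal mask: position i may only attend to positions j <= i
        mask = np.triu(np.full(scores.shape, -np.inf), k=1)
        w = np.exp(scores + mask)
        w /= w.sum(-1, keepdims=True)
        outs.append(w @ v[kv])
    return np.stack(outs)  # (n_heads, seq, hd)
```

Halving the KV heads halves the KV projection parameters and cache, again useful under the artifact cap.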
Regularization
logit softcap
Smoothly caps final logits via a scaled tanh.
parameters: {"value":30}
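The softcap with the PR's listed value of 30 reduces to one line; it bounds logits in (-30, 30) while staying near-identity for small values:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Soft-cap logits into (-cap, cap) via a scaled tanh,
    using the PR's parameter {"value": 30}."""
    return cap * np.tanh(logits / cap)
```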
dropout
Stochastic depth: whole residual branches are randomly dropped during training, with expected-value scaling on the kept branches.
parameters: {"type":"stochastic depth","expected_value_scaling":true}
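A sketch of stochastic depth on one residual block; the drop probability below is a placeholder, since the PR does not state the rate. The 1/(1-p) scaling is one reading of the "expected_value_scaling" flag, making the training-time expectation match eval behavior:

```python
import numpy as np

def residual_block(x, f, drop_prob, rng, training=True):
    """Stochastic depth on a residual branch: during training the branch
    is skipped entirely with probability drop_prob; when kept, it is
    scaled by 1/(1 - drop_prob) so E[output] matches eval behavior.
    drop_prob is illustrative; the PR does not state the actual rate."""
    if not training:
        return x + f(x)
    if rng.random() < drop_prob:
        return x  # drop the whole branch this step
    return x + f(x) / (1.0 - drop_prob)
```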
Weight Averaging
EMA
Exponential moving average of model weights maintained alongside training.
parameters: {"decay":0.997}
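The EMA update with the PR's decay of 0.997 is a one-liner per parameter; the averaged weights track training with a roughly 1/(1-0.997) ≈ 333-step horizon:

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step: blend each shadow parameter toward the live
    training parameter with the PR's decay of 0.997."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```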
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"newton_schulz_steps":5,"warmup_momentum":0.85,"warmup_steps":500,"scope":"matrix weights only"}
AdamW
weight_decay: null
momentum: null
other_params: {"scope":"embeddings and scalars"}
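Muon's core step orthogonalizes the momentum-buffered gradient of each matrix weight with a Newton-Schulz iteration (5 steps per this PR), while AdamW handles embeddings and scalars. A sketch of the orthogonalization follows; the quintic coefficients are taken from the public Muon reference implementation and are an assumption, not something stated in this PR:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize a gradient matrix via a quintic
    Newton-Schulz iteration (5 steps per the PR's newton_schulz_steps).
    Coefficients follow the public Muon reference implementation;
    treat them as an assumption."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius norm bounds the spectral norm
    if G.shape[0] > G.shape[1]:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.shape[0] > G.shape[1]:
        X = X.T
    return X
```

The iteration acts independently on each singular value, pushing them toward 1, so the update has a roughly uniform spectrum regardless of the raw gradient's conditioning.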
Quantization
int6
bits: 6
scope: all
Compression
zlib
level: null
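A sketch of the int6-plus-zlib artifact pipeline under one reading of "int6": symmetric per-tensor quantization to the signed 6-bit range, then zlib over the raw bytes. Bit-packing six bits per value is omitted for clarity; a real artifact would pack before compressing:

```python
import zlib
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor quantization to integers in [-31, 31]
    (signed 6-bit range, -32 unused for symmetry). One reading of
    the PR's int6 scheme, not its actual code."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def compress(q):
    """zlib over the quantized bytes (compression level not stated)."""
    return zlib.compress(q.tobytes())

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Quantized weights have low entropy, so zlib recovers some of the space the unpacked int8 container wastes.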
Evaluation
sliding window eval
parameters: {"stride":64}
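A sketch of how sliding-window evaluation spans could be laid out with the PR's stride of 64: each window scores only its final stride of tokens, so every token is scored exactly once with up to a full window of context. The window size of 1024 mirrors the train length, since the eval length is not stated:

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    """Yield (context_start, score_start, end) spans for sliding-window
    eval: each window scores only its last `stride` tokens, giving every
    token up to `window` tokens of left context. window=1024 assumes the
    train length; the PR leaves eval_length null."""
    spans = []
    for score_start in range(0, n_tokens, stride):
        end = min(score_start + stride, n_tokens)
        context_start = max(0, end - window)
        spans.append((context_start, score_start, end))
    return spans
```

A small stride trades eval compute for longer average context per scored token, which lowers measured bpb.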
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_steps":1200}
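The warmup/warmdown parameters suggest a trapezoidal schedule: 20 steps of linear warmup, a flat phase, then a 1200-step linear warmdown to zero. A sketch under that assumption, with base_lr and total_steps as placeholders the PR does not state:

```python
def lr_at(step, total_steps, base_lr=1.0, warmup=20, warmdown=1200):
    """Trapezoidal LR under the PR's {"warmup_steps":20,"warmdown_steps":1200}:
    linear warmup, flat at base_lr, linear warmdown over the final steps.
    base_lr and total_steps are placeholders, not stated in the PR."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    if step >= total_steps - warmdown:
        return base_lr * (total_steps - step) / warmdown
    return base_lr
```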
Novel Contributions
- SP8192 tokenizer with byte fallback and expanded 8192 vocabulary
- Breadcrumb gating on MLP residual contributions
- EMA combined with stochastic depth for a 600-second wallclock regime
- Muon optimizer for matrix weights with AdamW for embeddings and scalars
- Int6 quantization plus zlib compression to fit under the 16 MB artifact cap
- Single-seed record run reaching val_bpb 1.17845772