PR #1724
openSP8192 + 9-Layer + Breadcrumb Gating + EMA + Stochastic Depth - 1.1803 BPB (legal)
by UnwindologyView on GitHub
val_bpb: 1.1803
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,880,130 bytes
Training Techniques
Architecture
weight tying
Tied input and output embeddings.
parameters: null
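A minimal sketch of weight tying, with illustrative names: a single matrix serves as the input embedding table and, transposed, as the output logit projection, halving the parameter count of those two layers.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((100, 32))    # shared (vocab, d_model) matrix

def embed(token_ids):
    # input side: row lookup into the shared matrix
    return W[token_ids]

def logits(h):
    # output side: project hidden states back onto the vocabulary
    return h @ W.T
```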
GQA
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
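A sketch of grouped-query attention with the listed head counts (8 query heads, 4 KV heads); causal masking is omitted for brevity and the function name is illustrative.

```python
import numpy as np

def gqa_attention(q, k, v):
    # q: (T, 8, d) query heads; k, v: (T, 4, d) shared KV heads.
    group = q.shape[1] // k.shape[1]        # queries per KV head (here 2)
    k = np.repeat(k, group, axis=1)         # expand KV heads to match queries
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)           # softmax over keys (non-causal)
    return np.einsum('hqk,khd->qhd', w, v)
```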
Partial RoPE
Uses partial rotary positional embeddings.
parameters: null
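A sketch of partial RoPE: only the first `rotary_dim` channels are rotated and the rest pass through untouched. The entry does not state the rotary fraction, so `rotary_dim` here is a free parameter.

```python
import numpy as np

def partial_rope(x, rotary_dim, base=10000.0):
    # x: (T, d). Rotate only the first rotary_dim channels.
    T, d = x.shape
    rot, rest = x[:, :rotary_dim], x[:, rotary_dim:]
    half = rotary_dim // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair frequencies
    ang = np.arange(T)[:, None] * freqs[None, :]     # (T, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = rot[:, :half], rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, rest], axis=1)
```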
breadcrumb gating
Learned sigmoid gate on each MLP contribution for residual regularization.
parameters: null
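A minimal sketch of the gating described above, assuming a learned scalar logit per layer (the entry does not specify gate granularity, so per-channel gating is equally plausible): the MLP branch is scaled by a sigmoid gate before being added to the residual stream.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def breadcrumb_gate(x, mlp_out, gate_logit):
    # Residual update with a learned sigmoid gate on the MLP branch.
    # A gate_logit driven negative lets a layer's MLP contribution fade,
    # regularizing how much each layer adds to the stream.
    return x + sigmoid(gate_logit) * mlp_out
```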
Weight Averaging
EMA
parameters: {"decay":0.997}
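The EMA update with the listed decay of 0.997 amounts to one line per tensor; a minimal sketch:

```python
def ema_update(avg, params, decay=0.997):
    # avg <- decay * avg + (1 - decay) * params, applied per tensor.
    # The averaged weights (not the raw ones) are used for evaluation.
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```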
Regularization
stochastic depth
parameters: null
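A sketch of stochastic depth on a residual branch (survival probability is not given in the entry, so it is a parameter here): during training the whole branch is skipped at random; at evaluation the branch is kept but scaled by its survival probability.

```python
import numpy as np

def stochastic_depth(x, branch_fn, survival_prob, training, rng):
    # Training: drop the residual branch with prob 1 - survival_prob.
    # Eval: keep the branch, scaled by survival_prob to match expectations.
    if training:
        if rng.random() < survival_prob:
            return x + branch_fn(x)
        return x
    return x + survival_prob * branch_fn(x)
```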
logit softcap
parameters: {"value":30}
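Logit softcapping with the listed cap of 30 bounds logits smoothly via tanh; near zero it is approximately the identity, so ordinary logits are barely perturbed:

```python
import numpy as np

def softcap(logits, cap=30.0):
    # Smoothly squashes logits into (-cap, cap); approximately
    # the identity for |logits| << cap.
    return cap * np.tanh(logits / cap)
```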
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"newton_schulz_steps":5,"warmup_momentum_start":0.85,"warmup_steps":500,"adamw_for":["embeddings","scalars"]}
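The core of Muon is an orthogonalized momentum update: each gradient matrix is pushed toward the nearest (semi-)orthogonal matrix by a few Newton-Schulz iterations (5 here, per `newton_schulz_steps`). The sketch below uses the simpler cubic iteration for clarity; Muon itself uses a tuned quintic polynomial, and per the entry AdamW handles embeddings and scalar parameters instead.

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    # Each cubic Newton-Schulz step pushes the singular values of X
    # toward 1, approximating the nearest orthogonal matrix to G.
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so singular values <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```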
Quantization
int6
bits: 6
scope: all
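A sketch of int6 quantization matching the listed `bits: 6` (per-tensor symmetric scaling is an assumption; the entry does not specify the scheme): values are rounded into the 6-bit signed range [-31, 31] with one float scale per tensor.

```python
import numpy as np

def quantize_int6(w):
    # Symmetric per-tensor quantization into the 6-bit range [-31, 31].
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```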
Compression
zlib
level: null
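Packaging the quantized weights is a plain zlib round trip; the entry leaves the compression level unspecified, so level 9 below is an assumption.

```python
import zlib
import numpy as np

def package(q, level=9):
    # zlib-compress the quantized weight bytes for the final artifact.
    return zlib.compress(q.tobytes(), level)

def unpackage(blob, shape):
    # Decompress and restore the original int8-stored int6 tensor.
    return np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
```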
Evaluation
sliding window eval
parameters: {"stride":64}
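With a stride of 64, evaluation windows overlap heavily so that each token is scored with long left context while being counted exactly once. A sketch of the span bookkeeping (the window size of 1024 is assumed from `train_length`, since `eval_length` is null):

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    # Windows advance by `stride`; each scores only the tokens not
    # already covered by the previous window, so every token is scored
    # exactly once with up to `window` tokens of left context.
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))  # (ctx_start, ctx_end, n_scored)
        prev_end = end
        if end == n_tokens:
            break
    return spans
```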
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_steps":1200}
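The schedule is linear warmup, a flat plateau, then a linear "warmdown" to zero over the final steps. A sketch with the listed 20 warmup and 1200 warmdown steps (`total_steps` and `base_lr` are placeholders; the entry specifies neither):

```python
def lr_at(step, total_steps, base_lr=1.0, warmup_steps=20, warmdown_steps=1200):
    # Linear warmup -> flat plateau -> linear decay to zero.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    remaining = total_steps - step
    if remaining < warmdown_steps:
        return base_lr * remaining / warmdown_steps
    return base_lr
```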
Novel Contributions
- SP8192 tokenizer with byte fallback
- Breadcrumb gating on MLP residual contributions
- EMA weight averaging with decay 0.997
- Stochastic depth regularization
- Muon optimizer with Newton-Schulz updates
- Int6 quantization with zlib packaging
- Sliding-window evaluation with stride 64