PR #1724
openSP8192 + 9-Layer + Breadcrumb Gating + EMA + Stochastic Depth - 1.1803 BPB (legal)
by UnwindologyView on GitHub
val_bpb: 1.1803
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,880,130 bytes
Training Techniques
Architecture
weight tying
Tied input and output embeddings.
parameters: null
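A minimal sketch of weight tying, with illustrative names: a single matrix serves as the input embedding table and, transposed, as the output logit projection, halving the parameter count of those two layers.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((100, 32))    # shared (vocab, d_model) matrix

def embed(token_ids):
    # input side: row lookup into the shared matrix
    return W[token_ids]

def logits(h):
    # output side: project hidden states back onto the vocabulary
    return h @ W.T
```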
GQA
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
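A sketch of grouped-query attention with the listed head counts (8 query heads, 4 KV heads); causal masking is omitted for brevity and the function name is illustrative.

```python
import numpy as np

def gqa_attention(q, k, v):
    # q: (T, 8, d) query heads; k, v: (T, 4, d) shared KV heads.
    group = q.shape[1] // k.shape[1]        # queries per KV head (here 2)
    k = np.repeat(k, group, axis=1)         # expand KV heads to match queries
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)           # softmax over keys (non-causal)
    return np.einsum('hqk,khd->qhd', w, v)
```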
Partial RoPE
Uses partial rotary positional embeddings.
parameters: null
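A sketch of partial RoPE: only the first `rotary_dim` channels are rotated and the rest pass through untouched. The entry does not state the rotary fraction, so `rotary_dim` here is a free parameter.

```python
import numpy as np

def partial_rope(x, rotary_dim, base=10000.0):
    # x: (T, d). Rotate only the first rotary_dim channels.
    T, d = x.shape
    rot, rest = x[:, :rotary_dim], x[:, rotary_dim:]
    half = rotary_dim // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair frequencies
    ang = np.arange(T)[:, None] * freqs[None, :]     # (T, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = rot[:, :half], rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, rest], axis=1)
```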
breadcrumb gating
Learned sigmoid gate on each MLP contribution for residual regularization.
parameters: null
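A minimal sketch of the gating described above, assuming a learned scalar logit per layer (the entry does not specify gate granularity, so per-channel gating is equally plausible): the MLP branch is scaled by a sigmoid gate before being added to the residual stream.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def breadcrumb_gate(x, mlp_out, gate_logit):
    # Residual update with a learned sigmoid gate on the MLP branch.
    # A gate_logit driven negative lets a layer's MLP contribution fade,
    # regularizing how much each layer adds to the stream.
    return x + sigmoid(gate_logit) * mlp_out
```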
Weight Averaging
EMA
parameters: {"decay":0.997}
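The EMA update with the listed decay of 0.997 amounts to one line per tensor; a minimal sketch:

```python
def ema_update(avg, params, decay=0.997):
    # avg <- decay * avg + (1 - decay) * params, applied per tensor.
    # The averaged weights (not the raw ones) are used for evaluation.
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```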
Regularization
stochastic depth
parameters: null
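A sketch of stochastic depth on a residual branch (survival probability is not given in the entry, so it is a parameter here): during training the whole branch is skipped at random; at evaluation the branch is kept but scaled by its survival probability.

```python
import numpy as np

def stochastic_depth(x, branch_fn, survival_prob, training, rng):
    # Training: drop the residual branch with prob 1 - survival_prob.
    # Eval: keep the branch, scaled by survival_prob to match expectations.
    if training:
        if rng.random() < survival_prob:
            return x + branch_fn(x)
        return x
    return x + survival_prob * branch_fn(x)
```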
logit softcap
parameters: {"value":30}
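Logit softcapping with the listed cap of 30 bounds logits smoothly via tanh; near zero it is approximately the identity, so ordinary logits are barely perturbed:

```python
import numpy as np

def softcap(logits, cap=30.0):
    # Smoothly squashes logits into (-cap, cap); approximately
    # the identity for |logits| << cap.
    return cap * np.tanh(logits / cap)
```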
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"newton_schulz_steps":5,"warmup_momentum_start":0.85,"warmup_steps":500,"adamw_for":["embeddings","scalars"]}
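The core of Muon is an orthogonalized momentum update: each gradient matrix is pushed toward the nearest (semi-)orthogonal matrix by a few Newton-Schulz iterations (5 here, per `newton_schulz_steps`). The sketch below uses the simpler cubic iteration for clarity; Muon itself uses a tuned quintic polynomial, and per the entry AdamW handles embeddings and scalar parameters instead.

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    # Each cubic Newton-Schulz step pushes the singular values of X
    # toward 1, approximating the nearest orthogonal matrix to G.
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so singular values <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```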
Quantization
int6
bits: 6
scope: all
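A sketch of int6 quantization matching the listed `bits: 6` (per-tensor symmetric scaling is an assumption; the entry does not specify the scheme): values are rounded into the 6-bit signed range [-31, 31] with one float scale per tensor.

```python
import numpy as np

def quantize_int6(w):
    # Symmetric per-tensor quantization into the 6-bit range [-31, 31].
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```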
Compression
zlib
level: null
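Packaging the quantized weights is a plain zlib round trip; the entry leaves the compression level unspecified, so level 9 below is an assumption.

```python
import zlib
import numpy as np

def package(q, level=9):
    # zlib-compress the quantized weight bytes for the final artifact.
    return zlib.compress(q.tobytes(), level)

def unpackage(blob, shape):
    # Decompress and restore the original int8-stored int6 tensor.
    return np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
```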
Evaluation
sliding window eval
parameters: {"stride":64}
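With a stride of 64, evaluation windows overlap heavily so that each token is scored with long left context while being counted exactly once. A sketch of the span bookkeeping (the window size of 1024 is assumed from `train_length`, since `eval_length` is null):

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    # Windows advance by `stride`; each scores only the tokens not
    # already covered by the previous window, so every token is scored
    # exactly once with up to `window` tokens of left context.
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))  # (ctx_start, ctx_end, n_scored)
        prev_end = end
        if end == n_tokens:
            break
    return spans
```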
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_steps":1200}
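The schedule is linear warmup, a flat plateau, then a linear "warmdown" to zero over the final steps. A sketch with the listed 20 warmup and 1200 warmdown steps (`total_steps` and `base_lr` are placeholders; the entry specifies neither):

```python
def lr_at(step, total_steps, base_lr=1.0, warmup_steps=20, warmdown_steps=1200):
    # Linear warmup -> flat plateau -> linear decay to zero.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    remaining = total_steps - step
    if remaining < warmdown_steps:
        return base_lr * remaining / warmdown_steps
    return base_lr
```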
Novel Contributions
- SP8192 tokenizer with byte fallback
- Breadcrumb gating on MLP residual contributions
- EMA weight averaging with decay 0.997
- Stochastic depth regularization
- Muon optimizer with Newton-Schulz updates
- Int6 quantization with zlib packaging
- Sliding-window evaluation with stride 64