PR #1822 (open)

Record: SP8192 + 9L Breadcrumb + EMA + StochDepth — val_bpb 1.17845772 (legal)

by UnwindologyView on GitHub

val_bpb: 1.1785
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,897,032 bytes

Training Techniques

Architecture
  • weight tying: input and output embeddings share one weight matrix (parameters: null)
  • Partial RoPE: rotary positional embeddings applied to only part of each head's dimensions (parameters: null)
  • GQA: grouped query attention with fewer KV heads than query heads (parameters: {"layers":9,"dimensions":512,"heads":8,"kv_heads":4})
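For context, the grouped-query attention in this config (8 query heads sharing 4 KV heads at dim 512) can be sketched as follows; the function and weight names are illustrative, not taken from the submission:

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, heads=8, kv_heads=4):
    """Grouped-query attention sketch: 8 query heads share 4 KV heads,
    so each KV head serves heads // kv_heads = 2 query heads.
    Shapes follow the PR's config: dim 512, 8 heads, 4 KV heads."""
    T, d = x.shape
    hd = d // heads                       # per-head dim (512 / 8 = 64)
    q = (x @ wq).reshape(T, heads, hd)
    k = (x @ wk).reshape(T, kv_heads, hd)
    v = (x @ wv).reshape(T, kv_heads, hd)
    group = heads // kv_heads
    k = np.repeat(k, group, axis=1)       # expand KV heads to match query heads
    v = np.repeat(v, group, axis=1)
    att = np.einsum('thd,shd->hts', q, k) / np.sqrt(hd)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # causal mask
    att = np.where(mask[None], -1e9, att)
    att = np.exp(att - att.max(-1, keepdims=True))
    att /= att.sum(-1, keepdims=True)
    return np.einsum('hts,shd->thd', att, v).reshape(T, d)
```

The KV projections are half the size of the query projection (256 vs 512 output dims), which is where GQA saves parameters and KV-cache memory.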
Regularization
  • logit softcap: output logits smoothly clipped via tanh (parameters: {"value":30})
  • stochastic depth: residual blocks randomly dropped during training, with expected-value scaling (parameters: {"type":"stochastic depth","expected_value_scaling":true})
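A minimal sketch of the two regularizers, assuming the standard tanh form of logit soft-capping and inverted scaling for stochastic depth; the keep probability is a made-up value, since the PR only records that expected-value scaling is enabled:

```python
import math, random

def softcap(logit, cap=30.0):
    """Logit soft-capping: squashes a logit smoothly into (-cap, cap)."""
    return cap * math.tanh(logit / cap)

def stochastic_depth_residual(x, block, keep_prob=0.9, training=True):
    """Stochastic depth on a residual branch. During training the block
    is skipped with probability 1 - keep_prob; a surviving branch is
    scaled by 1 / keep_prob so the expected value matches inference
    (the 'expected_value_scaling' flag above). keep_prob is assumed."""
    if training and random.random() >= keep_prob:
        return x                          # drop the block entirely
    scale = 1.0 / keep_prob if training else 1.0
    return x + scale * block(x)
```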
Weight Averaging
  • EMA: exponential moving average of weights (parameters: {"decay":0.997})
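The EMA update itself is one line per step; decay 0.997 averages over roughly the last 1/(1 - 0.997) ≈ 333 steps, which suits a short, noisy run:

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step per parameter: ema <- decay * ema + (1 - decay) * p.
    Evaluation uses the EMA weights rather than the raw weights."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```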
Optimizer
  • Muon: momentum 0.95, weight_decay null (other_params: {"newton_schulz_steps":5,"warmup_momentum":0.85,"warmup_steps":500,"scope":"matrix weights only"})
  • AdamW: momentum null, weight_decay null (other_params: {"scope":"embeddings and scalars"})
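Muon orthogonalizes each matrix-shaped gradient with a few Newton-Schulz iterations before applying it. A sketch, assuming the commonly used quintic coefficients (the PR does not list them) and an illustrative learning rate:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a gradient matrix via Newton-Schulz
    iteration, as Muon does for matrix weights. The quintic coefficients
    are the widely used Muon defaults; treat them as an assumption about
    this PR's exact implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)    # normalize so singular values <= 1
    transpose = x.shape[0] > x.shape[1]
    if transpose:
        x = x.T                            # iterate on the smaller Gram matrix
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transpose else x

def muon_step(w, grad, buf, lr=0.02, momentum=0.95):
    """One Muon update: momentum accumulation, then an orthogonalized
    step. lr is illustrative, not from the PR."""
    buf = momentum * buf + grad
    return w - lr * newton_schulz_orthogonalize(buf), buf
```

Embeddings and scalars stay on AdamW because orthogonalization only makes sense for 2D weight matrices.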
Quantization
  • int6: bits 6, scope all
Compression
  • zlib (level: null)
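The two artifact-shrinking steps compose naturally: quantize weights to 6-bit integers, then zlib-compress the integer stream. A simplified sketch that stores one 6-bit value per byte; real 6-bit packing, and the PR's exact scheme, may differ:

```python
import struct, zlib

def quantize_int6(weights):
    """Symmetric per-tensor int6 quantization sketch: map floats to
    integers in [-31, 31], offset into [0, 62], store one value per
    byte, then zlib-compress. Prepends the float32 scale."""
    scale = max(abs(w) for w in weights) / 31.0 or 1.0
    q = bytes((round(w / scale) + 31) & 0x3F for w in weights)
    return struct.pack('f', scale) + zlib.compress(q, 9)

def dequantize_int6(blob, n):
    """Invert the sketch above: decompress, undo the offset, rescale."""
    scale = struct.unpack('f', blob[:4])[0]
    q = zlib.decompress(blob[4:])[:n]
    return [(b - 31) * scale for b in q]
```

Quantizing first helps zlib: the 6-bit codes have far lower entropy than raw float bytes, which is what squeezes the checkpoint under the 16 MB cap.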
Evaluation
  • sliding window eval (parameters: {"stride":64})
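Sliding-window eval with stride 64 scores every token exactly once while giving each scored token up to a full window of left context. A sketch of the span enumeration, assuming the eval window matches train_length 1024:

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    """Enumerate (start, end, score_from) spans for sliding-window eval:
    the model sees tokens [start, end) but only tokens [score_from, end)
    are scored, so each token is scored once with up to `window` tokens
    of left context. The stride of 64 is from the PR; window=1024 is an
    assumption matching train_length."""
    spans, pos = [], 0
    while pos < n_tokens:
        end = min(window, n_tokens) if not spans else min(pos + stride, n_tokens)
        start = max(0, end - window)
        spans.append((start, end, pos))
        pos = end
    return spans
```

A smaller stride means more forward passes but more context per scored token, which typically lowers measured bpb.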
Sequence Length
  • train_length 1024, eval_length null
LR Schedule
  • warmdown (parameters: {"warmup_steps":20,"warmdown_steps":1200})
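The schedule is trapezoidal: a short linear warmup, a flat plateau, then a linear warmdown to zero. A sketch with the PR's step counts and an illustrative peak LR:

```python
def lr_at(step, total_steps, max_lr=1.0, warmup_steps=20, warmdown_steps=1200):
    """Trapezoidal warmup/constant/warmdown schedule with the PR's
    parameters: linear warmup over 20 steps, constant, then linear
    decay to 0 over the final 1200 steps. max_lr is illustrative."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_steps:
        return max_lr * (total_steps - step) / warmdown_steps
    return max_lr
```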

Novel Contributions

  • SP8192 tokenizer with byte fallback and expanded 8192 vocabulary
  • Breadcrumb gating on MLP residual contributions
  • EMA combined with stochastic depth for a 600-second wallclock regime
  • Muon optimizer for matrix weights with AdamW for embeddings and scalars
  • Int6 quantization plus zlib compression to fit under the 16 MB artifact cap
  • Single-seed record run reaching val_bpb 1.17845772
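The PR does not spell out how breadcrumb gating works; one plausible reading is a learned per-layer scalar gate on each MLP block's residual contribution. The sketch below is purely a hypothetical illustration of that reading, with invented names and initialization:

```python
import math

class GatedMLPResidual:
    """Hypothetical sketch of 'breadcrumb gating': each layer's MLP
    output is scaled by a learned sigmoid gate before being added to
    the residual stream. The name, init, and mechanism are guesses,
    not the PR's stated design."""
    def __init__(self, gate_init=0.0):
        self.gate_logit = gate_init       # learned scalar, one per layer

    def forward(self, x, mlp_out):
        gate = 1.0 / (1.0 + math.exp(-self.gate_logit))   # sigmoid gate
        return [xi + gate * mi for xi, mi in zip(x, mlp_out)]
```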