PR #1739 (open)
Submission: SP8192 + Depth Recurrence + Muon 0.99 (1.1497 pre-quant BPB)
by DevelopedByAnurag
val_bpb: 1.1497
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16,077,239 bytes
Training Techniques
Architecture
depth recurrence
Re-runs transformer layers 4 and 5 during the forward pass to create a deeper virtual network without adding parameters.
parameters: {"layers":[4,5],"virtual_layers":11,"physical_layers":9}
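The 9-physical / 11-virtual layer count implies the recurrent block of layers 4 and 5 runs twice. A minimal sketch of the execution schedule, assuming the block is re-run immediately after its first pass (the PR states only that layers 4 and 5 are re-run):

```python
def build_schedule(physical_layers, recur_block):
    # Virtual execution order: every physical layer runs once, and the
    # recurrent block runs a second time right after its first pass.
    # The exact interleaving is an assumption.
    order = []
    for i in range(physical_layers):
        order.append(i)
        if i == recur_block[-1]:
            order.extend(recur_block)  # re-run the block once
    return order

schedule = build_schedule(9, [4, 5])
print(schedule)       # [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8]
print(len(schedule))  # 11 virtual layers from 9 physical, no extra params
```

The forward pass simply indexes the physical layer list with this schedule, so parameter count and artifact size are unchanged.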
SmearGate
Learned per-dimension sigmoid gate after the embedding layer that blends each token representation with its predecessor.
parameters: null
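A minimal NumPy sketch of the gate, assuming a convex blend `(1 - g) * x_t + g * x_{t-1}` with the first token blended with itself (the PR specifies only a learned per-dimension sigmoid gate mixing each token with its predecessor):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, gate_logits):
    # x: (T, d) token embeddings; gate_logits: (d,) learned parameters.
    # The exact blend form is an assumption.
    g = sigmoid(gate_logits)                        # per-dimension gate in (0, 1)
    prev = np.concatenate([x[:1], x[:-1]], axis=0)  # shift right; token 0 keeps itself
    return (1.0 - g) * x + g * prev

T, d = 5, 8
x = np.random.randn(T, d)
out = smear_gate(x, np.zeros(d))  # zero logits -> g = 0.5 everywhere
```

With zero logits the gate is 0.5, so each token becomes the mean of itself and its predecessor; training moves the logits per dimension.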
Optimizer
Muon
weight_decay: 0.085
momentum: 0.99
other_params: {"warmup_steps":1500,"warmup_start_momentum":0.85,"warmdown_steps":3000}
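The listed values suggest momentum ramping from 0.85 to 0.99 over the first 1,500 steps. A sketch of both schedules, assuming linear ramps and assuming the warmdown applies to the learning rate (the PR does not say which quantity warms down):

```python
def momentum_at(step, warmup_steps=1500, start=0.85, target=0.99):
    # Momentum warmup from warmup_start_momentum to the final 0.99
    # (a linear ramp is an assumption).
    if step >= warmup_steps:
        return target
    return start + (target - start) * step / warmup_steps

def lr_scale_at(step, total_steps, warmdown_steps=3000):
    # Assumed: linear learning-rate decay to zero over the final 3000 steps.
    remaining = total_steps - step
    return min(1.0, max(0.0, remaining / warmdown_steps))

print(momentum_at(0), momentum_at(1500))  # 0.85 0.99
```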
Weight Averaging
EMA
parameters: {"decay":0.996}
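The EMA update with decay 0.996 is standard: after each optimizer step, the shadow weights move a small fraction toward the current weights, and evaluation uses the shadow copy. A minimal sketch:

```python
def ema_update(ema, params, decay=0.996):
    # Exponential moving average of weights:
    #   ema <- decay * ema + (1 - decay) * current_params
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

ema = [0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0])  # ema drifts slowly toward the live weights
```

At decay 0.996 the effective averaging window is roughly 1 / (1 - 0.996) = 250 steps.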
Evaluation
sliding window eval
parameters: {"stride":64}
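With stride 64, consecutive evaluation windows overlap and each token is scored only once, with close to a full left context. A sketch of the windowing arithmetic (everything beyond the stated stride of 64 is an assumption):

```python
def sliding_windows(n_tokens, context_len, stride=64):
    # Each window advances by `stride`; only tokens not scored by a
    # previous window contribute to the loss.
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_len, n_tokens)
        yield begin, end, end - prev_end  # (window start, end, newly scored)
        prev_end = end
        if end == n_tokens:
            break

wins = list(sliding_windows(300, 128, stride=64))
print(wins)  # first window scores 128 tokens, later windows score 64 each
```

Every token is scored exactly once, so the per-window "newly scored" counts sum to the sequence length.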
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
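The contributions list names this as per-row INT8 quantization followed by zlib compression. A minimal sketch, assuming symmetric quantization with one float scale per row (the zlib level is not stated in the PR, so the default is used):

```python
import zlib
import numpy as np

def quantize_int8_per_row(w):
    # Symmetric per-row int8 quantization: one float32 scale per row,
    # chosen so the row's max magnitude maps to 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0  # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

w = np.random.randn(4, 16).astype(np.float32)
q, scale = quantize_int8_per_row(w)
blob = zlib.compress(q.tobytes() + scale.tobytes())  # level unstated in the PR
recon = q.astype(np.float32) * scale                 # dequantize for eval
```

Per-row scales bound the quantization error at half a scale step per element, and the int8 bytes compress well, which is where the 16 MB artifact size comes from.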
Regularization
weight decay
parameters: {"muon_matrices":0.085,"embeddings":0.085,"adam_scalars":0.02}
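The three decay values imply three optimizer parameter groups. A sketch of the grouping, assuming name- and shape-based routing (the PR gives only the decay value per group):

```python
import numpy as np

def param_groups(named_params):
    # Route parameters to the three weight-decay groups listed above.
    # The routing rules here are assumptions.
    wd = {"muon_matrices": 0.085, "embeddings": 0.085, "adam_scalars": 0.02}
    groups = {k: [] for k in wd}
    for name, p in named_params:
        if "embed" in name:
            groups["embeddings"].append(p)
        elif p.ndim >= 2:
            groups["muon_matrices"].append(p)  # 2-D weights go to Muon
        else:
            groups["adam_scalars"].append(p)   # gains/biases go to Adam
    return [{"params": v, "weight_decay": wd[k]} for k, v in groups.items()]

named = [("embed.weight", np.zeros((8, 4))),
         ("blocks.0.attn.w", np.zeros((4, 4))),
         ("blocks.0.ln.gain", np.zeros(4))]
pg = param_groups(named)
```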
Novel Contributions
- SP8192 SentencePiece vocabulary scaling
- Depth recurrence on layers 4 and 5
- Muon momentum tuning to 0.99 with warmup and warmdown schedules
- SmearGate embedding-level gating
- EMA weight averaging
- Sliding window evaluation with stride 64
- INT8 per-row quantization with zlib compression