PR #1739 (open)
Submission: SP8192 + Depth Recurrence + Muon 0.99 (1.1497 pre-quant BPB)
by DevelopedByAnurag
val_bpb: 1.1497
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16,077,239 bytes
Training Techniques
Architecture
depth recurrence
Re-runs transformer layers 4 and 5 during the forward pass to create a deeper virtual network without adding parameters.
parameters: {"layers":[4,5],"virtual_layers":11,"physical_layers":9}
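The 9-physical / 11-virtual layer count implies the recurrent block of layers 4 and 5 runs twice. A minimal sketch of the execution schedule, assuming the block is re-run immediately after its first pass (the PR states only that layers 4 and 5 are re-run):

```python
def build_schedule(physical_layers, recur_block):
    # Virtual execution order: every physical layer runs once, and the
    # recurrent block runs a second time right after its first pass.
    # The exact interleaving is an assumption.
    order = []
    for i in range(physical_layers):
        order.append(i)
        if i == recur_block[-1]:
            order.extend(recur_block)  # re-run the block once
    return order

schedule = build_schedule(9, [4, 5])
print(schedule)       # [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8]
print(len(schedule))  # 11 virtual layers from 9 physical, no extra params
```

The forward pass simply indexes the physical layer list with this schedule, so parameter count and artifact size are unchanged.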
SmearGate
Learned per-dimension sigmoid gate after the embedding layer that blends each token representation with its predecessor.
parameters: null
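A minimal NumPy sketch of the gate, assuming a convex blend `(1 - g) * x_t + g * x_{t-1}` with the first token blended with itself (the PR specifies only a learned per-dimension sigmoid gate mixing each token with its predecessor):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, gate_logits):
    # x: (T, d) token embeddings; gate_logits: (d,) learned parameters.
    # The exact blend form is an assumption.
    g = sigmoid(gate_logits)                        # per-dimension gate in (0, 1)
    prev = np.concatenate([x[:1], x[:-1]], axis=0)  # shift right; token 0 keeps itself
    return (1.0 - g) * x + g * prev

T, d = 5, 8
x = np.random.randn(T, d)
out = smear_gate(x, np.zeros(d))  # zero logits -> g = 0.5 everywhere
```

With zero logits the gate is 0.5, so each token becomes the mean of itself and its predecessor; training moves the logits per dimension.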
Optimizer
Muon
weight_decay: 0.085
momentum: 0.99
other_params: {"warmup_steps":1500,"warmup_start_momentum":0.85,"warmdown_steps":3000}
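The listed values suggest momentum ramping from 0.85 to 0.99 over the first 1,500 steps. A sketch of both schedules, assuming linear ramps and assuming the warmdown applies to the learning rate (the PR does not say which quantity warms down):

```python
def momentum_at(step, warmup_steps=1500, start=0.85, target=0.99):
    # Momentum warmup from warmup_start_momentum to the final 0.99
    # (a linear ramp is an assumption).
    if step >= warmup_steps:
        return target
    return start + (target - start) * step / warmup_steps

def lr_scale_at(step, total_steps, warmdown_steps=3000):
    # Assumed: linear learning-rate decay to zero over the final 3000 steps.
    remaining = total_steps - step
    return min(1.0, max(0.0, remaining / warmdown_steps))

print(momentum_at(0), momentum_at(1500))  # 0.85 0.99
```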
Weight Averaging
EMA
parameters: {"decay":0.996}
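The EMA update with decay 0.996 is standard: after each optimizer step, the shadow weights move a small fraction toward the current weights, and evaluation uses the shadow copy. A minimal sketch:

```python
def ema_update(ema, params, decay=0.996):
    # Exponential moving average of weights:
    #   ema <- decay * ema + (1 - decay) * current_params
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

ema = [0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0])  # ema drifts slowly toward the live weights
```

At decay 0.996 the effective averaging window is roughly 1 / (1 - 0.996) = 250 steps.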
Evaluation
sliding window eval
parameters: {"stride":64}
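With stride 64, consecutive evaluation windows overlap and each token is scored only once, with close to a full left context. A sketch of the windowing arithmetic (everything beyond the stated stride of 64 is an assumption):

```python
def sliding_windows(n_tokens, context_len, stride=64):
    # Each window advances by `stride`; only tokens not scored by a
    # previous window contribute to the loss.
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_len, n_tokens)
        yield begin, end, end - prev_end  # (window start, end, newly scored)
        prev_end = end
        if end == n_tokens:
            break

wins = list(sliding_windows(300, 128, stride=64))
print(wins)  # first window scores 128 tokens, later windows score 64 each
```

Every token is scored exactly once, so the per-window "newly scored" counts sum to the sequence length.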
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
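The contributions list names this as per-row INT8 quantization followed by zlib compression. A minimal sketch, assuming symmetric quantization with one float scale per row (the zlib level is not stated in the PR, so the default is used):

```python
import zlib
import numpy as np

def quantize_int8_per_row(w):
    # Symmetric per-row int8 quantization: one float32 scale per row,
    # chosen so the row's max magnitude maps to 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0  # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

w = np.random.randn(4, 16).astype(np.float32)
q, scale = quantize_int8_per_row(w)
blob = zlib.compress(q.tobytes() + scale.tobytes())  # level unstated in the PR
recon = q.astype(np.float32) * scale                 # dequantize for eval
```

Per-row scales bound the quantization error at half a scale step per element, and the int8 bytes compress well, which is where the 16 MB artifact size comes from.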
Regularization
weight decay
parameters: {"muon_matrices":0.085,"embeddings":0.085,"adam_scalars":0.02}
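The three decay values imply three optimizer parameter groups. A sketch of the grouping, assuming name- and shape-based routing (the PR gives only the decay value per group):

```python
import numpy as np

def param_groups(named_params):
    # Route parameters to the three weight-decay groups listed above.
    # The routing rules here are assumptions.
    wd = {"muon_matrices": 0.085, "embeddings": 0.085, "adam_scalars": 0.02}
    groups = {k: [] for k in wd}
    for name, p in named_params:
        if "embed" in name:
            groups["embeddings"].append(p)
        elif p.ndim >= 2:
            groups["muon_matrices"].append(p)  # 2-D weights go to Muon
        else:
            groups["adam_scalars"].append(p)   # gains/biases go to Adam
    return [{"params": v, "weight_decay": wd[k]} for k, v in groups.items()]

named = [("embed.weight", np.zeros((8, 4))),
         ("blocks.0.attn.w", np.zeros((4, 4))),
         ("blocks.0.ln.gain", np.zeros(4))]
pg = param_groups(named)
```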
Novel Contributions
- SP8192 SentencePiece vocabulary scaling
- Depth recurrence on layers 4 and 5
- Muon momentum tuning to 0.99 with warmup and warmdown schedules
- SmearGate embedding-level gating
- EMA weight averaging
- Sliding window evaluation with stride 64
- INT8 per-row quantization with zlib compression