PR #184

closed

Record: Pre-Enrichment + Encoder Recurrence (val_bpb=1.1855)

val_bpb

1.1855

Architecture

Transformer

Optimizer

Muon

Artifact Size

15.75MB

Training Techniques

Architecture

pre-enrichment block

Two linear projections with a GELU activation, applied to the token embeddings before the transformer blocks to enrich the representations.

parameters: {"layers":2,"dimensions":512}
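A minimal sketch of such a block, in plain Python for illustration. The ordering Linear -> GELU -> Linear, the absence of bias terms, and the helper names (make_linear, pre_enrich) are assumptions, not taken from the PR. The width is shrunk from 512 to 8 so the example runs instantly.

```python
import math
import random

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def make_linear(d_in, d_out, rng):
    # Hypothetical init: uniform in [-1/sqrt(d_in), 1/sqrt(d_in)], no bias.
    scale = 1.0 / math.sqrt(d_in)
    return [[rng.uniform(-scale, scale) for _ in range(d_in)] for _ in range(d_out)]

def linear(x, weight):
    # weight has d_out rows, each of length d_in.
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def pre_enrich(embedding, proj1, proj2):
    # Assumed ordering: Linear -> GELU -> Linear, applied per token.
    hidden = [gelu(v) for v in linear(embedding, proj1)]
    return linear(hidden, proj2)

rng = random.Random(0)
dim = 8  # the record lists 512; reduced here for the demo
proj1 = make_linear(dim, dim, rng)
proj2 = make_linear(dim, dim, rng)
token_embedding = [rng.gauss(0.0, 1.0) for _ in range(dim)]
enriched = pre_enrich(token_embedding, proj1, proj2)
```

The enriched vector has the same width as the input, so it can feed the first transformer block unchanged.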

depth recurrence

Encoder blocks are reused for a second pass, with RMS normalization between passes for stability. This increases effective depth without adding parameters.

parameters: {"passes":2,"effective_layers":15,"physical_layers":10}
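The layer counts can be checked with a small sketch. If only the encoder blocks are repeated and the remaining blocks run once, then 2e + d = 15 and e + d = 10 give e = 5 encoder blocks and d = 5 others. That split is inferred from the listed numbers, not stated in the PR. The toy blocks and the forward function below are hypothetical.

```python
import math

def rms_norm(x, eps=1e-6):
    # Scale the vector to unit root-mean-square.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def toy_block(x):
    # Stand-in for a transformer block: a small residual update.
    return [v + 0.1 * math.tanh(v) for v in x]

def forward(x, encoder_blocks, decoder_blocks, passes=2):
    executed = 0
    for pass_index in range(passes):
        if pass_index > 0:
            # Normalize between passes to keep activations stable.
            x = rms_norm(x)
        for block in encoder_blocks:
            x = block(x)
            executed += 1
    for block in decoder_blocks:
        x = block(x)
        executed += 1
    return x, executed

encoder_blocks = [toy_block] * 5  # assumed split: 5 reused blocks
decoder_blocks = [toy_block] * 5  # assumed split: 5 single-pass blocks
out, executed = forward([0.5, -1.0, 2.0, 0.0], encoder_blocks, decoder_blocks)
```

Here 10 physical blocks produce 15 block applications, matching the listed effective depth.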

tied embeddings

Input and output embeddings are tied.

parameters: null
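A minimal sketch of weight tying, assuming the output logits are computed as the dot product of the hidden state with each row of the same embedding table. Sizes and names are illustrative only.

```python
import random

rng = random.Random(0)
vocab_size, dim = 6, 4  # toy sizes, not the PR's values

# One table serves both as input embedding and as output projection.
embedding = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_id):
    return embedding[token_id]

def output_logits(hidden):
    # Logit for each token is the dot product with that token's embedding row.
    return [sum(e * h for e, h in zip(row, hidden)) for row in embedding]

logits = output_logits(embed(3))
tied_params = vocab_size * dim
untied_params = 2 * vocab_size * dim
```

Tying halves the embedding-related parameter count, which matters when the artifact size is constrained.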

Optimizer

Muon

weight_decay: 0.02

momentum: null

other_params: {"decoupled":true}

Quantization

int8

bits: 8

scope: all
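One common way to realize 8-bit quantization is symmetric per-tensor scaling, sketched below. The PR does not say which int8 scheme was used (per-tensor, per-row, symmetric, or otherwise), so this is an assumption for illustration.

```python
def quantize_int8(values):
    # Symmetric per-tensor quantization: map [-max_abs, max_abs] to [-127, 127].
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.0, 0.3333, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The round-trip error per weight is bounded by half the scale, at a storage cost of one byte per weight plus one scale per tensor.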

Evaluation

sliding window eval

parameters: {"stride":64}
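A sketch of how a stride-64 sliding window can be laid out so that every token is scored exactly once while later tokens see extra left context. The context length of 128 and the function name are hypothetical; only the stride comes from the record.

```python
def sliding_windows(num_tokens, context_len, stride):
    """Return (window_start, window_end, score_start) triples.

    Each window feeds tokens [window_start, window_end) to the model,
    but only tokens [score_start, window_end) count toward the loss.
    """
    windows = []
    scored_until = 0
    start = 0
    while scored_until < num_tokens:
        end = min(start + context_len, num_tokens)
        windows.append((start, end, scored_until))
        scored_until = end
        start += stride
    return windows

windows = sliding_windows(num_tokens=300, context_len=128, stride=64)
tokens_scored = sum(end - score_start for _, end, score_start in windows)
```

After the first window, each step scores only the newest 64 tokens, so those tokens are predicted with at least 64 tokens of preceding context.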

Initialization

overtone embedding init

Non-standard embedding initialization used for the token embeddings.

LR Schedule

warmdown

parameters: {"warmdown_iters":2500,"warmup_steps":20}
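The schedule can be sketched as a learning-rate multiplier. The linear shapes of both ramps, the decay to zero, and the total step count are assumptions; only warmup_steps=20 and warmdown_iters=2500 come from the record.

```python
def lr_multiplier(step, total_steps, warmup_steps=20, warmdown_iters=2500):
    # Assumed linear warmup from 1/warmup_steps up to 1.0.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    # Assumed linear warmdown to 0 over the final warmdown_iters steps.
    warmdown_start = total_steps - warmdown_iters
    if step >= warmdown_start:
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0

total_steps = 10000  # hypothetical; the record does not list a step count
samples = {s: lr_multiplier(s, total_steps) for s in (0, 19, 5000, 8750, 10000)}
```

Multiplying the base learning rate by this value gives a short ramp, a flat plateau, and a decay over the last 2500 steps.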

Regularization

weight decay

parameters: {"value":0.02,"decoupled":true}
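Decoupled weight decay shrinks the weights directly instead of adding a penalty term to the gradient. A minimal sketch, assuming the decay is scaled by the learning rate (as in AdamW-style updates); the updates argument is a placeholder for whatever direction the optimizer produces, not Muon's actual update.

```python
def decoupled_step(weights, updates, lr, weight_decay=0.02):
    # Decay is applied to the weights themselves, separately from the update.
    return [w - lr * weight_decay * w - lr * u for w, u in zip(weights, updates)]

weights = [1.0, -2.0]
no_update = [0.0, 0.0]
decayed = decoupled_step(weights, no_update, lr=0.1)
```

With a zero update, each weight is simply multiplied by (1 - lr * weight_decay), here 0.998.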

Compression

zlib

level: null
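A sketch of the compression step using Python's zlib at its default level, since no level is recorded. The int8 payload below is synthetic placeholder data, not the model's weights.

```python
import zlib

# Placeholder int8 values in [-5, 5], standing in for quantized weights.
int8_values = [((i * 37) % 11) - 5 for i in range(4096)]

# Store each signed value as one byte (two's complement).
payload = bytes(v & 0xFF for v in int8_values)

compressed = zlib.compress(payload)  # default compression level
restored = zlib.decompress(compressed)

# Recover the signed values from the raw bytes.
decoded = [b - 256 if b > 127 else b for b in restored]
```

The round trip is lossless, so the only accuracy cost in the artifact comes from the quantization step, not from zlib.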

GELU pre-enrichment block before the transformer residual stream
2x encoder recurrence with RMS norm stabilization between passes
Encoder recurrence outperformed spending the same time budget on additional training steps
Sliding window evaluation with stride 64
Overtone embedding initialization