PR #184

closed

Record: Pre-Enrichment + Encoder Recurrence (val_bpb=1.1855)

by Idan3011
val_bpb: 1.1855
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.75 MB

Training Techniques

Architecture
pre-enrichment block
Two linear projections with GELU applied to embeddings before the transformer blocks to enrich representations.
parameters: {"layers":2,"dimensions":512}
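The PR only reports the hyperparameters ({"layers":2,"dimensions":512}), not the implementation. A minimal sketch of a two-layer GELU pre-enrichment block in plain Python (function and weight names are hypothetical, and the tanh GELU approximation is an assumption):

```python
import math

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def linear(vec, weight, bias):
    # weight is a list of rows (out_dim x in_dim)
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def pre_enrich(emb, w1, b1, w2, b2):
    # Two linear projections with a GELU in between, applied to the token
    # embedding before it enters the transformer blocks.
    h = [gelu(x) for x in linear(emb, w1, b1)]
    return linear(h, w2, b2)
```

With identity weights this reduces to a pointwise GELU followed by an identity map, which makes the block easy to sanity-check.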
depth recurrence
Encoder blocks are reused for a second pass with RMS norm stabilization between passes, increasing effective depth without adding parameters.
parameters: {"passes":2,"effective_layers":15,"physical_layers":10}
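The reported numbers (15 effective layers from 10 physical with 2 passes) suggest only a subset of the stack is reused; the sketch below reuses the whole stack for simplicity and is a hypothetical reconstruction, not the PR's code:

```python
import math

def rms_norm(vec, eps=1e-6):
    # RMS normalization: rescale the vector to unit root-mean-square.
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]

def recurrent_forward(x, blocks, passes=2):
    # Run the same stack of blocks multiple times, applying RMS norm
    # between passes for stabilization. Effective depth grows with
    # `passes` while the parameter count stays fixed.
    for p in range(passes):
        for block in blocks:
            x = block(x)
        if p < passes - 1:
            x = rms_norm(x)
    return x
```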
tied embeddings
Input and output embeddings are tied.
parameters: null
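With tied embeddings the output head is the transpose of the input embedding matrix, so each logit is a dot product between the final hidden state and a token's embedding row. A minimal sketch (names hypothetical):

```python
def logits_from_tied(hidden, embedding):
    # Tied embeddings: the output projection reuses the input embedding
    # matrix, so logit[v] = <hidden, embedding[v]> for each vocab entry v.
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding]
```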
Optimizer
Muon
weight_decay: 0.02
momentum: null
other_params: {"decoupled":true}
Quantization
int8
bits: 8
scope: all
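The PR reports int8 over all weights but not the quantization scheme; a common choice, sketched here as an assumption, is symmetric per-tensor quantization that maps the largest absolute weight to 127:

```python
def quantize_int8(weights):
    # Symmetric per-tensor int8 quantization: max |w| maps to 127.
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    # Recover approximate float weights from the int8 codes.
    return [x * scale for x in q]
```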
Evaluation
sliding window eval
parameters: {"stride":64}
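Sliding-window evaluation with stride 64 typically advances the context window by the stride and scores only the tokens not covered by the previous window, so every token is evaluated exactly once with left context. A sketch of the window bookkeeping (the scoring model itself is omitted; the exact scheme is an assumption):

```python
def sliding_windows(n_tokens, window, stride=64):
    # Yield (start, end, score_from) triples: each window is scored only
    # from `score_from`, i.e. on tokens not already evaluated, so the
    # losses over all windows cover every token exactly once.
    windows = []
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        windows.append((start, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```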
Initialization
overtone embedding init
A non-standard initialization scheme applied to the token embeddings.
LR Schedule
warmdown
parameters: {"warmdown_iters":2500,"warmup_steps":20}
Regularization
weight decay
parameters: {"value":0.02,"decoupled":true}
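"Decoupled" here means the decay is applied directly to the weights, separate from the gradient-based update (AdamW-style), rather than added to the loss. A one-line sketch of that step (hypothetical names, decay value from the PR):

```python
def decoupled_weight_decay(params, lr, wd=0.02):
    # Decoupled weight decay: shrink weights toward zero directly,
    # independent of the gradient step.
    return [p * (1.0 - lr * wd) for p in params]
```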
Compression
zlib
level: null
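The compression level is unreported (null), so the sketch below passes zlib's default level; the function names are hypothetical and the serialized bytes stand in for the 15.75 MB artifact:

```python
import zlib

def compress_artifact(raw: bytes, level: int = -1) -> bytes:
    # zlib-compress the serialized model bytes; level=-1 selects the
    # library default (the PR does not report a level).
    return zlib.compress(raw, level)

def decompress_artifact(packed: bytes) -> bytes:
    # Exact inverse: recovers the original bytes.
    return zlib.decompress(packed)
```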

Novel Contributions

  • GELU pre-enrichment block before the transformer residual stream
  • 2x encoder recurrence with RMS norm stabilization between passes
  • Demonstrated that encoder recurrence outperformed additional training steps under the same time budget
  • Sliding window evaluation with stride 64
  • Overtone embedding initialization