PR #1472

open

Add 1.2066 record: 8L Depth Recurrence by trhgbao

val_bpb
1.2066
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.73 MB

Training Techniques

Architecture
depth recurrence
8 physical layers are looped twice to form 16 logical layers.
parameters: {"physical_layers":8,"loops":2,"logical_layers":16}
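A minimal sketch of the weight-sharing pattern, assuming the simplest reading of the parameters: the same 8 physical layers are traversed twice, giving 16 logical depth steps. Each "layer" here is a placeholder residual matrix multiply, not the real attention + MLP block.

```python
import numpy as np

# Hypothetical sketch of depth recurrence: 8 physical layers reused over
# 2 loops = 16 logical layers. Real blocks are attention + MLP; only the
# weight-sharing pattern is illustrated here.
rng = np.random.default_rng(0)
d_model = 64
physical = [rng.normal(0, 0.02, (d_model, d_model)) for _ in range(8)]

def forward(x, loops=2):
    logical_steps = 0
    for _ in range(loops):
        for W in physical:      # the same 8 weight sets on each loop
            x = x + x @ W       # residual update
            logical_steps += 1
    return x, logical_steps

x0 = rng.normal(size=(1, d_model))
_, steps = forward(x0)
# steps == 16: two passes over the 8 shared layers
```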
LeakyReLU
Uses LeakyReLU(0.5)^2 instead of standard ReLU^2 to improve gradient flow for negative pre-activations.
parameters: {"negative_slope":0.5}
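A short sketch of the activation change: standard ReLU^2 zeroes every negative pre-activation (and its gradient), while LeakyReLU(0.5)^2 keeps a scaled-down but nonzero response there.

```python
import numpy as np

# Hypothetical sketch comparing ReLU^2 with the squared LeakyReLU(0.5).
def relu2(x):
    return np.maximum(x, 0.0) ** 2

def leaky_relu2(x, negative_slope=0.5):
    return np.where(x >= 0, x, negative_slope * x) ** 2

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
relu2(x)        # [0, 0, 0, 1, 4] -- negatives are dead
leaky_relu2(x)  # [1, 0.25, 0, 1, 4] -- negatives still carry gradient
```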
Partial RoPE
Applies rotary positional embeddings to only part of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
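A sketch of the partial rotation, under the stated parameters (16 rotated dimensions out of 64 per head); the frequency schedule and pair layout are assumptions, not the PR's actual implementation.

```python
import numpy as np

# Hypothetical sketch of partial RoPE: rotate only the first 16 of 64 head
# dimensions (8 complex pairs) and pass the remaining 48 through unchanged.
def partial_rope(x, pos, rot_dims=16, base=10000.0):
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)   # assumed frequency schedule
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:half], x[half:rot_dims]         # the two rotated halves
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos])
    return np.concatenate([rotated, x[rot_dims:]])  # last 48 dims untouched

x = np.arange(64, dtype=float)
y = partial_rope(x, pos=3)
# dims 16..63 are identical before and after the rotation
```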
untied embeddings
Input and output embeddings are untied for greater expressivity.
parameters: null
depth embeddings
Learned embeddings are added at each logical depth step to provide depth awareness.
parameters: {"logical_layers":16}
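A sketch of why depth embeddings matter here: with weight sharing, loop 1 and loop 2 would otherwise see identical layers, so a learned per-step vector (one per logical layer) tells the shared block which depth it is at. The embedding values and where they are added are assumptions.

```python
import numpy as np

# Hypothetical sketch of depth embeddings: one learned vector per logical
# depth step (16 total), added to the input of the shared block.
rng = np.random.default_rng(0)
d_model, logical_layers = 64, 16
depth_emb = rng.normal(0, 0.02, (logical_layers, d_model))  # learned in practice

def forward(x, shared_block):
    for step in range(logical_layers):
        x = shared_block(x + depth_emb[step])   # depth-aware input
    return x

out = forward(np.zeros(d_model), lambda h: 0.5 * h)
```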
XSA
Cross-layer state aggregation with zero-initialized skip connections across logical layers.
parameters: null
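A sketch of the zero-initialization idea: each logical layer can mix in earlier layer states, but the mixing weights start at zero, so training begins as a plain residual stack and the skips only take effect as the weights move. The aggregation form is an assumption.

```python
import numpy as np

# Hypothetical sketch of cross-layer state aggregation (XSA): a learned
# weight matrix mixes earlier layer states into each layer's input.
# Zero-init means the skips are a no-op at the start of training.
logical_layers = 4
skip_w = np.zeros((logical_layers, logical_layers))  # learned; zero at init

def forward(x, block):
    states = []
    for i in range(logical_layers):
        agg = sum(skip_w[i, j] * s for j, s in enumerate(states))
        x = block(x + agg)
        states.append(x)
    return x

# With zero-initialized weights the aggregation term vanishes, so this
# equals plain stacking: doubling 4 times gives 16x the input.
out = forward(np.ones(8), lambda h: 2.0 * h)
```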
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"muon_momentum_warmup_start":0.9,"muon_momentum_warmup_steps":1500}
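A sketch of the schedule implied by these parameters: momentum ramps from 0.9 to 0.95 over the first 1500 steps. Linear interpolation is an assumption; the PR may use a different ramp shape.

```python
# Hypothetical sketch of Muon momentum warmup: linear ramp from 0.9 to 0.95
# over the first 1500 steps, then held constant.
def muon_momentum(step, start=0.9, end=0.95, warmup_steps=1500):
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)

muon_momentum(0)      # 0.9 at the start
muon_momentum(750)    # 0.925 halfway through the ramp
muon_momentum(5000)   # 0.95, clamped after warmup
```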
Compression
zlib
level: null
Regularization
logit softcap
parameters: {"value":15}
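The standard tanh softcap with the stated value of 15, sketched below: small logits pass through nearly unchanged, while no logit can exceed ±15, which bounds the loss gradient.

```python
import numpy as np

# Logit softcapping: squash logits through tanh so |logit| <= cap.
def softcap(logits, cap=15.0):
    return cap * np.tanh(logits / cap)

softcap(np.array([1.0, 100.0]))  # small logits pass ~unchanged; large ones saturate near 15
```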
LR Schedule
warmdown
parameters: {"warmdown_iters":3500,"warmup_steps":20}
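A sketch of the schedule implied by these parameters: a short 20-step linear warmup, a constant plateau, then a linear "warmdown" to zero over the final 3500 iterations. `total_iters` is an assumed placeholder, not a value from the PR.

```python
# Hypothetical sketch of the warmup + warmdown LR multiplier.
def lr_mult(it, total_iters=6000, warmup_steps=20, warmdown_iters=3500):
    if it < warmup_steps:
        return (it + 1) / warmup_steps            # linear warmup
    if it > total_iters - warmdown_iters:
        return (total_iters - it) / warmdown_iters  # linear warmdown to 0
    return 1.0                                    # plateau

lr_mult(0)     # 0.05, first warmup step
lr_mult(1000)  # 1.0, on the plateau
lr_mult(6000)  # 0.0 at the end
```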
Quantization
int8
bits: 8
scope: all
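A sketch of the artifact pipeline the metadata describes: symmetric per-tensor int8 quantization followed by zlib compression of the raw bytes. The compression level and the per-tensor scaling scheme are assumptions (the record lists `level: null`).

```python
import zlib
import numpy as np

# Hypothetical sketch: symmetric per-tensor int8 quantization + zlib.
def pack(weights, level=9):                        # level is an assumption
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes(), level), scale

def unpack(blob, scale, shape):
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
    return q.reshape(shape).astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(256, 256)).astype(np.float32)
blob, scale = pack(w)
w2 = unpack(blob, scale, w.shape)
# round-trip error is bounded by half a quantization step (scale / 2)
```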

Novel Contributions

  • Depth recurrence with 8 physical layers looped into 16 logical layers
  • LeakyReLU(0.5)^2 activation to improve gradient flow
  • Partial RoPE on 1/4 of head dimensions
  • Untied input and output embeddings
  • Depth embeddings for logical layer awareness
  • Cross-layer state aggregation with zero-initialized skips
  • Muon optimization with hardware-aligned MLP multiplier
  • Int8 quantization plus zlib compression to fit the 16MB track