PR #507 (open)

Record: 1.1558 BPB — 11L U-Net + Catalytic + SwiGLU + SW64

by skarakulak
val_bpb: 1.1558
Architecture: 11-layer Transformer with gated U-Net skip connections
Optimizer: Muon + Adam
Artifact Size: 15.1 MB

Training Techniques

Architecture
gated U-Net skip connections
Sigmoid-gated blending between encoder and decoder layers
parameters: {"layers":11,"encoder_layers":5,"mid_layers":1,"decoder_layers":5}
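A minimal NumPy sketch of the sigmoid-gated blending described above (the encoder/decoder pairing and the gate's placement are assumptions; the function names are illustrative, not from the PR):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_skip(decoder_in, encoder_skip, gate_logit):
    """Sigmoid-gated blend of a decoder layer's input with the
    matching encoder layer's output (U-Net-style skip connection)."""
    g = sigmoid(gate_logit)  # per-dimension gate in (0, 1), learned
    return g * decoder_in + (1.0 - g) * encoder_skip

# With 5 encoder / 1 mid / 5 decoder layers, decoder layer i would
# plausibly pair with encoder layer (4 - i); gate_logit is learned.
d_model = 8
dec = np.ones(d_model)
enc = np.zeros(d_model)
out = gated_skip(dec, enc, gate_logit=np.zeros(d_model))  # gate = 0.5
```

With the gate logit at zero the blend is an even 50/50 mix, so training can start close to a plain residual and learn to favor either path per dimension.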
Catalytic residuals
Learned per-dimension gates on attention and MLP outputs, initialized to 1.0
parameters: null
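A sketch of how such a gated residual could look, assuming the gate multiplies the sublayer output before the residual add (class and argument names are hypothetical):

```python
import numpy as np

class CatalyticResidual:
    """Learned per-dimension gate on a sublayer (attention or MLP)
    output, initialized to 1.0 so training starts as a plain
    residual connection."""
    def __init__(self, d_model):
        self.gate = np.ones(d_model)  # learned parameter, init = 1.0

    def __call__(self, x, sublayer_out):
        return x + self.gate * sublayer_out

res = CatalyticResidual(d_model=4)
x = np.ones(4)
y = res(x, sublayer_out=np.full(4, 0.5))
```

Initializing the gate at 1.0 makes the block behave exactly like a standard residual at step 0, letting the optimizer shrink or grow each channel's contribution from there.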
SwiGLU MLP
Gated linear unit with SiLU activation and 3× expansion factor
parameters: {"expansion_factor":3}
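The SwiGLU block is standard; a NumPy sketch with the PR's 3× expansion (weight shapes and initialization here are illustrative):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))  # SiLU / swish activation

def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU MLP: SiLU-gated linear unit, hidden = 3 * d_model."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

d_model, hidden = 4, 12  # 3x expansion factor, as in the PR
rng = np.random.default_rng(0)
w_gate = rng.normal(size=(d_model, hidden)) * 0.02
w_up   = rng.normal(size=(d_model, hidden)) * 0.02
w_down = rng.normal(size=(hidden, d_model)) * 0.02
y = swiglu_mlp(np.ones(d_model), w_gate, w_up, w_down)
```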
Value residual (ResFormer)
Blend first-layer value vectors into all subsequent layers for better gradient flow
parameters: null
BigramHash
Bigram-conditioned token embeddings via hash-based lookup
parameters: {"buckets":4096,"embedding_dim":128}
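A sketch of hash-based bigram embeddings with the PR's 4096 buckets and 128 dims; the hash multipliers and the padding convention at position 0 are assumptions, not the PR's exact scheme:

```python
import numpy as np

BUCKETS, EMB_DIM = 4096, 128

def bigram_bucket(prev_tok, tok):
    """Hash a (previous, current) token pair into one of 4096
    buckets. The multipliers are illustrative primes."""
    return (prev_tok * 1000003 + tok * 998244353) % BUCKETS

bigram_table = np.zeros((BUCKETS, EMB_DIM))  # learned embedding table

def bigram_embed(tokens):
    """Look up one bigram embedding per position (position 0 is
    paired with a padding id of 0); typically added to the
    ordinary token embedding."""
    prev = [0] + list(tokens[:-1])
    idx = [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
    return bigram_table[idx]

emb = bigram_embed([5, 17, 17, 9])
```

Hashing keeps the table small (4096 × 128) regardless of vocabulary size, at the cost of occasional bucket collisions between unrelated bigrams.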
Partial RoPE
Rotary positional embeddings applied to 25% of head dimensions
parameters: {"percentage":25}
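Partial RoPE rotates only a prefix of each head's dimensions and passes the rest through unrotated. A sketch for a single 1-D query/key vector, assuming the rotated dims are the leading 25% (the PR doesn't say which slice is rotated):

```python
import numpy as np

def partial_rope(x, pos, pct=0.25, base=10000.0):
    """Apply rotary position embeddings to the first pct of the
    head dimensions; leave the remaining dims untouched."""
    d = x.shape[-1]
    d_rot = int(d * pct)
    d_rot -= d_rot % 2            # need an even count to rotate pairs
    half = d_rot // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:half], x[half:d_rot]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos])
    return np.concatenate([rotated, x[d_rot:]])

head_dim = 64
q = np.ones(head_dim)
q_rot = partial_rope(q, pos=3)   # only the first 16 dims rotate
```

Leaving 75% of dims position-free gives the model content-only channels alongside the position-sensitive ones.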
XSA
Cross-self attention applied to the last 4 layers, with gated attention
parameters: {"layers":4}
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer_idx+1)","applied_to":"RMSNorm inputs"}
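The parameters say the 1/sqrt(layer_idx+1) factor is applied to RMSNorm inputs, but RMSNorm is invariant to a uniform input scaling, so this sketch applies the factor to the normalized output instead; the PR's exact placement may differ:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x) + eps)

def scaled_rmsnorm(x, layer_idx):
    """Dampen each layer's normalized activations by
    1/sqrt(layer_idx + 1), so deeper layers contribute
    progressively smaller residual updates."""
    return rmsnorm(x) / np.sqrt(layer_idx + 1)

x = np.array([3.0, 4.0])
y0 = scaled_rmsnorm(x, layer_idx=0)  # full scale at the first layer
y3 = scaled_rmsnorm(x, layer_idx=3)  # half scale at the fourth layer
```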
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: null
Adam
weight_decay: null
momentum: null
other_params: {"applied_to":"scalar parameters"}
Quantization
mixed int5/int6
bits: null
scope: MLP and bigram weights at 5-bit, rest at 6-bit
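A sketch of mixed-precision symmetric quantization matching the split above (per-tensor absmax scaling is an assumption; the PR may use per-group scales or different rounding):

```python
import numpy as np

def quantize(w, bits):
    """Symmetric per-tensor quantization to a signed `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.linspace(-1, 1, 9).astype(np.float32)
q5, s5 = quantize(w, bits=5)   # MLP and bigram weights: 5-bit
q6, s6 = quantize(w, bits=6)   # everything else: 6-bit
```

The resulting integer tensors would then be bit-packed and compressed with zstd at level 22 (per the Compression entry) to reach the 15.1 MB artifact size.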
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":1024}
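Sliding-window evaluation scores each token with close to the full 1024 tokens of left context by advancing the window 64 tokens at a time and only scoring the newly-uncovered positions. A sketch of the window bookkeeping (the exact boundary handling is an assumption):

```python
def sliding_windows(n_tokens, seq_len=1024, stride=64):
    """Yield (window_start, score_from) pairs. The first window is
    scored in full; each later window is scored only on its last
    `stride` positions, which the previous window did not cover."""
    start = 0
    while start + seq_len <= n_tokens:
        score_from = 0 if start == 0 else seq_len - stride
        yield start, score_from
        start += stride

wins = list(sliding_windows(n_tokens=1024 + 128))
```

This costs one forward pass per 64 tokens instead of per 1024, trading roughly 16× more compute for a lower (more honest) val_bpb, since no token is scored with a truncated context.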
Weight Averaging
EMA
parameters: {"decay":0.9985}
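Standard EMA weight averaging with the PR's decay; the class shape here is illustrative:

```python
class EMA:
    """Exponential moving average of model weights (decay 0.9985 per
    step in the PR); the averaged copy, not the live weights, is
    used for evaluation."""
    def __init__(self, params, decay=0.9985):
        self.decay = decay
        self.shadow = list(params)

    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1 - d) * p
                       for s, p in zip(self.shadow, params)]

ema = EMA([0.0], decay=0.9)  # small decay so the toy example moves fast
for _ in range(3):
    ema.update([1.0])
```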
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
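A warmdown schedule holds the learning rate flat and decays it over the final steps. A sketch assuming a linear decay to zero over the last 3000 steps (the decay shape is an assumption; the PR only gives the step count):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3000):
    """Constant base_lr, then linear decay to 0 over the final
    warmdown_steps (trapezoidal / warmup-stable-decay style)."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```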
Other
other
Decoder learning rate multiplier of 2× applied for both Muon and Adam optimizers
parameters: {"multiplier":2}

Novel Contributions

  • Use of an 11-layer Transformer with gated U-Net skip connections that blend encoder and decoder layers
  • Introduction of Catalytic residuals with learned per-dimension gates on attention and MLP outputs
  • Application of SwiGLU MLP with 3× expansion factor
  • Value residual blending first-layer value vectors into all subsequent layers (ResFormer style)
  • Layerwise LN scale dampening with 1/sqrt(layer_idx+1) on RMSNorm inputs
  • Decoder learning rate multiplier of 2× for Muon and Adam optimizers
  • Mixed int5/int6 quantization combined with zstd-22 compression
  • Sliding window evaluation with stride 64 and sequence length 1024 for improved val_bpb
  • BigramHash embeddings with 4096 buckets and 128 dimensions
  • Partial RoPE applied to 25% of head dimensions
  • Cross-self attention (XSA) with gated attention on last 4 layers
  • Use of EMA with decay 0.9985 for weight averaging