PR #507 (open)

Record: 1.1558 BPB — 11L U-Net + Catalytic + SwiGLU + SW64

by skarakulak
val_bpb: 1.1558
Architecture: 11-layer Transformer with gated U-Net skip connections
Optimizer: Muon + Adam
Artifact Size: 15.1 MB

Training Techniques

Architecture
gated U-Net skip connections
Sigmoid-gated blending between encoder and decoder layers
parameters: {"layers":11,"encoder_layers":5,"mid_layers":1,"decoder_layers":5}
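A minimal NumPy sketch of the sigmoid-gated blending described above (the encoder/decoder pairing and the gate's placement are assumptions; the function names are illustrative, not from the PR):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_skip(decoder_in, encoder_skip, gate_logit):
    """Sigmoid-gated blend of a decoder layer's input with the
    matching encoder layer's output (U-Net-style skip connection)."""
    g = sigmoid(gate_logit)  # per-dimension gate in (0, 1), learned
    return g * decoder_in + (1.0 - g) * encoder_skip

# With 5 encoder / 1 mid / 5 decoder layers, decoder layer i would
# plausibly pair with encoder layer (4 - i); gate_logit is learned.
d_model = 8
dec = np.ones(d_model)
enc = np.zeros(d_model)
out = gated_skip(dec, enc, gate_logit=np.zeros(d_model))  # gate = 0.5
```

With the gate logit at zero the blend is an even 50/50 mix, so training can start close to a plain residual and learn to favor either path per dimension.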
Catalytic residuals
Learned per-dimension gates on attention and MLP outputs, initialized to 1.0
parameters: null
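A sketch of how such a gated residual could look, assuming the gate multiplies the sublayer output before the residual add (class and argument names are hypothetical):

```python
import numpy as np

class CatalyticResidual:
    """Learned per-dimension gate on a sublayer (attention or MLP)
    output, initialized to 1.0 so training starts as a plain
    residual connection."""
    def __init__(self, d_model):
        self.gate = np.ones(d_model)  # learned parameter, init = 1.0

    def __call__(self, x, sublayer_out):
        return x + self.gate * sublayer_out

res = CatalyticResidual(d_model=4)
x = np.ones(4)
y = res(x, sublayer_out=np.full(4, 0.5))
```

Initializing the gate at 1.0 makes the block behave exactly like a standard residual at step 0, letting the optimizer shrink or grow each channel's contribution from there.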
SwiGLU MLP
Gated linear unit with SiLU activation and 3× expansion factor
parameters: {"expansion_factor":3}
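The SwiGLU block is standard; a NumPy sketch with the PR's 3× expansion (weight shapes and initialization here are illustrative):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))  # SiLU / swish activation

def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU MLP: SiLU-gated linear unit, hidden = 3 * d_model."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

d_model, hidden = 4, 12  # 3x expansion factor, as in the PR
rng = np.random.default_rng(0)
w_gate = rng.normal(size=(d_model, hidden)) * 0.02
w_up   = rng.normal(size=(d_model, hidden)) * 0.02
w_down = rng.normal(size=(hidden, d_model)) * 0.02
y = swiglu_mlp(np.ones(d_model), w_gate, w_up, w_down)
```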
Value residual (ResFormer)
Blend first-layer value vectors into all subsequent layers for better gradient flow
parameters: null
BigramHash
Bigram-conditioned token embeddings via hash-based lookup
parameters: {"buckets":4096,"embedding_dim":128}
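A sketch of hash-based bigram embeddings with the PR's 4096 buckets and 128 dims; the hash multipliers and the padding convention at position 0 are assumptions, not the PR's exact scheme:

```python
import numpy as np

BUCKETS, EMB_DIM = 4096, 128

def bigram_bucket(prev_tok, tok):
    """Hash a (previous, current) token pair into one of 4096
    buckets. The multipliers are illustrative primes."""
    return (prev_tok * 1000003 + tok * 998244353) % BUCKETS

bigram_table = np.zeros((BUCKETS, EMB_DIM))  # learned embedding table

def bigram_embed(tokens):
    """Look up one bigram embedding per position (position 0 is
    paired with a padding id of 0); typically added to the
    ordinary token embedding."""
    prev = [0] + list(tokens[:-1])
    idx = [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
    return bigram_table[idx]

emb = bigram_embed([5, 17, 17, 9])
```

Hashing keeps the table small (4096 × 128) regardless of vocabulary size, at the cost of occasional bucket collisions between unrelated bigrams.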
Partial RoPE
Rotary positional embeddings applied to 25% of head dimensions
parameters: {"percentage":25}
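Partial RoPE rotates only a prefix of each head's dimensions and passes the rest through unrotated. A sketch for a single 1-D query/key vector, assuming the rotated dims are the leading 25% (the PR doesn't say which slice is rotated):

```python
import numpy as np

def partial_rope(x, pos, pct=0.25, base=10000.0):
    """Apply rotary position embeddings to the first pct of the
    head dimensions; leave the remaining dims untouched."""
    d = x.shape[-1]
    d_rot = int(d * pct)
    d_rot -= d_rot % 2            # need an even count to rotate pairs
    half = d_rot // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:half], x[half:d_rot]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos])
    return np.concatenate([rotated, x[d_rot:]])

head_dim = 64
q = np.ones(head_dim)
q_rot = partial_rope(q, pos=3)   # only the first 16 dims rotate
```

Leaving 75% of dims position-free gives the model content-only channels alongside the position-sensitive ones.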
XSA
Cross-self attention applied to the last 4 layers, with gated attention
parameters: {"layers":4}
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer_idx+1)","applied_to":"RMSNorm inputs"}
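The parameters say the 1/sqrt(layer_idx+1) factor is applied to RMSNorm inputs, but RMSNorm is invariant to a uniform input scaling, so this sketch applies the factor to the normalized output instead; the PR's exact placement may differ:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x) + eps)

def scaled_rmsnorm(x, layer_idx):
    """Dampen each layer's normalized activations by
    1/sqrt(layer_idx + 1), so deeper layers contribute
    progressively smaller residual updates."""
    return rmsnorm(x) / np.sqrt(layer_idx + 1)

x = np.array([3.0, 4.0])
y0 = scaled_rmsnorm(x, layer_idx=0)  # full scale at the first layer
y3 = scaled_rmsnorm(x, layer_idx=3)  # half scale at the fourth layer
```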
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: null
Adam
weight_decay: null
momentum: null
other_params: {"applied_to":"scalar parameters"}
Quantization
mixed int5/int6
bits: null
scope: MLP and bigram weights at 5-bit, rest at 6-bit
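A sketch of mixed-precision symmetric quantization matching the split above (per-tensor absmax scaling is an assumption; the PR may use per-group scales or different rounding):

```python
import numpy as np

def quantize(w, bits):
    """Symmetric per-tensor quantization to a signed `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.linspace(-1, 1, 9).astype(np.float32)
q5, s5 = quantize(w, bits=5)   # MLP and bigram weights: 5-bit
q6, s6 = quantize(w, bits=6)   # everything else: 6-bit
```

The resulting integer tensors would then be bit-packed and compressed with zstd at level 22 (per the Compression entry) to reach the 15.1 MB artifact size.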
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":1024}
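Sliding-window evaluation scores each token with close to the full 1024 tokens of left context by advancing the window 64 tokens at a time and only scoring the newly-uncovered positions. A sketch of the window bookkeeping (the exact boundary handling is an assumption):

```python
def sliding_windows(n_tokens, seq_len=1024, stride=64):
    """Yield (window_start, score_from) pairs. The first window is
    scored in full; each later window is scored only on its last
    `stride` positions, which the previous window did not cover."""
    start = 0
    while start + seq_len <= n_tokens:
        score_from = 0 if start == 0 else seq_len - stride
        yield start, score_from
        start += stride

wins = list(sliding_windows(n_tokens=1024 + 128))
```

This costs one forward pass per 64 tokens instead of per 1024, trading roughly 16× more compute for a lower (more honest) val_bpb, since no token is scored with a truncated context.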
Weight Averaging
EMA
parameters: {"decay":0.9985}
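Standard EMA weight averaging with the PR's decay; the class shape here is illustrative:

```python
class EMA:
    """Exponential moving average of model weights (decay 0.9985 per
    step in the PR); the averaged copy, not the live weights, is
    used for evaluation."""
    def __init__(self, params, decay=0.9985):
        self.decay = decay
        self.shadow = list(params)

    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1 - d) * p
                       for s, p in zip(self.shadow, params)]

ema = EMA([0.0], decay=0.9)  # small decay so the toy example moves fast
for _ in range(3):
    ema.update([1.0])
```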
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
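A warmdown schedule holds the learning rate flat and decays it over the final steps. A sketch assuming a linear decay to zero over the last 3000 steps (the decay shape is an assumption; the PR only gives the step count):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3000):
    """Constant base_lr, then linear decay to 0 over the final
    warmdown_steps (trapezoidal / warmup-stable-decay style)."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```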
Other
other
Decoder learning rate multiplier of 2× applied for both Muon and Adam optimizers
parameters: {"multiplier":2}

Novel Contributions

  • Use of an 11-layer Transformer with gated U-Net skip connections that blend encoder and decoder layers
  • Introduction of Catalytic residuals with learned per-dimension gates on attention and MLP outputs
  • Application of SwiGLU MLP with 3× expansion factor
  • Value residual blending first-layer value vectors into all subsequent layers (ResFormer style)
  • Layerwise LN scale dampening with 1/sqrt(layer_idx+1) on RMSNorm inputs
  • Decoder learning rate multiplier of 2× for Muon and Adam optimizers
  • Mixed int5/int6 quantization combined with zstd-22 compression
  • Sliding window evaluation with stride 64 and sequence length 1024 for improved val_bpb
  • BigramHash embeddings with 4096 buckets and 128 dimensions
  • Partial RoPE applied to 25% of head dimensions
  • Cross-self attention (XSA) with gated attention on last 4 layers
  • Use of EMA with decay 0.9985 for weight averaging