PR #1527

open

Record: 6L depth minimalism U-Net sliding window - val_bpb 1.2025

by alphastar1111
val_bpb: 1.2026
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.84 MB

Training Techniques

Architecture
U-Net skip connections
3 encoder + 3 decoder U-Net-style transformer with learned skip weights
parameters: {"encoders":3,"decoders":3,"layers":6}
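A minimal sketch of how the 3-encoder / 3-decoder wiring with learned skip weights might look. Block internals are stubbed out, and all names (`block`, `skip_w`, `unet_forward`) are illustrative placeholders, not taken from this PR:

```python
def block(x):
    # stand-in for a transformer block (attention + MLP); here a dummy op
    return [v + 1.0 for v in x]

def unet_forward(x, skip_w):
    # skip_w: one learned scalar per decoder layer (trained in the real model)
    skips = []
    for _ in range(3):           # encoder half: save each block output
        x = block(x)
        skips.append(x)
    for i in range(3):           # decoder half: mix in the matching encoder
        skip = skips.pop()       # output, scaled by its learned weight
        x = [a + skip_w[i] * b for a, b in zip(x, skip)]
        x = block(x)
    return x
```

The LIFO pairing (last encoder output feeds the first decoder layer) is the standard U-Net convention; the actual pairing used here could differ.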
weight tying
Untied input and output embeddings
parameters: null
MLP3x
Expanded MLP hidden size to 3x the model dimension
parameters: {"multiplier":3}
LeakyReLU
Uses LeakyReLU^2 activation with slope 0.5
parameters: {"slope":0.5}
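One plausible reading of "LeakyReLU^2 with slope 0.5" is to apply a LeakyReLU and then square the result (the squared-ReLU family common in recent speedrun records); the exact formulation here is an assumption:

```python
def leaky_relu_sq(x, slope=0.5):
    # assumed form: LeakyReLU first, then square the output
    y = x if x >= 0.0 else slope * x
    return y * y
```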
KV head count
Full attention with 4 attention heads and 4 KV heads, no GQA
parameters: {"heads":4,"kv_heads":4}
residual mixing
Learnable residual mixing with input embedding x0
parameters: null
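A sketch of what mixing the residual stream with the original token embedding x0 could look like, assuming a learned scalar mixing coefficient per block (the coefficient name and the convex-combination form are assumptions):

```python
def mix_residual(x, x0, lam):
    # lam would be a trained per-block parameter; here a plain float
    return [(1.0 - lam) * a + lam * b for a, b in zip(x, x0)]
```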
Regularization
logit softcap
parameters: {"value":12}
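Logit softcapping is usually implemented as a tanh squash that bounds logits smoothly in (-cap, cap); a sketch with the PR's cap of 12:

```python
import math

def softcap(logit, cap=12.0):
    # smooth bound: near-identity for small logits, saturates at +/- cap
    return cap * math.tanh(logit / cap)
```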
Evaluation
sliding window eval
parameters: {"stride":64}
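Sliding-window evaluation with a short stride typically scores only the final `stride` tokens of each window, so every token gets close to a full window of left context. A sketch of the window plan (the exact scoring split used in this PR is an assumption):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    # returns (context_start, scored_start, end) spans; each token is
    # scored exactly once, later tokens with up to window-stride context
    spans = [(0, 0, min(window, n_tokens))]
    end = min(window, n_tokens)
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((max(0, new_end - window), end, new_end))
        end = new_end
    return spans
```

With stride 64 this costs roughly window/stride = 32 forward passes per window-length of text, which is the usual accuracy/compute trade-off of this technique.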
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Compression
zlib
level: null
Quantization
int8
bits: 8
scope: all
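The int8-plus-zlib artifact packing could look like the following round-trip: symmetric per-tensor quantization, scale stored alongside, raw bytes deflated. The layout and the compression level are assumptions (the record leaves the zlib level unspecified):

```python
import struct
import zlib

def pack_tensor(values):
    # symmetric per-tensor int8 quantization, then zlib on the raw bytes
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    q = bytes(round(v / scale) & 0xFF for v in values)
    return struct.pack("<f", scale) + zlib.compress(q)

def unpack_tensor(blob, n):
    # recover float weights: decompress, re-sign the bytes, rescale
    scale = struct.unpack("<f", blob[:4])[0]
    raw = zlib.decompress(blob[4:])
    return [(b - 256 if b >= 128 else b) * scale for b in raw[:n]]
```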
LR Schedule
warmdown
parameters: {"warmdown_steps":250}
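A warmdown schedule in this style is typically a constant learning rate followed by a linear decay to zero over the final steps; a sketch with the PR's 250 warmdown steps (the constant-then-linear shape is an assumption):

```python
def lr_scale(step, total_steps, warmdown_steps=250):
    # multiplier applied to the base LR at a given optimizer step
    if step < total_steps - warmdown_steps:
        return 1.0
    return (total_steps - step) / warmdown_steps
```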

Novel Contributions

  • 6-layer shallow transformer that matches the 9-layer baseline with fewer parameters
  • U-Net-style 3 encoder + 3 decoder architecture with learned skip weights
  • Full attention with 4 heads and 4 KV heads instead of GQA
  • Untied embeddings to trade parameters for capacity
  • Half-batch training with gradient accumulation for more optimizer steps per token
  • Tight logit softcap regularization
  • Long-context training and evaluation at sequence length 2048
  • Sliding window evaluation with stride 64
  • Int8 quantization with zlib compression to fit under 16MB
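The half-batch training from the contribution list presumably uses a standard gradient-accumulation loop; a generic sketch (the accumulation factor and where the optimizer step lands are assumptions):

```python
def train_steps(micro_grads, accum=2):
    # sum gradients over `accum` micro-batches, then take one optimizer
    # step on the average; returns how many steps were taken
    steps, grad = 0, 0.0
    for i, g in enumerate(micro_grads, 1):
        grad += g
        if i % accum == 0:
            # optimizer.step() on grad / accum would go here
            steps, grad = steps + 1, 0.0
    return steps
```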