PR #1527

open

Record: 6L depth minimalism U-Net sliding window - val_bpb 1.2025

by alphastar1111
val_bpb: 1.2026
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.84 MB

Training Techniques

Architecture
U-Net skip connections
3 encoder + 3 decoder U-Net-style transformer with learned skip weights
parameters: {"encoders":3,"decoders":3,"layers":6}
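A minimal sketch of how the 3-encoder / 3-decoder wiring with learned skip weights might look. Block internals are stubbed out, and all names (`block`, `skip_w`, `unet_forward`) are illustrative placeholders, not taken from this PR:

```python
def block(x):
    # stand-in for a transformer block (attention + MLP); here a dummy op
    return [v + 1.0 for v in x]

def unet_forward(x, skip_w):
    # skip_w: one learned scalar per decoder layer (trained in the real model)
    skips = []
    for _ in range(3):           # encoder half: save each block output
        x = block(x)
        skips.append(x)
    for i in range(3):           # decoder half: mix in the matching encoder
        skip = skips.pop()       # output, scaled by its learned weight
        x = [a + skip_w[i] * b for a, b in zip(x, skip)]
        x = block(x)
    return x
```

The LIFO pairing (last encoder output feeds the first decoder layer) is the standard U-Net convention; the actual pairing used here could differ.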
weight tying
Untied input and output embeddings
parameters: null
MLP3x
Expanded MLP hidden size to 3x the model dimension
parameters: {"multiplier":3}
LeakyReLU
Uses LeakyReLU^2 activation with slope 0.5
parameters: {"slope":0.5}
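One plausible reading of "LeakyReLU^2 with slope 0.5" is to apply a LeakyReLU and then square the result (the squared-ReLU family common in recent speedrun records); the exact formulation here is an assumption:

```python
def leaky_relu_sq(x, slope=0.5):
    # assumed form: LeakyReLU first, then square the output
    y = x if x >= 0.0 else slope * x
    return y * y
```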
KV head count
Full attention with 4 attention heads and 4 KV heads, no GQA
parameters: {"heads":4,"kv_heads":4}
residual mixing
Learnable residual mixing with input embedding x0
parameters: null
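A sketch of what mixing the residual stream with the original token embedding x0 could look like, assuming a learned scalar mixing coefficient per block (the coefficient name and the convex-combination form are assumptions):

```python
def mix_residual(x, x0, lam):
    # lam would be a trained per-block parameter; here a plain float
    return [(1.0 - lam) * a + lam * b for a, b in zip(x, x0)]
```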
Regularization
logit softcap
parameters: {"value":12}
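Logit softcapping is usually implemented as a tanh squash that bounds logits smoothly in (-cap, cap); a sketch with the PR's cap of 12:

```python
import math

def softcap(logit, cap=12.0):
    # smooth bound: near-identity for small logits, saturates at +/- cap
    return cap * math.tanh(logit / cap)
```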
Evaluation
sliding window eval
parameters: {"stride":64}
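Sliding-window evaluation with a short stride typically scores only the final `stride` tokens of each window, so every token gets close to a full window of left context. A sketch of the window plan (the exact scoring split used in this PR is an assumption):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    # returns (context_start, scored_start, end) spans; each token is
    # scored exactly once, later tokens with up to window-stride context
    spans = [(0, 0, min(window, n_tokens))]
    end = min(window, n_tokens)
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((max(0, new_end - window), end, new_end))
        end = new_end
    return spans
```

With stride 64 this costs roughly window/stride = 32 forward passes per window-length of text, which is the usual accuracy/compute trade-off of this technique.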
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Compression
zlib
level: null
Quantization
int8
bits: 8
scope: all
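The int8-plus-zlib artifact packing could look like the following round-trip: symmetric per-tensor quantization, scale stored alongside, raw bytes deflated. The layout and the compression level are assumptions (the record leaves the zlib level unspecified):

```python
import struct
import zlib

def pack_tensor(values):
    # symmetric per-tensor int8 quantization, then zlib on the raw bytes
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    q = bytes(round(v / scale) & 0xFF for v in values)
    return struct.pack("<f", scale) + zlib.compress(q)

def unpack_tensor(blob, n):
    # recover float weights: decompress, re-sign the bytes, rescale
    scale = struct.unpack("<f", blob[:4])[0]
    raw = zlib.decompress(blob[4:])
    return [(b - 256 if b >= 128 else b) * scale for b in raw[:n]]
```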
LR Schedule
warmdown
parameters: {"warmdown_steps":250}
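A warmdown schedule in this style is typically a constant learning rate followed by a linear decay to zero over the final steps; a sketch with the PR's 250 warmdown steps (the constant-then-linear shape is an assumption):

```python
def lr_scale(step, total_steps, warmdown_steps=250):
    # multiplier applied to the base LR at a given optimizer step
    if step < total_steps - warmdown_steps:
        return 1.0
    return (total_steps - step) / warmdown_steps
```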

Novel Contributions

  • 6-layer shallow transformer that matches the 9-layer baseline with fewer parameters
  • U-Net-style 3 encoder + 3 decoder architecture with learned skip weights
  • Full attention with 4 heads and 4 KV heads instead of GQA
  • Untied embeddings to trade parameters for capacity
  • Half-batch training with gradient accumulation for more optimizer steps per token
  • Tight logit softcap regularization
  • Long-context training and evaluation at sequence length 2048
  • Sliding window evaluation with stride 64
  • Int8 quantization with zlib compression to fit under 16MB
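The half-batch training from the contribution list presumably uses a standard gradient-accumulation loop; a generic sketch (the accumulation factor and where the optimizer step lands are assumptions):

```python
def train_steps(micro_grads, accum=2):
    # sum gradients over `accum` micro-batches, then take one optimizer
    # step on the average; returns how many steps were taken
    steps, grad = 0, 0.0
    for i, g in enumerate(micro_grads, 1):
        grad += g
        if i % accum == 0:
            # optimizer.step() on grad / accum would go here
            steps, grad = steps + 1, 0.0
    return steps
```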