val_bpb: 1.1130
Architecture: Transformer
Optimizer: —
Artifact Size: 15,998,200 bytes
Training Techniques
Architecture
U-Net skip connections
Symmetric skip connections between encoder and decoder blocks in an 11-layer U-Net Transformer.
parameters: {"layers":11,"skip_pairs":["0->5","1->6","2->7","3->8","4->9"]}
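The skip wiring above can be sketched as a forward pass that caches each encoder block's output and adds it to the input of its paired decoder block. The blocks here are stand-in identity callables (an assumption for illustration; the real blocks are attention + MLP layers):

```python
import numpy as np

def unet_transformer_forward(x, blocks, skip_pairs):
    """Run the block stack, adding each encoder block's output to the
    input of its paired decoder block (symmetric U-Net skips)."""
    skip_src = {dst: src for src, dst in skip_pairs}
    saved = {}
    for i, block in enumerate(blocks):
        if i in skip_src:
            x = x + saved[skip_src[i]]  # symmetric skip connection
        x = block(x)
        if any(src == i for src, _ in skip_pairs):
            saved[i] = x  # cache encoder output for its decoder pair
    return x

# Toy usage: 11 identity blocks, skip pairs from the parameters above.
pairs = [(0, 5), (1, 6), (2, 7), (3, 8), (4, 9)]
blocks = [lambda t: t for _ in range(11)]
out = unet_transformer_forward(np.ones((2, 4)), blocks, pairs)
```

With identity blocks the skips compound additively, which makes the routing easy to verify by hand.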
LeakyReLU
Uses LeakyReLU(0.5)^2 instead of standard ReLU^2 to avoid dead neurons and improve gradient flow.
parameters: {"slope":0.5}
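Read literally, the activation squares a LeakyReLU with slope 0.5, so negative inputs produce a scaled squared response rather than a dead zero. A minimal sketch:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU(slope)^2: like ReLU^2, but negative inputs keep a
    (slope * x)^2 response instead of a dead zero, so gradients flow."""
    y = np.where(x >= 0, x, slope * x)
    return y * y
```

Note that squaring makes the negative branch non-negative; this matches the stated form, not a sign-preserving variant.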
XSA
Exclusive Self Attention applied in the last 4 layers to subtract attention components aligned with token embeddings.
parameters: {"layers":4}
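The card does not spell out the XSA math, but "subtract attention components aligned with token embeddings" suggests projecting each position's attention output onto its token embedding and removing that component. A hypothetical sketch of that reading:

```python
import numpy as np

def exclusive_attention_output(attn_out, tok_emb, eps=1e-8):
    """Hypothetical XSA sketch: remove from each position's attention
    output the component parallel to that position's token embedding,
    keeping only the 'exclusive' (orthogonal) part."""
    coef = (attn_out * tok_emb).sum(-1, keepdims=True)
    norm = (tok_emb * tok_emb).sum(-1, keepdims=True) + eps
    return attn_out - (coef / norm) * tok_emb
```

The output is orthogonal to the token embedding at every position, so the residual stream stops re-amplifying the token's own direction.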
Partial RoPE
Applies RoPE only to the first 16 dimensions of query/key heads, leaving the remaining dimensions position-free.
parameters: {"rope_dims":16,"total_dims":64}
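Partial RoPE rotates only the first 16 of the 64 head dimensions; the rest pass through untouched. A sketch using the usual half-split rotation and the standard base-10000 frequency schedule (assumed, not stated in the card):

```python
import numpy as np

def partial_rope(q, pos, rope_dims=16, base=10000.0):
    """Apply rotary position embedding to the first rope_dims of each
    head vector; remaining dims stay position-free."""
    half = rope_dims // 2
    freqs = base ** (-np.arange(half) / half)   # (half,)
    ang = pos[:, None] * freqs[None, :]         # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = q[..., :half], q[..., half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, q[..., rope_dims:]], axis=-1)
```

The rotation is norm-preserving on the rotated slice, and dimensions 16–63 are returned unchanged.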
VE128
Injects shared 128-dimensional value embeddings into the final blocks to stabilize logit projections.
parameters: {"dimensions":128,"blocks":[9,10]}
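One plausible reading of VE128 is a shared 128-dim embedding table indexed by token id, projected up to the model width and added to the output of blocks 9 and 10. The projection matrix here is an assumption for illustration:

```python
import numpy as np

def inject_value_embeddings(block_out, token_ids, value_emb, proj):
    """Hypothetical VE128 sketch: add a projection of a shared 128-dim
    per-token value embedding to a late block's output."""
    ve = value_emb[token_ids]      # (T, 128) shared table lookup
    return block_out + ve @ proj   # proj: (128, d_model)
```

Because the table is shared across the final blocks, it adds few parameters while giving the logit projection a stable per-token signal.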
Regularization
layerwise LN scale
Scales each layer's LayerNorm output by 1/sqrt(layer+1) to damp activations in deeper layers.
parameters: {"scale":"1/sqrt(layer+1)"}
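A minimal sketch of the scaled LayerNorm, with the 1/sqrt(layer+1) factor applied after normalization (in practice it could equivalently be folded into the LN gain):

```python
import numpy as np

def layer_norm_scaled(x, layer, eps=1e-5):
    """LayerNorm whose output is scaled by 1/sqrt(layer+1),
    damping activations in deeper layers."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) / np.sqrt(layer + 1)
```

Layer 0 is unscaled; layer 3 is exactly half the layer-0 magnitude.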
magnitude pruning
Zeroes the 3% of weights with the smallest absolute values before compression.
parameters: {"prune_fraction":0.03}
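Magnitude pruning at this fraction can be sketched as a threshold at the 3rd percentile of absolute weight values:

```python
import numpy as np

def magnitude_prune(w, prune_fraction=0.03):
    """Zero the smallest prune_fraction of weights by |value|."""
    k = int(round(prune_fraction * w.size))
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(w) <= thresh] = 0.0
    return out
```

The resulting zeros compress well, which is presumably why pruning precedes the quantization steps below.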
Weight Averaging
EMA + SWA
Maintains an exponential moving average of the weights (decay 0.997) and, over the second half of training, collects a stochastic-weight-averaging snapshot every 50 steps.
parameters: {"ema_decay":0.997,"swa_interval":50,"swa_start_fraction":0.5}
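The two averages can be tracked side by side; a sketch over a scalar "weight" (real weights would be tensors, updated elementwise the same way):

```python
class WeightAverager:
    """EMA + SWA sketch: EMA updates every step; SWA snapshots every
    swa_interval steps once swa_start_fraction of training has passed."""
    def __init__(self, w0, total_steps, ema_decay=0.997,
                 swa_interval=50, swa_start_fraction=0.5):
        self.ema = w0
        self.decay = ema_decay
        self.swa_sum, self.swa_n = 0.0, 0
        self.interval = swa_interval
        self.start = int(total_steps * swa_start_fraction)

    def update(self, step, w):
        self.ema = self.decay * self.ema + (1 - self.decay) * w
        if step >= self.start and step % self.interval == 0:
            self.swa_sum += w
            self.swa_n += 1

    @property
    def swa(self):
        return self.swa_sum / max(self.swa_n, 1)

wa = WeightAverager(1.0, total_steps=200)
for step in range(200):
    wa.update(step, 1.0)
```

Whether the final artifact uses the EMA, the SWA mean, or a blend of the two is not stated in the card.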
Quantization
STE QAT
bits: 6
scope: mixed; MLP int5, attention int6
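The forward half of STE QAT is a fake-quantize: round weights to an int grid, dequantize, and let the backward pass (not shown) copy gradients straight through the rounding. A symmetric per-tensor sketch:

```python
import numpy as np

def fake_quant(w, bits):
    """Symmetric fake quantization for STE QAT: forward rounds to a
    2^bits-level grid; backward would be the identity (straight-through)."""
    qmax = 2 ** (bits - 1) - 1
    amax = np.abs(w).max()
    scale = amax / qmax if amax > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

# Per the card's mixed scope: bits=5 for MLP weights, bits=6 for attention.
```

Training through the fake-quantized forward pass lets the network adapt to the int5/int6 grid before the weights are frozen.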
GPTQ-lite
bits: 6
scope: per-row
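Per-row quantization gives each weight-matrix row its own scale from that row's absolute maximum. The sketch below shows only this per-row rounding; full GPTQ would also compensate rounding error with Hessian information, which the "lite" variant here is assumed to omit:

```python
import numpy as np

def quantize_per_row(W, bits=6):
    """Per-row symmetric quantization: one scale per row from its
    absolute maximum; returns int codes and the per-row scales."""
    qmax = 2 ** (bits - 1) - 1
    scales = np.abs(W).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0
    q = np.clip(np.round(W / scales), -qmax - 1, qmax)
    return q.astype(np.int8), scales

def dequantize(q, scales):
    return q * scales
```

Per-row scales keep the quantization error of each row bounded by half its own scale, rather than by the worst row in the tensor.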
Test-Time Training
full TTT
Updates all model parameters with SGD over a sliding 32,768-token window at inference time.
parameters: {"window_size":32768,"optimizer":"SGD"}
Novel Contributions
- 11-layer U-Net Transformer with symmetric skip connections
- LeakyReLU(0.5)^2 activation
- Exclusive Self Attention in the final 4 layers
- Partial RoPE applied to only the first 16 dimensions
- Layerwise LN scaling by 1/sqrt(layer+1)
- VE128 value embeddings in the last blocks
- Mixed int5/int6 quantization with late STE QAT
- EMA combined with SWA
- Test-time training over 32K-token windows
- GPTQ-lite per-row quantization
- Magnitude pruning before compression