PR #923 (open)

[Notable Non-Record Submission] 1.1090 BPB - 74.3M Ternary U-Net Transformer (100k steps/3h)

by CiprianFlorin-Ifrim
val_bpb
1.1090
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.95 MB

Training Techniques

Architecture
SmearGate
Learnable per-block gating for residual smoothing.
parameters: null
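SmearGate does not appear to be a standard published module; as a hedged sketch, a learnable per-block scalar could gate how much of each block's output enters the residual stream (the gate granularity and sigmoid parameterization here are assumptions):

```python
import numpy as np

def smear_gate(residual, block_out, alpha):
    """Hypothetical per-block residual gate: a learnable scalar `alpha`
    (sigmoid-squashed into (0, 1)) smooths how strongly the block's
    output is mixed into the residual stream."""
    g = 1.0 / (1.0 + np.exp(-alpha))
    return residual + g * block_out
```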
U-Net skip connections
Long skip connections between mirrored early and late Transformer blocks, U-Net style.
parameters: null
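A minimal sketch of the long-skip pattern: activations from the first half of the block stack are saved and added back into the mirrored second half (blocks are stand-in callables; the real model's combine rule may differ):

```python
def unet_transformer(x, encoder_blocks, decoder_blocks):
    """U-Net-style Transformer: save each early block's activation,
    then add it back into the mirrored late block (last saved, first
    reused)."""
    skips = []
    for block in encoder_blocks:
        x = block(x)
        skips.append(x)
    for block in decoder_blocks:
        x = block(x + skips.pop())   # mirrored long skip connection
    return x
```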
weight tying
Input and output embedding matrices share weights.
parameters: null
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
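With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads. A numpy sketch of the head mapping (shapes and softmax only; masking and projections omitted):

```python
import numpy as np

def gqa(q, k, v):
    """Grouped-query attention: q has more heads than k/v, and each KV
    head is shared by heads // kv_heads consecutive query heads.
    Shapes: q is (heads, T, d); k and v are (kv_heads, T, d)."""
    heads, T, d = q.shape
    kv_heads = k.shape[0]
    group = heads // kv_heads            # 8 // 4 = 2 query heads per KV head
    out = np.empty_like(q)
    for h in range(heads):
        kh = h // group                  # map query head -> shared KV head
        s = q[h] @ k[kh].T / np.sqrt(d)
        s = np.exp(s - s.max(-1, keepdims=True))
        out[h] = (s / s.sum(-1, keepdims=True)) @ v[kh]
    return out
```

Halving the KV heads shrinks the KV projection weights, which matters under a 16MB artifact budget.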
ReLU²
Squared ReLU activation in the MLP.
parameters: null
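Squared ReLU is simply:

```python
import numpy as np

def relu2(x):
    """Squared ReLU: zero for negative inputs, x**2 for positive."""
    return np.square(np.maximum(x, 0.0))
```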
Optimizer
Muon
weight_decay: 0
momentum: 0.95
other_params: {"adam_lr":0.05,"adam_wd":0.05,"matrix_lr":0.04,"scalar_lr":0.02,"tied_embed_lr":0.02}
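Muon updates 2-D weight matrices with momentum followed by approximate orthogonalization via a quintic Newton-Schulz iteration; a sketch of that core step (coefficients and step count follow the public Muon reference implementation; the momentum and learning-rate plumbing around it is omitted):

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Quintic Newton-Schulz iteration used by Muon to approximately
    orthogonalize a 2-D update matrix, pushing its singular values
    toward 1 without an explicit SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # normalize so singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                          # keep the Gram matrix small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```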
Weight Averaging
EMA
parameters: null
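An EMA of the weights is kept alongside training and used for evaluation; the submission records no decay value, so the 0.999 below is an assumption:

```python
import numpy as np

class EMA:
    """Exponential moving average of model weights for evaluation.
    The decay of 0.999 is an assumption; the submission leaves it
    unrecorded."""
    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = [p.copy() for p in params]

    def update(self, params):
        for s, p in zip(self.shadow, params):
            s *= self.decay
            s += (1.0 - self.decay) * p
```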
Evaluation
sliding window eval
parameters: {"stride":16,"temperature":0.9}
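With stride 16, after the first full window only the newest 16 tokens are scored at each window position, so every token keeps near-maximal left context. A sketch with a stand-in scoring function (`score_fn` is hypothetical, and the logged temperature of 0.9 is not modeled here):

```python
import numpy as np

def sliding_window_nll(score_fn, tokens, window=2048, stride=16):
    """Sliding-window evaluation: score the first window in full, then
    advance by `stride`, scoring only each window's final `stride`
    tokens. `score_fn(ctx, n)` stands in for the model and returns
    per-token NLLs of ctx's final n tokens."""
    nlls = list(score_fn(tokens[:window], min(window, len(tokens))))
    for start in range(window, len(tokens), stride):
        n = min(stride, len(tokens) - start)
        ctx = tokens[max(0, start + n - window):start + n]
        nlls.extend(score_fn(ctx, n))
    return float(np.mean(nlls))
```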
Sequence Length
sequence_length
train_length: 1024
eval_length: 2048
LR Schedule
warmdown
parameters: {"fraction":0.15}
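A warmdown fraction of 0.15 means the learning rate is held constant, then decayed linearly to zero over the final 15% of steps; as a sketch:

```python
def warmdown_lr(step, total_steps, base_lr, fraction=0.15):
    """Trapezoidal 'warmdown' schedule: hold base_lr, then decay
    linearly to zero over the final `fraction` of training."""
    decay_steps = int(total_steps * fraction)
    hold = total_steps - decay_steps
    if step < hold:
        return base_lr
    return base_lr * (total_steps - step) / decay_steps
```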
Regularization
logit softcap
parameters: {"value":10}
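A logit softcap of 10 bounds logits smoothly via tanh:

```python
import numpy as np

def softcap(logits, cap=10.0):
    """Soft-cap: cap * tanh(logits / cap) keeps every logit inside
    (-cap, cap) while staying near-identity for small values."""
    return cap * np.tanh(np.asarray(logits) / cap)
```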
Quantization
QAT
bits: null
scope: all
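The submission logs scope=all with bits unrecorded; given the ternary title, a hedged sketch of the forward quantizer (the absmean threshold/scale recipe is an assumption, not necessarily the submission's exact scheme; in QAT the backward pass would pass gradients straight through this step):

```python
import numpy as np

def ternarize(w):
    """Forward pass of ternary quantization: map weights to
    {-1, 0, +1} * scale. Threshold and scale follow the common
    absmean recipe (an assumption)."""
    scale = np.abs(w).mean()
    codes = np.where(np.abs(w) > 0.5 * scale, np.sign(w), 0.0)
    return codes * scale, codes
```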
Compression
lzma
level: null
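The 15.95 MB artifact fits the 16MB budget partly because LZMA compresses the ternary codes well; a sketch of packing and round-tripping them (the level is unrecorded, so preset 9 is an assumption):

```python
import lzma

import numpy as np

def pack_ternary(codes):
    """Shift {-1, 0, +1} codes to bytes {0, 1, 2} and LZMA-compress.
    Zero-heavy ternary tensors compress very well."""
    raw = (np.asarray(codes, dtype=np.int8) + 1).astype(np.uint8).tobytes()
    return lzma.compress(raw, preset=9)

def unpack_ternary(blob):
    """Inverse: decompress and shift back to {-1, 0, +1}."""
    return np.frombuffer(lzma.decompress(blob), dtype=np.uint8).astype(np.int8) - 1
```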

Novel Contributions

  • Extended training of the ternary U-Net Transformer to 100k steps without a wallclock cap
  • Enabled SmearGate during extended training
  • Switched ternary scale storage from FP16 to BF16 to reduce roundtrip gap at longer training
  • Increased embedding dimension from 254 to 312 while staying within the 16MB artifact budget
  • Demonstrated improved scaling behavior and a lower zero fraction in the ternary weights with longer training