PR #923 (open)

[Notable Non-Record Submission] 1.1090 BPB - 74.3M Ternary U-Net Transformer (100k steps/3h)

by CiprianFlorin-Ifrim
val_bpb
1.1090
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.95 MB

Training Techniques

Architecture
SmearGate
Learnable per-block gating for residual smoothing.
parameters: null
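SmearGate does not appear to be a standard published module; as a hedged sketch, a learnable per-block scalar could gate how much of each block's output enters the residual stream (the gate granularity and sigmoid parameterization here are assumptions):

```python
import numpy as np

def smear_gate(residual, block_out, alpha):
    """Hypothetical per-block residual gate: a learnable scalar `alpha`
    (sigmoid-squashed into (0, 1)) smooths how strongly the block's
    output is mixed into the residual stream."""
    g = 1.0 / (1.0 + np.exp(-alpha))
    return residual + g * block_out
```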
U-Net skip connections
Long skip connections between mirrored early and late Transformer blocks, U-Net style.
parameters: null
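A minimal sketch of the long-skip pattern: activations from the first half of the block stack are saved and added back into the mirrored second half (blocks are stand-in callables; the real model's combine rule may differ):

```python
def unet_transformer(x, encoder_blocks, decoder_blocks):
    """U-Net-style Transformer: save each early block's activation,
    then add it back into the mirrored late block (last saved, first
    reused)."""
    skips = []
    for block in encoder_blocks:
        x = block(x)
        skips.append(x)
    for block in decoder_blocks:
        x = block(x + skips.pop())   # mirrored long skip connection
    return x
```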
weight tying
Input and output embedding matrices share weights.
parameters: null
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
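With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads. A numpy sketch of the head mapping (shapes and softmax only; masking and projections omitted):

```python
import numpy as np

def gqa(q, k, v):
    """Grouped-query attention: q has more heads than k/v, and each KV
    head is shared by heads // kv_heads consecutive query heads.
    Shapes: q is (heads, T, d); k and v are (kv_heads, T, d)."""
    heads, T, d = q.shape
    kv_heads = k.shape[0]
    group = heads // kv_heads            # 8 // 4 = 2 query heads per KV head
    out = np.empty_like(q)
    for h in range(heads):
        kh = h // group                  # map query head -> shared KV head
        s = q[h] @ k[kh].T / np.sqrt(d)
        s = np.exp(s - s.max(-1, keepdims=True))
        out[h] = (s / s.sum(-1, keepdims=True)) @ v[kh]
    return out
```

Halving the KV heads shrinks the KV projection weights, which matters under a 16MB artifact budget.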
ReLU²
Squared ReLU activation in the MLP.
parameters: null
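Squared ReLU is simply:

```python
import numpy as np

def relu2(x):
    """Squared ReLU: zero for negative inputs, x**2 for positive."""
    return np.square(np.maximum(x, 0.0))
```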
Optimizer
Muon
weight_decay: 0
momentum: 0.95
other_params: {"adam_lr":0.05,"adam_wd":0.05,"matrix_lr":0.04,"scalar_lr":0.02,"tied_embed_lr":0.02}
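Muon updates 2-D weight matrices with momentum followed by approximate orthogonalization via a quintic Newton-Schulz iteration; a sketch of that core step (coefficients and step count follow the public Muon reference implementation; the momentum and learning-rate plumbing around it is omitted):

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Quintic Newton-Schulz iteration used by Muon to approximately
    orthogonalize a 2-D update matrix, pushing its singular values
    toward 1 without an explicit SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # normalize so singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                          # keep the Gram matrix small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```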
Weight Averaging
EMA
parameters: null
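An EMA of the weights is kept alongside training and used for evaluation; the submission records no decay value, so the 0.999 below is an assumption:

```python
import numpy as np

class EMA:
    """Exponential moving average of model weights for evaluation.
    The decay of 0.999 is an assumption; the submission leaves it
    unrecorded."""
    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = [p.copy() for p in params]

    def update(self, params):
        for s, p in zip(self.shadow, params):
            s *= self.decay
            s += (1.0 - self.decay) * p
```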
Evaluation
sliding window eval
parameters: {"stride":16,"temperature":0.9}
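With stride 16, after the first full window only the newest 16 tokens are scored at each window position, so every token keeps near-maximal left context. A sketch with a stand-in scoring function (`score_fn` is hypothetical, and the logged temperature of 0.9 is not modeled here):

```python
import numpy as np

def sliding_window_nll(score_fn, tokens, window=2048, stride=16):
    """Sliding-window evaluation: score the first window in full, then
    advance by `stride`, scoring only each window's final `stride`
    tokens. `score_fn(ctx, n)` stands in for the model and returns
    per-token NLLs of ctx's final n tokens."""
    nlls = list(score_fn(tokens[:window], min(window, len(tokens))))
    for start in range(window, len(tokens), stride):
        n = min(stride, len(tokens) - start)
        ctx = tokens[max(0, start + n - window):start + n]
        nlls.extend(score_fn(ctx, n))
    return float(np.mean(nlls))
```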
Sequence Length
sequence_length
train_length: 1024
eval_length: 2048
LR Schedule
warmdown
parameters: {"fraction":0.15}
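A warmdown fraction of 0.15 means the learning rate is held constant, then decayed linearly to zero over the final 15% of steps; as a sketch:

```python
def warmdown_lr(step, total_steps, base_lr, fraction=0.15):
    """Trapezoidal 'warmdown' schedule: hold base_lr, then decay
    linearly to zero over the final `fraction` of training."""
    decay_steps = int(total_steps * fraction)
    hold = total_steps - decay_steps
    if step < hold:
        return base_lr
    return base_lr * (total_steps - step) / decay_steps
```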
Regularization
logit softcap
parameters: {"value":10}
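A logit softcap of 10 bounds logits smoothly via tanh:

```python
import numpy as np

def softcap(logits, cap=10.0):
    """Soft-cap: cap * tanh(logits / cap) keeps every logit inside
    (-cap, cap) while staying near-identity for small values."""
    return cap * np.tanh(np.asarray(logits) / cap)
```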
Quantization
QAT
bits: null
scope: all
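The submission logs scope=all with bits unrecorded; given the ternary title, a hedged sketch of the forward quantizer (the absmean threshold/scale recipe is an assumption, not necessarily the submission's exact scheme; in QAT the backward pass would pass gradients straight through this step):

```python
import numpy as np

def ternarize(w):
    """Forward pass of ternary quantization: map weights to
    {-1, 0, +1} * scale. Threshold and scale follow the common
    absmean recipe (an assumption)."""
    scale = np.abs(w).mean()
    codes = np.where(np.abs(w) > 0.5 * scale, np.sign(w), 0.0)
    return codes * scale, codes
```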
Compression
lzma
level: null
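The 15.95 MB artifact fits the 16MB budget partly because LZMA compresses the ternary codes well; a sketch of packing and round-tripping them (the level is unrecorded, so preset 9 is an assumption):

```python
import lzma

import numpy as np

def pack_ternary(codes):
    """Shift {-1, 0, +1} codes to bytes {0, 1, 2} and LZMA-compress.
    Zero-heavy ternary tensors compress very well."""
    raw = (np.asarray(codes, dtype=np.int8) + 1).astype(np.uint8).tobytes()
    return lzma.compress(raw, preset=9)

def unpack_ternary(blob):
    """Inverse: decompress and shift back to {-1, 0, +1}."""
    return np.frombuffer(lzma.decompress(blob), dtype=np.uint8).astype(np.int8) - 1
```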

Novel Contributions

  • Extended training of the ternary U-Net Transformer to 100k steps without a wallclock cap
  • Enabled SmearGate during extended training
  • Switched ternary scale storage from FP16 to BF16 to reduce roundtrip gap at longer training
  • Increased embedding dimension from 254 to 312 while staying within the 16MB artifact budget
  • Demonstrated improved scaling behavior and a lower zero fraction in the ternary weights with longer training