PR #920
[Record Submission] 74.3M Ternary U-Net Transformer (v2, continuation of #640)
by CiprianFlorin-Ifrim
val_bpb
1.1539
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.95 MB
Training Techniques
Quantization
QAT
bits: 8
scope: FP8 path / model artifact
Architecture
U-Net skip connections
U-Net-style skip connections added to the Transformer backbone.
parameters: null
GQA
Grouped query attention with 8 query heads and 4 KV heads.
parameters: {"num_heads":8,"num_kv_heads":4}
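With 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads, shrinking the KV cache without changing the attention math. A minimal sketch of the score computation (shapes and broadcasting only; the actual model's layout may differ):

```python
import numpy as np

def gqa_scores(q, k, num_heads=8, num_kv_heads=4):
    """Grouped-query attention sketch: each group of
    num_heads // num_kv_heads query heads shares one KV head.
    q: (num_heads, T, d), k: (num_kv_heads, T, d)."""
    group = num_heads // num_kv_heads          # 2 query heads per KV head
    k_rep = np.repeat(k, group, axis=0)        # (num_heads, T, d)
    d = q.shape[-1]
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(d)  # (num_heads, T, T)
```

The same `np.repeat` trick applies to the value tensor; in practice the repeat is often fused into the attention kernel rather than materialized.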
ReLU²
Uses the ReLU² (squared ReLU) activation in the MLP.
parameters: null
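ReLU² simply squares the ReLU output, which keeps the zero region sparse while smoothing the positive branch:

```python
import numpy as np

def relu2(x):
    # relu2(x) = max(x, 0)^2 — squared ReLU used in the MLP
    return np.square(np.maximum(x, 0.0))
```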
weight tying
Tied input and output embeddings.
parameters: null
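Weight tying uses one matrix for both the token embedding and the output projection, which matters under a 16MB artifact budget. A sketch with hypothetical sizes (the 312-dim bottleneck is from the notes below; the vocab size here is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 1000, 312                  # vocab size is illustrative only
W = rng.standard_normal((vocab, dim)) * 0.02   # the single shared matrix

def embed(token_ids):
    return W[token_ids]                 # (T,) -> (T, dim)

def unembed(hidden):
    return hidden @ W.T                 # (T, dim) -> (T, vocab) logits
```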
RoPE
Uses YaRN-scaled rotary position embeddings.
parameters: {"type":"yarn","max_len":2048}
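For reference, plain RoPE rotates each pair of channels by a position-dependent angle; YaRN (used here) additionally rescales the per-frequency wavelengths for long contexts, which this sketch omits:

```python
import numpy as np

def rope(x, base=10000.0):
    """Plain rotary position embedding sketch (no YaRN scaling).
    x: (T, d) with d even; channel pairs (2i, 2i+1) are rotated."""
    T, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,)
    ang = np.outer(np.arange(T), inv_freq)         # (T, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is a pure rotation, RoPE preserves vector norms; position 0 is left unchanged.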
KV head count
Reduced KV head count relative to query heads.
parameters: {"num_kv_heads":4}
Optimizer
Muon
weight_decay: 0
momentum: 0.95
other_params: {"adam_lr":0.05,"adam_wd":0.05,"matrix_lr":0.04,"scalar_lr":0.02,"tied_embed_lr":0.02}
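Muon applies momentum SGD to 2D weight matrices and then approximately orthogonalizes each update with a quintic Newton–Schulz iteration; scalars and embeddings fall back to Adam, which is why separate `adam_lr` / `scalar_lr` values appear above. A sketch of the orthogonalization step (coefficients from the public Muon implementation; a square matrix is assumed for brevity):

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Quintic Newton-Schulz iteration: drives the singular values of G
    toward 1, yielding an approximately orthogonal update matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize by Frobenius norm
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X
```

The iteration is matmul-only, so it runs well in low precision on accelerators; the result is "approximately" orthogonal (singular values near, not exactly, 1).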
Compression
lzma
level: null
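LZMA compression of the serialized weights is what lets a 74.3M-parameter ternary model fit in a 15.95 MB artifact: ternary values carry at most ~1.58 bits of entropy per weight, far below their 8-bit storage. A hypothetical round-trip (the actual serialization layout is not specified in the card):

```python
import lzma
import numpy as np

# Hypothetical packing: ternary weights stored as int8, then lzma-compressed
weights = np.random.default_rng(0).integers(-1, 2, size=100_000).astype(np.int8)
blob = lzma.compress(weights.tobytes(), preset=9)
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8)
```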
Evaluation
sliding window eval
parameters: {"stride":16}
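Sliding-window evaluation with stride 16 gives every scored token close to the full 2048-token context: the window advances 16 tokens at a time, and only the newly uncovered tokens are scored. A sketch of the window bookkeeping (the scoring loop itself is omitted):

```python
def sliding_windows(n_tokens, window=2048, stride=16):
    """Yield (begin, end, n_scored): each window spans [begin, end),
    but only the tokens past the previous window's end are scored."""
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        yield begin, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break
```

A small stride trades much more compute (one forward pass per 16 tokens) for a lower, context-saturated BPB.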
Sequence Length
sequence_length
train_length: 1024
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_fraction":0.15}
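A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final fraction of training, here the last 15% of steps. A minimal sketch:

```python
def lr_at(step, total_steps, base_lr, warmdown_fraction=0.15):
    """Constant LR, then linear decay to 0 over the final
    warmdown_fraction of training."""
    warmdown_start = int(total_steps * (1 - warmdown_fraction))
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - warmdown_start)
```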
Regularization
logit softcap
parameters: {"type":"poly","value":10}
Novel Contributions
- BF16 scale storage for ternary dequantization scales, reducing roundtrip gap without increasing artifact size
- Increased embedding bottleneck from 254 to 312 to improve representation quality while staying under the 16MB artifact budget
- Adjusted warmdown fraction from 0.2 to 0.15 based on extended training experiments
- Improved validation BPB and cross-seed reproducibility over the original #640 submission
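The first bullet above can be sketched end to end: ternary weights need a per-channel dequantization scale, and storing that scale in BF16 rather than FP8 cuts roundtrip error at negligible size cost. Everything below is illustrative (the mean-abs scale choice and row-wise grouping are assumptions, and BF16 is emulated by truncating float32, not round-to-nearest-even):

```python
import numpy as np

def to_bf16(x):
    """Emulate BF16 storage by keeping only the top 16 bits of float32."""
    u = x.astype(np.float32).view(np.uint32)
    return (u & np.uint32(0xFFFF0000)).view(np.float32)

def ternarize(W):
    """Ternary quantization sketch: W -> Q in {-1, 0, +1} plus a
    per-row dequant scale stored in BF16. Scale choice is hypothetical."""
    scale = np.abs(W).mean(axis=1, keepdims=True)
    Q = np.where(np.abs(W) > 0.5 * scale, np.sign(W), 0.0).astype(np.int8)
    return Q, to_bf16(scale)

# Dequantize: W_hat = Q * scale
```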