PR #730
Fix: move Ternary UNet submission folder from track_10min_16mb to track_non_record_16mb
by janwww
val_bpb
1.1570
Architecture
Ternary U-Net Transformer
Optimizer
Muon
Artifact Size
15.99 MB
Training Techniques
Quantization
QAT
bits: 8
scope: fp_params / model artifact
ternary
bits: 2
scope: weights
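A minimal sketch of ternary (2-bit) weight quantization with a straight-through estimator. The threshold heuristic (`delta_frac`) and scale rule are assumptions in the style of ternary weight networks; the submission does not state its exact quantizer.

```python
import numpy as np

def ternary_quantize(w, delta_frac=0.7):
    # Threshold as a fraction of mean |w| (TWN-style heuristic; an assumption,
    # the submission does not state its threshold rule)
    delta = delta_frac * np.abs(w).mean()
    mask = np.abs(w) > delta
    q = np.sign(w) * mask                              # values in {-1, 0, +1}
    scale = np.abs(w[mask]).mean() if mask.any() else 1.0
    return q, scale

def ste_backward(grad_out):
    # Straight-through estimator: the quantizer's gradient is treated as
    # identity, so gradients flow to the latent fp weights unchanged
    return grad_out
```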
Architecture
U-Net
U-Net encoder/decoder with learned skip weights and residual mixing on top of a Transformer backbone
parameters: {"layers":10,"dim":768,"heads":8,"kv_heads":4}
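One plausible reading of "learned skip weights and residual mixing": a learned scalar per encoder/decoder pair scales the encoder activation before it is added to the decoder stream. The exact mixing rule is an assumption.

```python
import numpy as np

class LearnedSkip:
    """Mixes a stored encoder activation into the matching decoder layer."""
    def __init__(self, n_pairs):
        # one learned scalar per encoder/decoder pair, initialized to ones
        self.w = np.ones(n_pairs)

    def __call__(self, decoder_in, encoder_out, i):
        # residual mixing: decoder stream plus weighted encoder skip
        return decoder_in + self.w[i] * encoder_out
```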
tied embeddings
Factored tied embedding with a 254-dimensional bottleneck
parameters: {"embed_dim":254,"vocab_size":8192}
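The factored tied embedding replaces one 8192×768 matrix with two factors through a 254-dimensional bottleneck, reused (transposed) as the output head. A sketch under that reading; initialization scale is an assumption.

```python
import numpy as np

vocab_size, bottleneck, d_model = 8192, 254, 768
rng = np.random.default_rng(0)
E = rng.normal(scale=0.02, size=(vocab_size, bottleneck))  # token -> bottleneck factor
P = rng.normal(scale=0.02, size=(bottleneck, d_model))     # bottleneck -> model dim

def embed(token_ids):
    return E[token_ids] @ P

def logits(hidden):
    # tying: the output head reuses the same two factors, transposed
    return (hidden @ P.T) @ E.T
```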
RoPE
YaRN positional encoding variant
parameters: {"max_len":2048,"rope_base":5000}
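A sketch of plain RoPE with the listed base of 5000; the YaRN-specific frequency rescaling for context extension is omitted here, so treat this as the baseline the variant builds on.

```python
import numpy as np

def rope(x, base=5000.0):
    # x: (seq, dim); rotate channel pairs by position-dependent angles
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = np.outer(np.arange(seq), freqs)    # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```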
MLP3x
4x relu² MLP expansion with fused gate and up projection
parameters: {"mlp_mult":4}
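A sketch of a 4x relu² MLP where one matmul produces both the gate and up halves ("fused gate and up projection"). The split layout and the gating form `relu²(gate) * up` are assumptions; the submission only names the technique.

```python
import numpy as np

d_model, mlp_mult = 768, 4
rng = np.random.default_rng(0)
# single fused matmul produces both gate and up halves
W_fused = rng.normal(scale=0.02, size=(d_model, 2 * mlp_mult * d_model))
W_down = rng.normal(scale=0.02, size=(mlp_mult * d_model, d_model))

def relu2(x):
    return np.maximum(x, 0.0) ** 2

def mlp(x):
    gate, up = np.split(x @ W_fused, 2, axis=-1)
    return (relu2(gate) * up) @ W_down
```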
Optimizer
Muon
weight_decay: 0
momentum: 0.95
other_params: {"backend_steps":3,"momentum_warmup_start":0.85,"momentum_warmup_steps":500}
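The momentum warmup parameters imply a schedule from 0.85 to 0.95 over the first 500 steps. A linear shape is an assumption; the listing only gives the endpoints and step count.

```python
def muon_momentum(step, start=0.85, end=0.95, warmup_steps=500):
    # linear warmup of momentum over the first `warmup_steps` steps,
    # then held at `end` (linear shape assumed)
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```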
Evaluation
sliding window eval
parameters: {"stride":16}
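Sliding-window eval with stride 16 can be read as: each window re-reads up to a full context of tokens but scores loss only on the tokens not yet scored, so every token is scored exactly once with near-maximal context. A sketch of the span bookkeeping under that reading:

```python
def sliding_eval_spans(n_tokens, window=1024, stride=16):
    # returns (ctx_start, ctx_end, score_start, score_end) tuples;
    # loss is computed only on [score_start, score_end)
    spans, scored_to, start = [], 0, 0
    while scored_to < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_to, end))
        scored_to = end
        start += stride
    return spans
```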
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"fraction":0.2}
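A warmdown schedule with fraction 0.2 typically means: constant LR, then linear decay to zero over the final 20% of steps. A sketch under that common convention:

```python
def warmdown_lr(step, total_steps, base_lr=1.0, fraction=0.2):
    # constant LR, then linear decay to zero over the final `fraction` of training
    decay_start = int(total_steps * (1 - fraction))
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - decay_start)
```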
Regularization
weight decay
parameters: {"adam_wd":0.05}
Initialization
ones-init
Learned skip weights initialized to ones
Compression
lzma
level: 9
Other
other
Base-3 packing for model compression
parameters: {"packing":"base-3 + LZMA"}
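The "base-3 + LZMA" packing can be sketched as follows: map ternary values {-1, 0, +1} to {0, 1, 2} and pack five trits per byte (3⁵ = 243 ≤ 256), then compress with LZMA at level 9. The five-trits-per-byte layout is an assumption; the card only names the scheme.

```python
import lzma

def pack_ternary(trits):
    # map {-1, 0, 1} -> {0, 1, 2}, pack 5 trits per byte, then LZMA level 9
    vals = [t + 1 for t in trits]
    out = bytearray()
    for i in range(0, len(vals), 5):
        chunk = vals[i:i + 5]
        chunk += [0] * (5 - len(chunk))       # zero-pad the final group
        byte = 0
        for v in reversed(chunk):
            byte = byte * 3 + v
        out.append(byte)
    return lzma.compress(bytes(out), preset=9)

def unpack_ternary(blob, n):
    raw = lzma.decompress(blob)
    trits = []
    for byte in raw:
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:n]
```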
Novel Contributions
- Ternary U-Net Transformer architecture
- NeoMuon optimization for ternary STE gradient attenuation
- 4x relu² MLP expansion
- Factored tied embedding with 254-dimensional bottleneck
- YaRN positional encoding with an 8192-token BPE vocabulary
- FP8 QAT to reduce artifact size
- Base-3 + LZMA compression
- Sliding-window evaluation with stride 16