PR #920
[Record Submission] 74.3M Ternary U-Net Transformer (v2, continuation of #640)
by CiprianFlorin-Ifrim
val_bpb
1.1539
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.95 MB
Training Techniques
Quantization
QAT
bits: 8
scope: FP8 path / model artifact
Architecture
U-Net skip connections
U-Net-style skip connections added to the Transformer backbone.
parameters: null
GQA
Grouped query attention with 8 query heads and 4 KV heads.
parameters: {"num_heads":8,"num_kv_heads":4}
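With 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads, shrinking the KV cache without changing the attention math. A minimal sketch of the score computation (shapes and broadcasting only; the actual model's layout may differ):

```python
import numpy as np

def gqa_scores(q, k, num_heads=8, num_kv_heads=4):
    """Grouped-query attention sketch: each group of
    num_heads // num_kv_heads query heads shares one KV head.
    q: (num_heads, T, d), k: (num_kv_heads, T, d)."""
    group = num_heads // num_kv_heads          # 2 query heads per KV head
    k_rep = np.repeat(k, group, axis=0)        # (num_heads, T, d)
    d = q.shape[-1]
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(d)  # (num_heads, T, T)
```

The same `np.repeat` trick applies to the value tensor; in practice the repeat is often fused into the attention kernel rather than materialized.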
ReLU²
Uses the ReLU² (squared ReLU) activation in the MLP.
parameters: null
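ReLU² simply squares the ReLU output, which keeps the zero region sparse while smoothing the positive branch:

```python
import numpy as np

def relu2(x):
    # relu2(x) = max(x, 0)^2 — squared ReLU used in the MLP
    return np.square(np.maximum(x, 0.0))
```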
weight tying
Tied input and output embeddings.
parameters: null
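Weight tying uses one matrix for both the token embedding and the output projection, which matters under a 16MB artifact budget. A sketch with hypothetical sizes (the 312-dim bottleneck is from the notes below; the vocab size here is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 1000, 312                  # vocab size is illustrative only
W = rng.standard_normal((vocab, dim)) * 0.02   # the single shared matrix

def embed(token_ids):
    return W[token_ids]                 # (T,) -> (T, dim)

def unembed(hidden):
    return hidden @ W.T                 # (T, dim) -> (T, vocab) logits
```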
RoPE
Uses YaRN-scaled rotary position embeddings.
parameters: {"type":"yarn","max_len":2048}
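For reference, plain RoPE rotates each pair of channels by a position-dependent angle; YaRN (used here) additionally rescales the per-frequency wavelengths for long contexts, which this sketch omits:

```python
import numpy as np

def rope(x, base=10000.0):
    """Plain rotary position embedding sketch (no YaRN scaling).
    x: (T, d) with d even; channel pairs (2i, 2i+1) are rotated."""
    T, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,)
    ang = np.outer(np.arange(T), inv_freq)         # (T, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is a pure rotation, RoPE preserves vector norms; position 0 is left unchanged.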
KV head count
Reduced KV head count relative to query heads.
parameters: {"num_kv_heads":4}
Optimizer
Muon
weight_decay: 0
momentum: 0.95
other_params: {"adam_lr":0.05,"adam_wd":0.05,"matrix_lr":0.04,"scalar_lr":0.02,"tied_embed_lr":0.02}
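Muon applies momentum SGD to 2D weight matrices and then approximately orthogonalizes each update with a quintic Newton–Schulz iteration; scalars and embeddings fall back to Adam, which is why separate `adam_lr` / `scalar_lr` values appear above. A sketch of the orthogonalization step (coefficients from the public Muon implementation; a square matrix is assumed for brevity):

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Quintic Newton-Schulz iteration: drives the singular values of G
    toward 1, yielding an approximately orthogonal update matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize by Frobenius norm
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X
```

The iteration is matmul-only, so it runs well in low precision on accelerators; the result is "approximately" orthogonal (singular values near, not exactly, 1).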
Compression
lzma
level: null
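LZMA compression of the serialized weights is what lets a 74.3M-parameter ternary model fit in a 15.95 MB artifact: ternary values carry at most ~1.58 bits of entropy per weight, far below their 8-bit storage. A hypothetical round-trip (the actual serialization layout is not specified in the card):

```python
import lzma
import numpy as np

# Hypothetical packing: ternary weights stored as int8, then lzma-compressed
weights = np.random.default_rng(0).integers(-1, 2, size=100_000).astype(np.int8)
blob = lzma.compress(weights.tobytes(), preset=9)
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8)
```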
Evaluation
sliding window eval
parameters: {"stride":16}
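Sliding-window evaluation with stride 16 gives every scored token close to the full 2048-token context: the window advances 16 tokens at a time, and only the newly uncovered tokens are scored. A sketch of the window bookkeeping (the scoring loop itself is omitted):

```python
def sliding_windows(n_tokens, window=2048, stride=16):
    """Yield (begin, end, n_scored): each window spans [begin, end),
    but only the tokens past the previous window's end are scored."""
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        yield begin, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break
```

A small stride trades much more compute (one forward pass per 16 tokens) for a lower, context-saturated BPB.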
Sequence Length
sequence_length
train_length: 1024
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_fraction":0.15}
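A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final fraction of training, here the last 15% of steps. A minimal sketch:

```python
def lr_at(step, total_steps, base_lr, warmdown_fraction=0.15):
    """Constant LR, then linear decay to 0 over the final
    warmdown_fraction of training."""
    warmdown_start = int(total_steps * (1 - warmdown_fraction))
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - warmdown_start)
```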
Regularization
logit softcap
parameters: {"type":"poly","value":10}
Novel Contributions
- BF16 scale storage for ternary dequantization scales, reducing roundtrip gap without increasing artifact size
- Increased embedding bottleneck from 254 to 312 to improve representation quality while staying under the 16MB artifact budget
- Adjusted warmdown fraction from 0.2 to 0.15 based on extended training experiments
- Improved validation BPB and cross-seed reproducibility over the original #640 submission
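The first bullet above can be sketched end to end: ternary weights need a per-channel dequantization scale, and storing that scale in BF16 rather than FP8 cuts roundtrip error at negligible size cost. Everything below is illustrative (the mean-abs scale choice and row-wise grouping are assumptions, and BF16 is emulated by truncating float32, not round-to-nearest-even):

```python
import numpy as np

def to_bf16(x):
    """Emulate BF16 storage by keeping only the top 16 bits of float32."""
    u = x.astype(np.float32).view(np.uint32)
    return (u & np.uint32(0xFFFF0000)).view(np.float32)

def ternarize(W):
    """Ternary quantization sketch: W -> Q in {-1, 0, +1} plus a
    per-row dequant scale stored in BF16. Scale choice is hypothetical."""
    scale = np.abs(W).mean(axis=1, keepdims=True)
    Q = np.where(np.abs(W) > 0.5 * scale, np.sign(W), 0.0).astype(np.int8)
    return Q, to_bf16(scale)

# Dequantize: W_hat = Q * scale
```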