PR #216 (open)
Ternary Universal Transformer — 15.6MB, bfloat16, Muon optimizer
Add ternary Universal Transformer submission
by alons23
val_bpb
0.8100
Architecture
Universal Transformer
Optimizer
Muon
Artifact Size
15.6MB
Training Techniques
Quantization
ternary
bits: null
scope: weights
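A minimal sketch of ternary weight quantization, assuming an absmean-style scheme with a straight-through estimator (as in BitNet b1.58); the PR does not specify its exact quantizer, so the scaling and threshold here are illustrative:

```python
import torch

def ternarize(w: torch.Tensor, eps: float = 1e-8):
    """Quantize a weight tensor to {-1, 0, +1} with a per-tensor scale.

    Absmean-style ternarization (illustrative; the submission's exact
    scheme is not specified): scale by the mean absolute value, then
    round to the nearest ternary level.
    """
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale

class TernaryLinear(torch.nn.Module):
    """Linear layer with ternary weights and a straight-through estimator."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q, scale = ternarize(self.weight)
        # Straight-through estimator: the forward pass uses quantized weights,
        # while gradients flow to the full-precision master weights.
        w = self.weight + (w_q * scale - self.weight).detach()
        return torch.nn.functional.linear(x, w)
```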
Architecture
depth recurrence
Universal Transformer with depth recurrence: the same 4 blocks are applied over 6 recurrences, giving 24 effective layers.
parameters: {"blocks":4,"recurrences":6,"effective_layers":24}
QK-Norm
Normalization applied to query/key projections.
parameters: null
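QK-Norm is commonly implemented as RMSNorm on the query and key vectors before the attention logits are computed; a sketch under that assumption (torch.nn.RMSNorm requires PyTorch >= 2.4):

```python
import torch
import torch.nn.functional as F

class QKNormAttention(torch.nn.Module):
    """Attention with RMSNorm on queries and keys (one common QK-Norm variant)."""

    def __init__(self, head_dim: int):
        super().__init__()
        self.q_norm = torch.nn.RMSNorm(head_dim)
        self.k_norm = torch.nn.RMSNorm(head_dim)

    def forward(self, q, k, v):  # shapes: (batch, heads, seq, head_dim)
        # Normalizing q and k bounds the attention logits, which helps
        # training stability, especially at low precision.
        q, k = self.q_norm(q), self.k_norm(k)
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```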
RoPE
Rotary positional embeddings used in attention.
parameters: null
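Rotary embeddings rotate each channel pair of the queries and keys by a position-dependent angle; a self-contained sketch using the interleaved-pair convention (conventions vary, and the submission's is not shown):

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional embeddings to x of shape (..., seq, dim)."""
    seq, dim = x.shape[-2], x.shape[-1]
    inv_freq = base ** (-torch.arange(0, dim, 2, device=x.device) / dim)
    theta = torch.arange(seq, device=x.device)[:, None] * inv_freq[None, :]
    cos, sin = theta.cos(), theta.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Rotate each (x1, x2) channel pair by its position-dependent angle.
    out = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return out.flatten(-2)
```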
FlashAttention-2
Uses FlashAttention-2 for efficient attention computation.
parameters: null
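One common way to route attention through FlashAttention-2 in PyTorch is to pin scaled_dot_product_attention to the flash backend; the submission may instead use the flash-attn package directly, so this is only a sketch:

```python
import torch
import torch.nn.functional as F

def flash_attention(q, k, v):
    # Restrict SDPA to the FlashAttention kernel; PyTorch dispatches to
    # FlashAttention-2 when the hardware and dtypes (e.g. bfloat16) allow it.
    with torch.nn.attention.sdpa_kernel(torch.nn.attention.SDPBackend.FLASH_ATTENTION):
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```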
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
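All Muon hyperparameters above are null; for reference, Muon applies SGD momentum per 2D weight matrix and orthogonalizes the update with a quintic Newton-Schulz iteration before the step. A sketch with illustrative lr/momentum values (iteration coefficients follow the public reference implementation):

```python
import torch

@torch.no_grad()
def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a matrix via the quintic Newton-Schulz
    iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon reference impl
    X = G / (G.norm() + 1e-7)
    transpose = X.shape[-2] > X.shape[-1]
    if transpose:
        X = X.mT
    for _ in range(steps):
        A = X @ X.mT
        X = a * X + (b * A + c * A @ A) @ X
    return X.mT if transpose else X

@torch.no_grad()
def muon_step(param, momentum_buf, lr=0.02, momentum=0.95):
    # momentum_buf starts as torch.zeros_like(param); plain heavy-ball
    # momentum here (the reference also supports a Nesterov variant).
    momentum_buf.mul_(momentum).add_(param.grad)
    update = newton_schulz(momentum_buf)
    param.add_(update, alpha=-lr)
```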
Other
bfloat16 precision
Uses bfloat16 for training and inference.
parameters: {"precision":"bfloat16"}
Novel Contributions
- Ternary Universal Transformer: weights quantized to {-1, 0, +1}
- Muon optimizer
- Universal Transformer with 4 blocks and 6 recurrences (24 effective layers)
- QK-Norm
- RoPE
- FlashAttention-2
- bfloat16 artifact