PR #1246
Record: Trinity Ternary GPT — val_bpb 0.9650 (ternary roundtrip)
by deborahnelson8788726
val_bpb: 0.9650
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.2 MB
Training Techniques
Quantization
QAT
bits: null
scope: all large weight matrices
ternary
bits: null
scope: all large weight matrices
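For reference, the ternary QAT entry above pairs with the absmean scaling mentioned in the contributions list. A minimal sketch of BitNet b1.58-style absmean ternary quantization (function names are illustrative, not from the submission):

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Absmean ternary quantization (BitNet b1.58 style): scale by the
    mean absolute value, then round and clip to {-1, 0, +1}."""
    scale = np.abs(w).mean() + eps
    q = np.clip(np.round(w / scale), -1, 1)
    return q, scale

def dequantize(q, scale):
    return q * scale

w = np.array([[0.31, -0.02, -0.55], [0.09, 0.72, -0.18]])
q, s = ternary_quantize(w)
w_hat = dequantize(q, s)   # in QAT, the forward pass would use w_hat
                           # with a straight-through estimator backward
```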
Architecture
GQA
Uses grouped query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
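A shape-level sketch of grouped-query attention with the listed 8 query heads and 4 KV heads; the assumption here is that the KV projection is half as wide as the query projection, so each KV head is shared by 2 query heads:

```python
import numpy as np

def gqa(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: each KV head serves n_heads // n_kv_heads
    query heads (here 2)."""
    B, T, D = q.shape
    hd = D // n_heads
    group = n_heads // n_kv_heads
    qh = q.reshape(B, T, n_heads, hd).transpose(0, 2, 1, 3)
    kh = k.reshape(B, T, n_kv_heads, hd).transpose(0, 2, 1, 3)
    vh = v.reshape(B, T, n_kv_heads, hd).transpose(0, 2, 1, 3)
    kh = np.repeat(kh, group, axis=1)   # share each KV head across its group
    vh = np.repeat(vh, group, axis=1)
    att = qh @ kh.transpose(0, 1, 3, 2) / np.sqrt(hd)
    att = np.exp(att - att.max(-1, keepdims=True))
    att = att / att.sum(-1, keepdims=True)
    out = att @ vh
    return out.transpose(0, 2, 1, 3).reshape(B, T, D)

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 5, 64))
k = rng.normal(size=(2, 5, 32))   # KV projection half-width: 4 heads x head_dim 8
v = rng.normal(size=(2, 5, 32))
out = gqa(q, k, v)
```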
MLP
Uses 4x MLP expansion with ReLU² activation.
parameters: {"expansion":4}
ReLU²
Squared ReLU activation in the MLP.
parameters: null
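The MLP and ReLU² entries above combine into a simple feed-forward block; a sketch with the listed 4x expansion (weight shapes are illustrative):

```python
import numpy as np

def relu2(x):
    """Squared ReLU: relu(x) ** 2."""
    return np.maximum(x, 0.0) ** 2

def mlp(x, w1, w2):
    """Feed-forward block: 4x expansion, ReLU^2, projection back down."""
    return relu2(x @ w1) @ w2

rng = np.random.default_rng(0)
d = 8
w1 = rng.normal(size=(d, 4 * d))   # 4x expansion per the listed parameters
w2 = rng.normal(size=(4 * d, d))
y = mlp(rng.normal(size=(2, d)), w1, w2)
```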
Partial RoPE
Applies rotary position embeddings to only part of the head dimension.
parameters: {"dimensions":16,"total_dimensions":96}
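Per the listed parameters, rotary embeddings cover only 16 of the 96 head dimensions. A sketch of partial RoPE under the assumption that the rotated dims come first and use a standard inverse-frequency layout:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` of each
    head dimension; the remaining dims pass through unchanged."""
    T, hd = x.shape[-2], x.shape[-1]
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)           # (half,)
    angles = np.arange(T)[:, None] * freqs[None, :]     # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)

x = np.ones((4, 96))        # (seq_len, head_dim)
y = partial_rope(x)
```

Rotation preserves the norm of each rotated pair, so only phase information changes; the untouched 80 dims carry position-free content.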
U-Net skip connections
Adds learned skip connections between layers.
parameters: null
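The U-Net entry gives no parameters, so the exact wiring is not specified here. One common pattern in speedrun-style 10-layer models (an assumption, not confirmed by this record) is: the first half of the layers push activations onto a stack, and each second-half layer adds a popped activation scaled by a learned scalar:

```python
import numpy as np

def unet_forward(x, layers, skip_weights):
    """U-Net-style skips over an even layer stack (assumed wiring):
    encoder half pushes activations; decoder half pops and adds them,
    each scaled by a learned scalar."""
    n = len(layers)
    stack = []
    for layer in layers[: n // 2]:
        x = layer(x)
        stack.append(x)
    for i, layer in enumerate(layers[n // 2 :]):
        x = x + skip_weights[i] * stack.pop()
        x = layer(x)
    return x

layers = [lambda x: x + 1.0] * 10   # stand-in blocks
w = np.zeros(5)                     # learned skip scalars (zero-init)
y = unet_forward(np.zeros(3), layers, w)
```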
Optimizer
Muon
weight_decay: 0
momentum: null
other_params: {"neoMuon":true,"newton_schulz_steps":3}
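Muon orthogonalizes each momentum/gradient matrix with a quintic Newton-Schulz iteration; the record's "neoMuon" variant runs only 3 steps. A sketch using the coefficients from the public Muon implementation (the variant's other changes are not specified here):

```python
import numpy as np

def newton_schulz(G, steps=3):
    """Quintic Newton-Schulz iteration that approximately orthogonalizes
    a matrix, i.e. pushes its singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius norm bounds the spectral norm
    if X.shape[0] > X.shape[1]:
        X = X.T                          # iterate on the short-fat orientation
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if G.shape[0] > G.shape[1]:
        X = X.T
    return X

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 6))
O = newton_schulz(G, steps=3)
```

Fewer steps trade orthogonalization accuracy for wall-clock time; with these coefficients the singular values land near 1 after only a few iterations.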
Weight Averaging
EMA
parameters: {"decay":0.997,"start_step":500}
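A minimal sketch of the EMA weight averaging listed above (decay 0.997, starting at step 500); the exact burn-in behavior before start_step is an assumption:

```python
import numpy as np

class EMA:
    """Exponential moving average of model weights, engaged after a
    burn-in period (decay=0.997, start_step=500 per this record)."""
    def __init__(self, params, decay=0.997, start_step=500):
        self.decay, self.start_step = decay, start_step
        self.shadow = [p.copy() for p in params]

    def update(self, params, step):
        if step < self.start_step:
            self.shadow = [p.copy() for p in params]   # track exactly until start
            return
        d = self.decay
        self.shadow = [d * s + (1 - d) * p for s, p in zip(self.shadow, params)]

w = [np.zeros(3)]
ema = EMA(w)
ema.update([np.ones(3)], step=1000)
```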
Regularization
weight decay
parameters: {"value":0}
logit softcap
parameters: {"z_loss":0.0001}
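The entry is labeled "logit softcap" but the parameter given is a z-loss coefficient (also cited in the contributions list below), so a z-loss sketch is shown here: a penalty on the squared log-partition-function that keeps logit magnitudes in check:

```python
import numpy as np

def z_loss(logits, coef=1e-4):
    """Z-loss regularizer: coef * log(Z)^2 per position, where Z is the
    softmax normalizer; pulls log(Z) toward 0."""
    m = logits.max(axis=-1, keepdims=True)
    log_z = m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1))
    return coef * (log_z ** 2).mean()

uniform = np.log(np.full((1, 4), 0.25))   # already normalized: log Z = 0
loss = z_loss(np.zeros((1, 4)))           # log Z = log 4 here
```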
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
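A warmdown schedule typically means constant learning rate followed by a linear decay to zero over the final steps; a sketch with the listed 3500 warmdown steps (the total step count here is a hypothetical, not from the record):

```python
def lr_scale(step, total_steps, warmdown_steps=3500):
    """Warmdown schedule: full LR until the final `warmdown_steps` steps,
    then linear decay to 0 (assumed trapezoidal shape)."""
    if step < total_steps - warmdown_steps:
        return 1.0
    return (total_steps - step) / warmdown_steps

# e.g. with a hypothetical 10000 total steps:
# full LR through step 6500, then linear decay to 0 at step 10000
```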
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Compression
lzma
level: 9
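The final artifact is LZMA-compressed at the listed level 9. A sketch using the standard-library module; the payload is a stand-in byte string, not the actual packed weights:

```python
import lzma

# Compress packed ternary weight bytes with LZMA at preset 9.
payload = bytes(range(243)) * 64          # stand-in for packed trit bytes
blob = lzma.compress(payload, preset=9)
restored = lzma.decompress(blob)
```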
Novel Contributions
- BitNet b1.58-style ternary QAT with absmean scaling
- Base-3 ternary packing with 5 trits per byte
- Trinity-inspired ternary roundtrip compression pipeline
- 10-layer Transformer with GQA, ReLU², Partial RoPE, and U-Net skip connections
- NeoMuon optimizer variant with fewer Newton-Schulz steps
- EMA training and Z-loss regularization
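The base-3 packing contribution exploits 3**5 = 243 <= 256, so 5 trits fit in one byte (1.6 bits per trit versus 2 bits for naive 2-bit packing). A sketch of the round trip, with a little-endian digit order chosen for illustration:

```python
import numpy as np

def pack_trits(trits):
    """Pack ternary values {-1, 0, +1} in base 3, 5 trits per byte."""
    t = (np.asarray(trits) + 1).astype(np.uint8)        # map to {0, 1, 2}
    pad = (-len(t)) % 5
    t = np.concatenate([t, np.zeros(pad, np.uint8)]).reshape(-1, 5)
    powers = 3 ** np.arange(5)                          # little-endian digits
    packed = (t * powers).sum(axis=1).astype(np.uint8)  # max value 242 < 256
    return packed.tobytes(), len(trits)

def unpack_trits(data, n):
    """Invert pack_trits: recover the first n trits from the byte stream."""
    b = np.frombuffer(data, np.uint8).astype(np.int64)[:, None]
    digits = (b // 3 ** np.arange(5)) % 3
    return digits.reshape(-1)[:n].astype(np.int8) - 1

trits = np.array([-1, 0, 1, 1, -1, 0, 1], dtype=np.int8)
packed, n = pack_trits(trits)       # 7 trits -> 2 bytes
```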