PR #1246

Status: open

Record: Trinity Ternary GPT — val_bpb 0.9650 (ternary roundtrip)

by deborahnelson8788726
val_bpb: 0.9650
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.2 MB

Training Techniques

Quantization
QAT
bits: null
scope: all large weight matrices
ternary
bits: null
scope: all large weight matrices
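
The ternary QAT entries above can be sketched as BitNet b1.58-style absmean quantization (as named under Novel Contributions). This is a minimal sketch, not the PR's actual training code; in real QAT it would sit behind a straight-through estimator so gradients flow to the latent full-precision weights.

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Absmean ternary quantization (BitNet b1.58 style).

    Scale by the mean absolute weight, then round each entry to the
    nearest value in {-1, 0, +1}. Dequantize as q * scale.
    """
    scale = np.abs(w).mean() + eps
    q = np.clip(np.round(w / scale), -1, 1)
    return q, scale
```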
Architecture
GQA
Uses grouped query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
MLP
Uses 4x MLP expansion with relu² activation.
parameters: {"expansion":4}
ReLU²
Squared ReLU activation in the MLP.
parameters: null
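
The two entries above combine into a single feed-forward block: 4x expansion followed by squared ReLU. A minimal sketch (weight shapes are assumptions; biases omitted for brevity):

```python
import numpy as np

def mlp_block(x, w_in, w_out):
    """MLP with 4x expansion and relu^2 activation:
    d_model -> 4*d_model -> d_model.
    """
    h = np.maximum(x @ w_in, 0.0) ** 2   # squared ReLU
    return h @ w_out
```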
Partial RoPE
Applies rotary position embeddings to only part of the head dimension.
parameters: {"dimensions":16,"total_dimensions":96}
U-Net skip connections
Adds learned skip connections between layers.
parameters: null
Optimizer
Muon
weight_decay: 0
momentum: null
other_params: {"neoMuon":true,"newton_schulz_steps":3}
Weight Averaging
EMA
parameters: {"decay":0.997,"start_step":500}
Regularization
weight decay
parameters: {"value":0}
logit softcap
parameters: {"z_loss":0.0001}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Compression
lzma
level: 9
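
The final artifact is LZMA-compressed at the maximum preset. A minimal sketch of the roundtrip using Python's standard lzma module (the payload here is a placeholder, not the real artifact):

```python
import lzma

def compress_artifact(payload: bytes) -> bytes:
    """Compress the packed ternary payload with LZMA at preset 9."""
    return lzma.compress(payload, preset=9)

def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```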

Novel Contributions

  • BitNet b1.58-style ternary QAT with absmean scaling
  • Base-3 ternary packing with 5 trits per byte
  • Trinity-inspired ternary roundtrip compression pipeline
  • 10-layer Transformer with GQA, ReLU², Partial RoPE, and U-Net skip connections
  • NeoMuon optimizer variant with fewer Newton-Schulz steps
  • EMA training and Z-loss regularization
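
The base-3 packing contribution rests on 3**5 = 243 <= 256, so five trits fit in one byte. A minimal sketch of the pack/unpack roundtrip (function names are illustrative, not the PR's):

```python
def pack_trits(trits):
    """Pack ternary values {-1, 0, 1} into bytes, 5 trits per byte.

    Each trit is shifted to {0, 1, 2} and the 5-trit chunk is encoded
    as a base-3 number, with the first trit in the least-significant slot.
    """
    out = bytearray()
    for i in range(0, len(trits), 5):
        byte = 0
        for t in reversed(trits[i:i + 5]):
            byte = byte * 3 + (t + 1)
        out.append(byte)
    return bytes(out)

def unpack_trits(data, n):
    """Recover the first n trits from a base-3 packed byte string."""
    trits = []
    for byte in data:
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:n]
```

In the full pipeline described above, the packed bytes would then be LZMA-compressed and the roundtrip verified against the original ternary weights.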