PR #1246

Status: open

Record: Trinity Ternary GPT — val_bpb 0.9650 (ternary roundtrip)

by deborahnelson8788726
val_bpb: 0.9650
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.2 MB

Training Techniques

Quantization
QAT
bits: null
scope: all large weight matrices
ternary
bits: null
scope: all large weight matrices
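
The ternary QAT entries above can be sketched as BitNet b1.58-style absmean quantization (as named under Novel Contributions). This is a minimal sketch, not the PR's actual training code; in real QAT it would sit behind a straight-through estimator so gradients flow to the latent full-precision weights.

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Absmean ternary quantization (BitNet b1.58 style).

    Scale by the mean absolute weight, then round each entry to the
    nearest value in {-1, 0, +1}. Dequantize as q * scale.
    """
    scale = np.abs(w).mean() + eps
    q = np.clip(np.round(w / scale), -1, 1)
    return q, scale
```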
Architecture
GQA
Uses grouped query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
MLP
Uses 4x MLP expansion with relu² activation.
parameters: {"expansion":4}
ReLU²
Squared ReLU activation in the MLP.
parameters: null
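
The two entries above combine into a single feed-forward block: 4x expansion followed by squared ReLU. A minimal sketch (weight shapes are assumptions; biases omitted for brevity):

```python
import numpy as np

def mlp_block(x, w_in, w_out):
    """MLP with 4x expansion and relu^2 activation:
    d_model -> 4*d_model -> d_model.
    """
    h = np.maximum(x @ w_in, 0.0) ** 2   # squared ReLU
    return h @ w_out
```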
Partial RoPE
Applies rotary position embeddings to only part of the head dimension.
parameters: {"dimensions":16,"total_dimensions":96}
U-Net skip connections
Adds learned skip connections between layers.
parameters: null
Optimizer
Muon
weight_decay: 0
momentum: null
other_params: {"neoMuon":true,"newton_schulz_steps":3}
Weight Averaging
EMA
parameters: {"decay":0.997,"start_step":500}
Regularization
weight decay
parameters: {"value":0}
logit softcap
parameters: {"z_loss":0.0001}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Compression
lzma
level: 9
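
The final artifact is LZMA-compressed at the maximum preset. A minimal sketch of the roundtrip using Python's standard lzma module (the payload here is a placeholder, not the real artifact):

```python
import lzma

def compress_artifact(payload: bytes) -> bytes:
    """Compress the packed ternary payload with LZMA at preset 9."""
    return lzma.compress(payload, preset=9)

def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```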

Novel Contributions

  • BitNet b1.58-style ternary QAT with absmean scaling
  • Base-3 ternary packing with 5 trits per byte
  • Trinity-inspired ternary roundtrip compression pipeline
  • 10-layer Transformer with GQA, ReLU², Partial RoPE, and U-Net skip connections
  • NeoMuon optimizer variant with fewer Newton-Schulz steps
  • EMA training and Z-loss regularization
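
The base-3 packing contribution rests on 3**5 = 243 <= 256, so five trits fit in one byte. A minimal sketch of the pack/unpack roundtrip (function names are illustrative, not the PR's):

```python
def pack_trits(trits):
    """Pack ternary values {-1, 0, 1} into bytes, 5 trits per byte.

    Each trit is shifted to {0, 1, 2} and the 5-trit chunk is encoded
    as a base-3 number, with the first trit in the least-significant slot.
    """
    out = bytearray()
    for i in range(0, len(trits), 5):
        byte = 0
        for t in reversed(trits[i:i + 5]):
            byte = byte * 3 + (t + 1)
        out.append(byte)
    return bytes(out)

def unpack_trits(data, n):
    """Recover the first n trits from a base-3 packed byte string."""
    trits = []
    for byte in data:
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:n]
```

In the full pipeline described above, the packed bytes would then be LZMA-compressed and the roundtrip verified against the original ternary weights.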