PR #1273 (open)

Non-record: Annealed Muon 1.58-bit Ternary — val_bpb 1.2196 (8xH100 SXM)

by DushyantChetiwal
val_bpb: 1.2196
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.86 MB

Training Techniques

Quantization: QAT (bits: 1.58, scope: all)
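The PR doesn't inline its quantizer, so here is a minimal sketch of 1.58-bit (ternary) fake quantization with per-tensor absmean scaling, in the style of BitNet b1.58. The function name and scaling choice are assumptions, and the straight-through estimator that QAT needs is only noted in the docstring:

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Quantize a weight tensor to {-1, 0, +1} with a per-tensor absmean scale.

    Returns the dequantized (fake-quant) weights used in the forward pass
    plus the raw ternary codes. In QAT the rounding step would be wrapped
    in a straight-through estimator so gradients flow to the float weights.
    """
    scale = max(float(np.abs(w).mean()), eps)
    codes = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return codes.astype(w.dtype) * scale, codes
```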
Optimizer: Muon (weight_decay: null, momentum: null, other_params: {"ns_steps": 5})
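For context on other_params: {"ns_steps": 5}: Muon approximately orthogonalizes each 2-D update matrix with a quintic Newton-Schulz iteration, and ns_steps is the iteration count. A sketch using the coefficients from the public reference implementation; momentum accumulation and per-layer parameter grouping are omitted:

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    # Quintic Newton-Schulz iteration with the coefficients used in the
    # reference Muon implementation; it drives singular values toward 1
    # approximately (they oscillate near 1), not to exact orthogonality.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize so the iteration converges
    transpose = x.shape[0] > x.shape[1]
    if transpose:  # work with the smaller Gram matrix
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transpose else x
```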
Architecture:
  U-Net skip connections: learned skip connections added to the model (parameters: null)
  XSA: cross-sequence attention mechanism (parameters: null)
  BigramHash: bigram hash embedding for token representation (parameters: {"buckets": 2048, "dim": 128})
  SmearGate: gating mechanism used in the architecture (parameters: null)
  ReLU²: MLP uses squared-ReLU activation (parameters: {"mlp_multiplier": 4})
  KV head count: 8 key-value heads (parameters: {"kv_heads": 8})
  RoPE: full rotary positional embeddings (parameters: {"base": 10000})
LR Schedule: hold-cosine (parameters: {"hold": 0.7, "min_lr": 0.01})
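A plausible reading of the hold-cosine schedule, assuming min_lr: 0.01 is a fraction of the base learning rate rather than an absolute value (the PR doesn't say which):

```python
import math

def hold_cosine_lr(step: int, total_steps: int, base_lr: float,
                   hold: float = 0.7, min_lr_frac: float = 0.01) -> float:
    # Hold base_lr for the first `hold` fraction of training, then
    # cosine-decay to min_lr_frac * base_lr over the remaining steps.
    hold_steps = int(hold * total_steps)
    if step < hold_steps:
        return base_lr
    t = (step - hold_steps) / max(1, total_steps - hold_steps)
    floor = min_lr_frac * base_lr
    return floor + 0.5 * (base_lr - floor) * (1 + math.cos(math.pi * t))
```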
Compression: zstd (level: 22)
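The 14.86 MB artifact size is consistent with the base-3 packing listed under Novel Contributions: since 3^5 = 243 <= 256, five ternary values fit in one byte (1.6 bits per weight) before zstd is applied. A sketch of such a packer; this is illustrative, not the PR's code:

```python
def pack_ternary(trits):
    # Map {-1, 0, +1} -> {0, 1, 2} and pack five base-3 digits per byte.
    out = bytearray()
    for i in range(0, len(trits), 5):
        byte = 0
        for t in reversed(trits[i:i + 5]):
            byte = byte * 3 + (t + 1)
        out.append(byte)
    return bytes(out)

def unpack_ternary(data, n):
    # Invert the packing: peel off base-3 digits, map {0,1,2} -> {-1,0,+1}.
    trits = []
    for byte in data:
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:n]
```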
Sequence Length (train_length: null, eval_length: null)

Novel Contributions

  • Training-time ternary quantization with annealed hardening via phi-exponent schedule
  • Muon optimizer applied to ternary QAT
  • Base-3 packing of ternary weights at 5 values per byte
  • Use of U-Net skip connections, XSA, BigramHash, and SmearGate in the model
  • Hold-cosine learning rate schedule tuned for ternary training
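The PR text doesn't spell out the phi-exponent annealing schedule. One purely illustrative reading, with every name here hypothetical, is a soft-to-hard blend whose interpolation weight grows as (step/total)^phi with phi the golden ratio, so the forward pass is fully ternary by the end of training:

```python
def anneal_weight(step: int, total_steps: int,
                  phi: float = 1.618033988749895) -> float:
    # Hypothetical "phi-exponent schedule": blend factor alpha rises from
    # 0 to 1 as (step / total_steps) ** phi, i.e. slowly early, fast late.
    return min(1.0, (step / total_steps) ** phi)

def annealed_ternary(w_float, w_ternary, step, total_steps):
    # Forward-pass weights interpolate between the float "shadow" weights
    # and their ternary quantization; at alpha = 1 the model is fully hard.
    a = anneal_weight(step, total_steps)
    return (1 - a) * w_float + a * w_ternary
```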