PR #1273 (open)

Non-record: Annealed Muon 1.58-bit Ternary — val_bpb 1.2196 (8xH100 SXM)

by DushyantChetiwal
val_bpb: 1.2196
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.86 MB

Training Techniques

Quantization: QAT (bits: 1.58, scope: all)
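The PR doesn't inline its quantizer, so here is a minimal sketch of 1.58-bit (ternary) fake quantization with per-tensor absmean scaling, in the style of BitNet b1.58. The function name and scaling choice are assumptions, and the straight-through estimator that QAT needs is only noted in the docstring:

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Quantize a weight tensor to {-1, 0, +1} with a per-tensor absmean scale.

    Returns the dequantized (fake-quant) weights used in the forward pass
    plus the raw ternary codes. In QAT the rounding step would be wrapped
    in a straight-through estimator so gradients flow to the float weights.
    """
    scale = max(float(np.abs(w).mean()), eps)
    codes = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return codes.astype(w.dtype) * scale, codes
```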
Optimizer: Muon (weight_decay: null, momentum: null, other_params: {"ns_steps": 5})
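For context on other_params: {"ns_steps": 5}: Muon approximately orthogonalizes each 2-D update matrix with a quintic Newton-Schulz iteration, and ns_steps is the iteration count. A sketch using the coefficients from the public reference implementation; momentum accumulation and per-layer parameter grouping are omitted:

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    # Quintic Newton-Schulz iteration with the coefficients used in the
    # reference Muon implementation; it drives singular values toward 1
    # approximately (they oscillate near 1), not to exact orthogonality.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize so the iteration converges
    transpose = x.shape[0] > x.shape[1]
    if transpose:  # work with the smaller Gram matrix
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transpose else x
```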
Architecture:
  U-Net skip connections: learned skip connections added to the model (parameters: null)
  XSA: cross-sequence attention mechanism (parameters: null)
  BigramHash: bigram hash embedding for token representation (parameters: {"buckets": 2048, "dim": 128})
  SmearGate: gating mechanism used in the architecture (parameters: null)
  ReLU²: MLP uses squared-ReLU activation (parameters: {"mlp_multiplier": 4})
  KV head count: 8 key-value heads (parameters: {"kv_heads": 8})
  RoPE: full rotary positional embeddings (parameters: {"base": 10000})
LR Schedule: hold-cosine (parameters: {"hold": 0.7, "min_lr": 0.01})
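A plausible reading of the hold-cosine schedule, assuming min_lr: 0.01 is a fraction of the base learning rate rather than an absolute value (the PR doesn't say which):

```python
import math

def hold_cosine_lr(step: int, total_steps: int, base_lr: float,
                   hold: float = 0.7, min_lr_frac: float = 0.01) -> float:
    # Hold base_lr for the first `hold` fraction of training, then
    # cosine-decay to min_lr_frac * base_lr over the remaining steps.
    hold_steps = int(hold * total_steps)
    if step < hold_steps:
        return base_lr
    t = (step - hold_steps) / max(1, total_steps - hold_steps)
    floor = min_lr_frac * base_lr
    return floor + 0.5 * (base_lr - floor) * (1 + math.cos(math.pi * t))
```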
Compression: zstd (level: 22)
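The 14.86 MB artifact size is consistent with the base-3 packing listed under Novel Contributions: since 3^5 = 243 <= 256, five ternary values fit in one byte (1.6 bits per weight) before zstd is applied. A sketch of such a packer; this is illustrative, not the PR's code:

```python
def pack_ternary(trits):
    # Map {-1, 0, +1} -> {0, 1, 2} and pack five base-3 digits per byte.
    out = bytearray()
    for i in range(0, len(trits), 5):
        byte = 0
        for t in reversed(trits[i:i + 5]):
            byte = byte * 3 + (t + 1)
        out.append(byte)
    return bytes(out)

def unpack_ternary(data, n):
    # Invert the packing: peel off base-3 digits, map {0,1,2} -> {-1,0,+1}.
    trits = []
    for byte in data:
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:n]
```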
Sequence Length (train_length: null, eval_length: null)

Novel Contributions

  • Training-time ternary quantization with annealed hardening via phi-exponent schedule
  • Muon optimizer applied to ternary QAT
  • Base-3 packing of ternary weights at 5 values per byte
  • Use of U-Net skip connections, XSA, BigramHash, and SmearGate in the model
  • Hold-cosine learning rate schedule tuned for ternary training
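The PR text doesn't spell out the phi-exponent annealing schedule. One purely illustrative reading, with every name here hypothetical, is a soft-to-hard blend whose interpolation weight grows as (step/total)^phi with phi the golden ratio, so the forward pass is fully ternary by the end of training:

```python
def anneal_weight(step: int, total_steps: int,
                  phi: float = 1.618033988749895) -> float:
    # Hypothetical "phi-exponent schedule": blend factor alpha rises from
    # 0 to 1 as (step / total_steps) ** phi, i.e. slowly early, fast late.
    return min(1.0, (step / total_steps) ** phi)

def annealed_ternary(w_float, w_ternary, step, total_steps):
    # Forward-pass weights interpolate between the float "shadow" weights
    # and their ternary quantization; at alpha = 1 the model is fully hard.
    a = anneal_weight(step, total_steps)
    return (1 - a) * w_float + a * w_ternary
```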