PR #997

open

Non-record: 24.7M params · int6 · Binary U-Net/SmearGate/BigramHash · 1.5hr · RTX 5060 Ti 16GB

by randy06122001-boop (View on GitHub)
val_bpb
1.4182
Architecture
Transformer
Optimizer
Muon
Artifact Size
11.63MB

Training Techniques

Quantization
int6
bits: 6
scope: block weights
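A minimal sketch of what symmetric 6-bit quantization of the block weights could look like; the per-tensor scale and rounding scheme here are assumptions, not the PR's exact implementation:

```python
import numpy as np

def quantize_int6(w):
    # Symmetric int6: map the tensor onto the signed 6-bit range [-31, 31]
    # with a single per-tensor scale (the actual grouping/scope used for
    # the block weights is an assumption).
    max_abs = np.abs(w).max()
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale

q, s = quantize_int6(np.array([0.5, -1.0, 0.25, 0.0], dtype=np.float32))
w_hat = dequantize_int6(q, s)
```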
Architecture
U-Net skip connections
10-layer U-Net style transformer with 5 encoder and 5 decoder blocks
parameters: {"layers":10,"encoder_blocks":5,"decoder_blocks":5}
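The 5-encoder / 5-decoder skip wiring can be sketched as a stack of saved activations consumed in reverse; how each skip is merged (plain add vs. a learned mix) is an assumption here:

```python
def unet_forward(x, encoder_blocks, decoder_blocks):
    # 5 encoder blocks push activations onto a stack; the 5 decoder blocks
    # consume them last-in-first-out as additive skip connections.
    skips = []
    for block in encoder_blocks:
        x = block(x)
        skips.append(x)
    for block in decoder_blocks:
        x = block(x + skips.pop())
    return x

# Toy check with scalar "blocks" that just add 1:
enc = [lambda v: v + 1] * 5
dec = [lambda v: v + 1] * 5
out = unet_forward(0, enc, dec)
```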
SmearGate
Causal blending of token embeddings with previous context
parameters: null
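Since no parameters are listed, here is one plausible reading of "causal blending with previous context": each position's embedding is mixed with the previous position's. The scalar gate stands in for whatever learned gate the PR uses:

```python
import numpy as np

def smear_gate(emb, gate):
    # Blend each position's embedding with the previous position's; the
    # shift only looks backward, so it is causal. A learned per-dimension
    # gate is replaced by a scalar here (an assumption; parameters: null).
    prev = np.roll(emb, 1, axis=0)
    prev[0] = 0.0                      # position 0 has no previous token
    return emb + gate * prev

blended = smear_gate(np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]), 0.5)
```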
BigramHash
4096-bucket hash embedding for consecutive token pairs
parameters: {"buckets":4096}
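A hedged sketch of a 4096-bucket bigram hash embedding: each (previous, current) token pair is hashed to a bucket that indexes an auxiliary table. The mixing constant is illustrative, not the PR's:

```python
import numpy as np

def bigram_bucket_ids(tokens, buckets=4096):
    # Hash each (previous, current) token pair into one of 4096 buckets;
    # the hash function used here is an assumption.
    ids, prev = [], 0
    for t in tokens:
        ids.append(((prev * 1000003) ^ t) % buckets)
        prev = t
    return ids

bigram_table = np.zeros((4096, 8))     # (buckets, model_dim) toy table
ids = bigram_bucket_ids([5, 17, 17, 900])
aux = bigram_table[ids]                # (seq_len, model_dim) lookup,
                                       # added to the token embeddings
```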
MLP3x
3x MLP expansion with ReLU² activation
parameters: {"hidden":1536}
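The 3x expansion with ReLU² (squared ReLU) reduces to a two-matmul MLP; biases are omitted here as an assumption:

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    # d_model -> 3*d_model -> d_model MLP (512 -> 1536 -> 512 in the PR)
    # with squared-ReLU activation: relu(h)**2.
    h = np.maximum(x @ w_in, 0.0) ** 2
    return h @ w_out

# Tiny 2-D check with identity weights to expose the activation:
y = mlp3x(np.array([[-1.0, 2.0]]), np.eye(2), np.eye(2))
```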
GQA
Grouped query attention with 4 KV heads
parameters: {"heads":8,"kv_heads":4,"dimension":512}
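With 8 query heads over 4 KV heads, each KV head serves 2 query heads, halving the KV cache. A minimal non-causal sketch (masking omitted for brevity):

```python
import numpy as np

def grouped_query_attention(q, k, v):
    # q: (T, 8, d); k, v: (T, 4, d). KV heads are repeated so each serves
    # 8 // 4 = 2 query heads. Causal masking is omitted for brevity.
    group = q.shape[1] // k.shape[1]
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum('hts,shd->thd', w, v)

rng = np.random.default_rng(0)
out = grouped_query_attention(rng.standard_normal((3, 8, 4)),
                              rng.standard_normal((3, 4, 4)),
                              rng.standard_normal((3, 4, 4)))
```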
weight tying
Tied input/output embedding over a 1024-token vocabulary
parameters: {"vocab_size":1024}
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"newton_schulz_steps":5}
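Muon's 5 Newton-Schulz steps orthogonalize the update matrix. A simplified sketch using the basic cubic iteration; Muon's reference implementation uses a tuned quintic polynomial instead:

```python
import numpy as np

def newton_schulz(g, steps=5):
    # Push the (normalized) matrix toward the nearest orthogonal matrix
    # via the cubic iteration X <- 1.5 X - 0.5 X X^T X. This is a
    # simplification of Muon's quintic variant.
    x = g / (np.linalg.norm(g) + 1e-7)   # keep singular values <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x

orth = newton_schulz(np.eye(3))
```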
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"scalar parameters and embeddings"}
Weight Averaging
SWA
parameters: {"checkpoints_averaged":20,"phase":"warmdown"}
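The SWA step reduces to a uniform average of parameter tensors across checkpoints; here the PR averages 20 collected during the learning-rate warmdown:

```python
def swa_average(checkpoints):
    # Uniform average of parameter dicts (name -> tensor/scalar) across
    # the saved checkpoints; uniform weighting is the standard SWA choice.
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}

avg = swa_average([{"w": 1.0}, {"w": 3.0}])
```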
Initialization
OrthoInit
Orthogonal initialization for all matrix weights
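Orthogonal initialization of a matrix weight is typically done via QR of a Gaussian draw; any gain/scale factor the PR applies on top is not shown here:

```python
import numpy as np

def orthogonal_init(shape, seed=0):
    # QR of a Gaussian matrix yields an orthonormal factor; flipping
    # column signs by sign(diag(R)) makes the draw uniform over
    # orthogonal matrices.
    a = np.random.default_rng(seed).standard_normal(shape)
    q, r = np.linalg.qr(a)
    return q * np.sign(np.diag(r))

w = orthogonal_init((4, 4))
```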
Compression
zstd
level: 22

Novel Contributions

  • Int6 quantization for block weights
  • Binary U-Net style transformer with 10 layers
  • SmearGate causal embedding blending
  • BigramHash token-pair hash embeddings
  • Muon optimization with SWA
  • ReLU² MLP expansion
  • Tied embeddings with GQA