PR #997
Non-record: 24.7M params · int6 · Binary U-Net/SmearGate/BigramHash · 1.5hr · RTX 5060 Ti 16GB
by randy06122001-boop · View on GitHub
val_bpb
1.4182
Architecture
Transformer
Optimizer
Muon
Artifact Size
11.63MB
Training Techniques
Quantization
int6
bits: 6
scope: block weights
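The card records only the bit width (6) and scope (block weights), not the quantization method. A minimal sketch, assuming symmetric round-to-nearest quantization with a max-abs per-tensor scale (the exact scheme used in the PR is unknown):

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric round-to-nearest int6 quantization (values in [-31, 31])."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```

With this scheme the worst-case reconstruction error per weight is half a quantization step (scale / 2).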
Architecture
U-Net skip connections
10-layer U-Net-style transformer with 5 encoder and 5 decoder blocks
parameters: {"layers":10,"encoder_blocks":5,"decoder_blocks":5}
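The encoder/decoder split above can be sketched as a stack of saved activations: each encoder block's output is pushed, and each decoder block pops the matching one and adds it back (last-in, first-out). The block internals here are a stand-in (one tanh layer with a residual), not the PR's actual transformer block:

```python
import numpy as np

def block(x, w):
    # stand-in for one transformer block: residual + pointwise nonlinearity
    return x + np.tanh(x @ w)

def unet_forward(x, enc_weights, dec_weights):
    """5 encoder blocks save their outputs; 5 decoder blocks pop and add
    them back, pairing encoder block i with decoder block (n - 1 - i)."""
    skips = []
    for w in enc_weights:
        x = block(x, w)
        skips.append(x)
    for w in dec_weights:
        x = x + skips.pop()  # U-Net skip connection
        x = block(x, w)
    return x
```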
SmearGate
Causal blending of token embeddings with previous context
parameters: null
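The PR lists no parameters for SmearGate, so its exact form is unknown. One plausible reading of "causal blending with previous context" is a learned per-channel gate that mixes in the previous token's embedding; everything below is that hypothetical form:

```python
import numpy as np

def smear_gate(emb: np.ndarray, gate_logits: np.ndarray) -> np.ndarray:
    """Blend each token embedding with the one before it (causal smear).
    emb: (T, D) embeddings; gate_logits: (D,) learned gate (hypothetical form).
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))  # sigmoid gate in (0, 1)
    prev = np.roll(emb, 1, axis=0)
    prev[0] = 0.0                           # no context before the first token
    return emb + g * prev
```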
BigramHash
4096-bucket hash embedding for consecutive token pairs
parameters: {"buckets":4096}
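A hash embedding maps each consecutive token pair to one of 4096 buckets and looks up a learned vector, giving cheap bigram features without a full 1024 × 1024 pair table. The hash function below is an illustrative choice; the PR does not specify one:

```python
import numpy as np

BUCKETS = 4096

def bigram_bucket(prev_tok: int, tok: int, buckets: int = BUCKETS) -> int:
    # simple multiplicative hash of the (previous, current) token pair
    return (prev_tok * 1_000_003 + tok) % buckets

def bigram_features(tokens, table: np.ndarray) -> np.ndarray:
    """table: (4096, D) learned hash-embedding table -> (T, D) features."""
    out = np.zeros((len(tokens), table.shape[1]), dtype=table.dtype)
    for t in range(1, len(tokens)):
        out[t] = table[bigram_bucket(tokens[t - 1], tokens[t])]
    return out
```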
MLP3x
3x MLP expansion with ReLU² activation
parameters: {"hidden":1536}
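With the 512-dimensional model, the 3x expansion gives the 1536 hidden units listed above; ReLU² (squared ReLU) is the stated activation. A minimal sketch in the standard two-matrix MLP shape:

```python
import numpy as np

def relu2(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0) ** 2  # ReLU-squared activation

def mlp3x(x, w_in, w_out):
    """x: (T, 512) -> hidden (T, 1536) -> (T, 512)."""
    return relu2(x @ w_in) @ w_out
```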
GQA
Grouped query attention with 4 KV heads
parameters: {"heads":8,"kv_heads":4,"dimension":512}
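With 8 query heads and 4 KV heads at dimension 512, the head dimension is 64 and each KV head is shared by 2 query heads, halving the KV cache. A minimal causal GQA forward pass (single sequence, no batching; a sketch, not the PR's implementation):

```python
import numpy as np

def gqa(x, wq, wk, wv, n_heads=8, n_kv=4):
    """Grouped-query attention: 8 query heads share 4 KV heads (2:1)."""
    T, D = x.shape
    hd = D // n_heads                       # 512 / 8 = 64 per head
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv, hd)       # wk: (512, 256)
    v = (x @ wv).reshape(T, n_kv, hd)
    group = n_heads // n_kv
    mask = np.tril(np.ones((T, T), dtype=bool))
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                     # map query head -> shared KV head
        s = q[:, h] @ k[:, kv].T / np.sqrt(hd)
        s = np.where(mask, s, -np.inf)      # causal mask
        a = np.exp(s - s.max(axis=-1, keepdims=True))
        a /= a.sum(axis=-1, keepdims=True)
        out[:, h] = a @ v[:, kv]
    return out.reshape(T, D)
```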
weight tying
Tied input/output embedding over a 1024-token vocabulary
parameters: {"vocab_size":1024}
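Weight tying reuses the 1024 × 512 token-embedding matrix as the output projection, saving about 0.5M parameters at this scale:

```python
import numpy as np

VOCAB, DIM = 1024, 512
rng = np.random.default_rng(0)
emb = (rng.standard_normal((VOCAB, DIM)) * 0.02).astype(np.float32)

def embed(tokens):
    return emb[tokens]        # input embedding lookup

def logits(h):
    return h @ emb.T          # same matrix reused as the output head
```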
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"newton_schulz_steps":5}
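Muon orthogonalizes each momentum/update matrix with a few Newton-Schulz iterations; the 5 steps listed above match the reference implementation's default. A minimal NumPy sketch using the reference quintic coefficients (not the PR's code):

```python
import numpy as np

def newton_schulz(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Quintic Newton-Schulz iteration that approximately orthogonalizes G,
    driving its singular values toward 1 (coefficients from the Muon write-up)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so the iteration converges
    tall = X.shape[0] > X.shape[1]
    if tall:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if tall else X
```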
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"scalar parameters and embeddings"}
Weight Averaging
SWA
parameters: {"checkpoints_averaged":20,"phase":"warmdown"}
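SWA here averages the weights of 20 checkpoints taken during the warmdown phase, and the averaged model is what gets evaluated. The averaging itself is just a uniform mean over each parameter tensor:

```python
import numpy as np

def swa_average(checkpoints):
    """Uniform average over a list of parameter dicts (name -> array)."""
    n = len(checkpoints)
    return {name: sum(ck[name] for ck in checkpoints) / n
            for name in checkpoints[0]}
```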
Initialization
OrthoInit
Orthogonal initialization for all matrix weights
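Orthogonal initialization typically draws a Gaussian matrix and keeps the Q factor of its QR decomposition, so the rows (or columns) start exactly orthonormal. A sketch of the standard recipe (the PR does not show its exact routine):

```python
import numpy as np

def orthogonal_init(shape, rng):
    """Orthogonal init via QR of a Gaussian matrix (semi-orthogonal
    when the matrix is not square)."""
    rows, cols = shape
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # fix QR sign ambiguity for uniformity
    return q if rows >= cols else q.T
```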
Compression
zstd
level: 22
Novel Contributions
- Int6 quantization for block weights
- Binary U-Net style transformer with 10 layers
- SmearGate causal embedding blending
- BigramHash token-pair hash embeddings
- Muon optimization with SWA
- ReLU² MLP expansion
- Tied embeddings with GQA