PR #76 (open)
12L Int5-MLP + SmearGate + BigramHash + SWA (val_bpb 1.1433)
by unixmadtoonslab
val_bpb
1.1433
Architecture
Transformer
Optimizer
Muon
Artifact Size
16MB
Training Techniques
Quantization
mixed int5/int6
bits: 5 (MLP), 6 (attention)
scope: MLP weights quantized to int5 per row, attention weights to int6 per row; embeddings kept in fp16 (passthrough)
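A minimal sketch of the per-row scheme described above: each weight row gets its own fp scale, with MLP rows rounded to 5 bits and attention rows to 6. The symmetric-rounding details and the helper names (`quantize_per_row`, `dequantize`) are assumptions, not the PR's actual code.

```python
import numpy as np

def quantize_per_row(w, bits):
    # Symmetric per-row quantization: one floating-point scale per row.
    qmax = 2 ** (bits - 1) - 1            # 15 for int5, 31 for int6
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)   # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w_mlp = rng.standard_normal((4, 64)).astype(np.float32)
q5, s5 = quantize_per_row(w_mlp, bits=5)   # MLP weights -> int5
q6, s6 = quantize_per_row(w_mlp, bits=6)   # attention weights -> int6
```

Per-row scales keep the worst-case rounding error bounded by half a quantization step in each row, which is what lets the lower bit widths preserve accuracy.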
Architecture
SmearGate
Per-dimension sigmoid gate that blends each token's embedding with the previous token's embedding.
parameters: null
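A sketch of the gate as described: one learned logit per embedding dimension, passed through a sigmoid and used to mix each token with its predecessor. The gate's direction (high gate keeps the current token) and the zero-padding at position 0 are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, gate_logits):
    # x: (seq, dim) token embeddings; gate_logits: (dim,) learned parameters.
    # g is per-dimension in (0, 1); position 0 has no predecessor, so it
    # blends with zeros.
    g = sigmoid(gate_logits)
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0
    return g * x + (1.0 - g) * prev

x = np.arange(12, dtype=np.float32).reshape(3, 4)
y_open = smear_gate(x, gate_logits=np.full(4, 20.0))  # gate ~1: passthrough
y_half = smear_gate(x, gate_logits=np.zeros(4))       # gate 0.5: average
```

With large positive logits the gate passes embeddings through unchanged, so the model can learn to disable the smear per dimension.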
BigramHash
Hash embedding for token-pair context.
parameters: {"buckets":2048,"dim":96}
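The idea, using the listed parameters (2048 buckets, dim 96): hash each (previous, current) token pair into a small embedding table, giving the model cheap bigram context. The specific hash function below is an assumption for illustration.

```python
import numpy as np

BUCKETS, DIM = 2048, 96   # parameters from the PR

def bigram_bucket(prev_tok, tok, buckets=BUCKETS):
    # Multiplicative hash of the (previous, current) token pair;
    # the PR's exact hash is unknown.
    return (prev_tok * 1000003 + tok) % buckets

rng = np.random.default_rng(0)
table = (0.02 * rng.standard_normal((BUCKETS, DIM))).astype(np.float32)

def bigram_embed(tokens):
    # tokens: (seq,) ints -> (seq, DIM) pair-context embeddings,
    # presumably added to the usual token embeddings downstream.
    tokens = np.asarray(tokens)
    prev = np.concatenate(([0], tokens[:-1]))
    return table[bigram_bucket(prev, tokens)]

emb = bigram_embed([5, 17, 17, 42])
```

Hashing trades collisions for size: 2048 × 96 fp16 entries is well under 1 MB, a small cost inside a 16 MB artifact.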
U-Net skip connections
Encoder-decoder split with learned per-dimension skip weights.
parameters: null
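A sketch of the skip wiring described above: the first half of the blocks act as the encoder and save their activations, and each decoder block receives the mirrored encoder activation scaled by a learned per-dimension weight. The mirrored pairing and additive combination are assumptions about the PR's exact topology.

```python
import numpy as np

def unet_forward(x, blocks, skip_weights):
    # blocks: list of 2n layer callables (first n encode, last n decode).
    # skip_weights: (n, dim) learned per-dimension gains (shape assumed).
    # Encoder output i is scaled elementwise and added to the input of
    # decoder layer n-1-i.
    n = len(blocks) // 2
    saved = []
    for i in range(n):
        x = blocks[i](x)
        saved.append(x)
    for i in range(n):
        x = x + skip_weights[i] * saved[n - 1 - i]
        x = blocks[n + i](x)
    return x

dim, n = 8, 3
blocks = [lambda h: h for _ in range(2 * n)]   # identity stand-ins for layers
x = np.ones((5, dim))
y_zero = unet_forward(x, blocks, np.zeros((n, dim)))  # skips off: plain stack
y_one = unet_forward(x, blocks, np.ones((n, dim)))    # skips fully on
```

Initializing the skip weights at zero recovers a plain residual stack, so the skips can be learned without disturbing early training.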
MLP3x
Wider MLP expansion enabled by compression savings.
parameters: {"multiplier":3}
KV head count
Grouped-query attention with fewer KV heads than query heads.
parameters: {"num_heads":8,"num_kv_heads":4}
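With 8 query heads and 4 KV heads, each KV head serves 2 query heads, halving K/V projection and cache size. A minimal causal GQA forward pass under those parameters (head_dim of 16 is an illustrative assumption):

```python
import numpy as np

NUM_HEADS, NUM_KV_HEADS, HEAD_DIM = 8, 4, 16   # head counts from the PR

def gqa_attention(q, kv_k, kv_v):
    # q: (T, 8, d); kv_k, kv_v: (T, 4, d). Each KV head is broadcast to
    # num_heads // num_kv_heads = 2 query heads.
    rep = NUM_HEADS // NUM_KV_HEADS
    k_full = np.repeat(kv_k, rep, axis=1)
    v_full = np.repeat(kv_v, rep, axis=1)
    scores = np.einsum('thd,shd->hts', q, k_full) / np.sqrt(HEAD_DIM)
    T = q.shape[0]
    causal = np.triu(np.ones((T, T), dtype=bool), 1)   # mask future positions
    scores = np.where(causal[None, :, :], -1e30, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return np.einsum('hts,shd->thd', w, v_full)

rng = np.random.default_rng(0)
T = 6
q = rng.standard_normal((T, NUM_HEADS, HEAD_DIM))
kk = rng.standard_normal((T, NUM_KV_HEADS, HEAD_DIM))
vv = rng.standard_normal((T, NUM_KV_HEADS, HEAD_DIM))
out = gqa_attention(q, kk, vv)
```

Fewer KV heads also shrink the quantized K/V projection matrices, which compounds with the int6 attention quantization toward the 16 MB artifact budget.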
Optimizer
Muon
weight_decay: 0.04
momentum: 0.98
other_params: {"lr":0.025}
Weight Averaging
SWA
parameters: {"interval":50}
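A sketch of the averaging loop implied by `interval: 50`: every 50 steps the live weights are folded into a running uniform average, which becomes the final checkpoint (per the contributions list, this runs during warmdown). The class shape is an assumption.

```python
import numpy as np

class SWA:
    # Stochastic weight averaging: every `interval` optimizer steps,
    # fold the live weights into a running uniform mean.
    def __init__(self, interval=50):        # interval=50 per the PR
        self.interval = interval
        self.avg, self.n = None, 0

    def maybe_update(self, step, params):
        if step % self.interval != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.astype(np.float64) for k, v in params.items()}
        else:
            for k, v in params.items():
                self.avg[k] += (v - self.avg[k]) / self.n   # incremental mean

swa = SWA(interval=50)
for step in range(0, 101):
    params = {"w": np.full(3, float(step))}   # toy weights = step index
    swa.maybe_update(step, params)
```

The incremental-mean update avoids storing every checkpoint; only one extra copy of the weights is kept.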
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":256}
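The sliding-window evaluation can be sketched as follows: advance by `stride` tokens at a time, but score each new token with up to a full window of left context (window 1024 matching `eval_length` is an assumption; `nll_fn` is a hypothetical per-token loss callback).

```python
import numpy as np

def sliding_window_nll(tokens, nll_fn, window=1024, stride=256):
    # Advance by `stride`, rebuild the context back to `window` tokens,
    # and score only the `stride` new tokens each step. stride=256 per
    # the PR. nll_fn(chunk) returns per-token loss for the chunk.
    total, count, pos = 0.0, 0, 0
    while pos < len(tokens):
        lo = max(0, pos + stride - window)
        chunk = tokens[lo:pos + stride]
        n_new = min(stride, len(tokens) - pos)
        total += nll_fn(chunk)[-n_new:].sum()
        count += n_new
        pos += stride
    return total / count   # mean loss per token

toks = np.arange(3000)
mean_nll = sliding_window_nll(toks, lambda c: np.ones(len(c)))
```

Compared with scoring disjoint 1024-token chunks, the smaller stride gives most tokens near-full context at roughly 4x the evaluation compute.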
Initialization
OrthoInit
Orthogonal initialization with 1/sqrt(2*num_layers) output projection scaling.
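A sketch of this initializer for a square output projection: draw a Gaussian matrix, orthogonalize it via QR, and apply the stated 1/sqrt(2*num_layers) scale (which damps the residual-branch contribution of each of the 12 layers). The sign fix-up is the standard trick for a uniformly distributed orthogonal matrix.

```python
import numpy as np

def ortho_init(n, rng):
    # Orthogonal init via QR of a Gaussian matrix; fixing column signs
    # by the diagonal of R makes Q uniformly (Haar) distributed.
    a = rng.standard_normal((n, n))
    q, r = np.linalg.qr(a)
    return q * np.where(np.diag(r) >= 0, 1.0, -1.0)

num_layers = 12                       # 12-layer transformer per the PR
rng = np.random.default_rng(0)
w_out = ortho_init(64, rng) / np.sqrt(2 * num_layers)  # output-proj scaling
```

The 2*num_layers factor counts the two residual additions (attention and MLP) per layer, keeping the residual stream's variance roughly constant at init.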
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_iters":2000}
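A sketch of the schedule implied by `warmdown_iters: 2000`: hold the base LR, then decay over the final 2000 iterations. The linear decay shape is an assumption (warmdown schedules are typically linear-to-zero); `base_lr` is taken from the Muon config above.

```python
def warmdown_lr(step, total_iters, warmdown_iters=2000, base_lr=0.025):
    # Hold base_lr, then decay linearly to 0 over the last warmdown_iters.
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```

Per the contributions list, SWA checkpoints are collected during this warmdown phase, when the shrinking LR keeps the iterates close enough to average well.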
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0}
Novel Contributions
- Mixed int5/int6 per-row quantization to save artifact space while preserving accuracy.
- 12-layer transformer enabled by int5 compression savings.
- SmearGate token-to-previous-token blending mechanism.
- BigramHash token-pair context embedding.
- U-Net style skip connections with learned per-dimension skip weights.
- Orthogonal initialization with scaled output projection.
- SWA checkpoint averaging during warmdown.
- Warmdown timing fix that ignores torch.compile overhead in step-time estimation.
- Sliding window evaluation with stride 256.