PR #474
Non-record: 6-Technique Stack — Catalytic Residuals + Value Residual + Gated Attention + BigramHash(10240) + 12L (val_bpb=1.1690)
by joshuaswarren
val_bpb
1.1690
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.3 MB
Training Techniques
Architecture
Catalytic Residuals
Residual connection of the form x + c * f(x), where c is a learned per-dimension scale.
parameters: null
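A minimal PyTorch sketch of the catalytic residual as described above; initializing c to ones is an assumption, not something the PR states.

```python
import torch
import torch.nn as nn

class CatalyticResidual(nn.Module):
    """Residual of the form x + c * f(x), where c is a learned
    per-dimension scale vector (ones-init is an assumption)."""
    def __init__(self, dim: int, f: nn.Module):
        super().__init__()
        self.f = f
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.c * self.f(x)
```

With c initialized to ones, the block starts as a plain residual connection and learns per-channel scales from there.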
Value Residual
Caches layer-0 value vectors and mixes them into subsequent layers via learned scalars.
parameters: null
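A sketch of the value-residual mix, assuming two learned scalars and a 0.5/0.5 init (the PR only says "learned scalars"):

```python
import torch
import torch.nn as nn

class ValueResidual(nn.Module):
    """Mixes a layer's value vectors v with cached layer-0 values v0
    via two learned scalars. The 0.5/0.5 init is an illustrative
    assumption."""
    def __init__(self):
        super().__init__()
        self.lam_cur = nn.Parameter(torch.tensor(0.5))
        self.lam_init = nn.Parameter(torch.tensor(0.5))

    def forward(self, v: torch.Tensor, v0: torch.Tensor) -> torch.Tensor:
        return self.lam_cur * v + self.lam_init * v0
```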
Gated Attention
Per-head sigmoid gate applied to the attention output.
parameters: null
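The description leaves the gate's conditioning open; one common reading is an input-conditioned per-head output gate, sketched below. The linear gate projection is an assumption.

```python
import torch
import torch.nn as nn

class GatedAttentionOutput(nn.Module):
    """Per-head sigmoid gate on the attention output of shape
    (B, H, T, head_dim). Conditioning the gate on the layer input
    via a linear projection is an assumption; the PR only states
    'per-head sigmoid gate applied after attention output'."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.to_gate = nn.Linear(dim, n_heads)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim) layer input; attn_out: (B, H, T, Dh)
        g = torch.sigmoid(self.to_gate(x))    # (B, T, H), values in (0, 1)
        g = g.permute(0, 2, 1).unsqueeze(-1)  # (B, H, T, 1)
        return attn_out * g
```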
BigramHash
Hash-based bigram embedding with 10240 buckets.
parameters: {"buckets":10240}
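A sketch of the hashed bigram embedding with 10240 buckets. The prime multiplier and the padding of position 0 are illustrative assumptions; the PR does not specify the hash function.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Hash-based bigram embedding: each (previous, current) token
    pair is hashed into one of `buckets` rows of an extra embedding
    table, typically added to the usual token embedding. The prime
    multiplier 1000003 and the id-0 pad at position 0 are
    assumptions."""
    def __init__(self, dim: int, buckets: int = 10240):
        super().__init__()
        self.buckets = buckets
        self.emb = nn.Embedding(buckets, dim)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (B, T) integer token ids
        prev = torch.roll(ids, shifts=1, dims=1)
        prev[:, 0] = 0  # no predecessor at position 0
        h = (prev * 1000003 + ids) % self.buckets
        return self.emb(h)
```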
MLP3x
3x expansion MLP instead of the baseline 2x expansion.
parameters: null
KV head count
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
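With 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads. A minimal sketch of the KV expansion step:

```python
import torch

def expand_kv(kv: torch.Tensor, n_heads: int) -> torch.Tensor:
    """Grouped-query attention helper: expand keys/values of shape
    (B, kv_heads, T, head_dim) so each KV head is shared by a
    contiguous group of query heads (8 query heads over 4 KV heads
    gives groups of 2)."""
    kv_heads = kv.shape[1]
    assert n_heads % kv_heads == 0
    return kv.repeat_interleave(n_heads // kv_heads, dim=1)
```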
depth
12-layer model.
parameters: {"layers":12}
Quantization
mixed int6/int8 QAT
bits: 6
scope: MLP and attention int6, embeddings int8
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Weight Averaging
SWA
parameters: {"start_fraction":0.8}
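A sketch of SWA with start_fraction=0.8: an equal-weight running average of parameter snapshots taken only over the last 20% of training. The snapshot cadence is an assumption.

```python
import torch

class WeightAverager:
    """Running equal-weight average of parameter snapshots, active
    only after `start_fraction` of training has elapsed. How often
    snapshots are taken (every step vs. every N steps) is an
    assumption; the PR only gives start_fraction=0.8."""
    def __init__(self, start_fraction: float = 0.8):
        self.start_fraction = start_fraction
        self.n = 0
        self.avg = None

    def update(self, params, progress: float):
        # progress: fraction of training completed, in [0, 1]
        if progress < self.start_fraction:
            return
        self.n += 1
        if self.avg is None:
            self.avg = [p.detach().clone() for p in params]
        else:
            for a, p in zip(self.avg, params):
                a += (p.detach() - a) / self.n
```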
Evaluation
sliding window eval
parameters: {"stride":64}
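One plausible reading of sliding-window eval with stride=64, sketched as a span schedule: each window advances by the stride and scores only the tokens not covered by a previous window, so every token is scored exactly once with near-maximal left context. The exact scoring scheme is an assumption.

```python
def sliding_windows(n_tokens: int, window: int, stride: int = 64):
    """Return (start, stop, n_scored) spans for sliding-window eval.
    Each window scores only its previously unscored suffix; the
    first window scores everything it covers."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```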
Initialization
OrthoInit
Orthogonal weight initialization with muP-style projection scaling.
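A sketch of the initializer; reading "muP-style projection scaling" as a 1/sqrt(fan_in) rescale after orthogonalization is an assumption about the PR's exact rule.

```python
import torch
import torch.nn as nn

def ortho_init_(linear: nn.Linear) -> None:
    """Orthogonal weight init followed by a 1/sqrt(fan_in) rescale
    (the rescale factor is an assumed reading of 'muP-style
    projection scaling')."""
    nn.init.orthogonal_(linear.weight)
    with torch.no_grad():
        linear.weight.mul_(linear.in_features ** -0.5)
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)
```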
Regularization
weight decay
parameters: {"weight_decay":0.04}
LR Schedule
warmdown
parameters: null
Other
other
Late QAT with threshold 0.25 using STE int6 fake-quantization during warmdown.
parameters: {"threshold":0.25}
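A sketch of the int6 fake-quantization step with a straight-through estimator. Per-tensor absmax scaling is an assumption, and the PR does not explain how the 0.25 threshold is applied beyond the text above.

```python
import torch

def fake_quant_int6(w: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Symmetric per-tensor int6 fake quantization with a straight-
    through estimator (STE): the forward pass snaps weights to a
    6-bit grid, the backward pass treats the op as identity.
    Per-tensor absmax scaling is an assumption."""
    levels = 2 ** (6 - 1) - 1  # 31 positive levels for signed int6
    scale = w.abs().max().clamp_min(eps) / levels
    q = (w / scale).round().clamp(-levels, levels) * scale
    return w + (q - w).detach()  # STE: forward = q, gradient = 1
```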
Compression
zstd
level: 22
Novel Contributions
- First submission to combine six independently proven architecture improvements in a single entry
- Catalytic Residuals
- Value Residual (ResFormer)
- Gated Attention
- BigramHash with 10240 buckets
- 12-layer model depth
- 3x MLP expansion