PR #474

open

Non-record: 6-Technique Stack — Catalytic Residuals + Value Residual + Gated Attention + BigramHash(10240) + 12L (val_bpb=1.1690)

by joshuaswarren
val_bpb: 1.1690
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.3 MB

Training Techniques

Architecture
Catalytic Residuals
Residual connection of the form x + c * f(x), where c is a learned per-dimension scale.
parameters: null
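A minimal numpy sketch of the catalytic residual form described above, y = x + c * f(x). The function name and toy values are illustrative only; the PR's actual module is not shown here.

```python
import numpy as np

def catalytic_residual(x, f, c):
    # y = x + c * f(x), where c is a learned per-dimension scale
    # (sketch; the real c would be a trainable parameter).
    return x + c * f(x)

# Toy example: f doubles its input; c scales each dimension separately.
x = np.array([1.0, 2.0, 3.0])
c = np.array([0.5, 1.0, 0.0])   # per-dimension scale
y = catalytic_residual(x, lambda z: 2.0 * z, c)
# y = x + c * 2x = [2.0, 6.0, 3.0]
```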
Value Residual
Caches layer-0 value vectors and mixes them into subsequent layers via learned scalars.
parameters: null
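The value-residual mixing can be sketched as below. A convex combination with a learned scalar lam is one common ResFormer-style formulation; the exact mixing rule used in this PR is an assumption here.

```python
import numpy as np

def mix_value_residual(v_layer, v0, lam):
    # Mix cached layer-0 values v0 into the current layer's values.
    # lam would be a learned per-layer scalar; convex mix is assumed.
    return lam * v_layer + (1.0 - lam) * v0

v0 = np.zeros((2, 4))        # value vectors cached at layer 0
v5 = 2.0 * np.ones((2, 4))   # values computed at a later layer
mixed = mix_value_residual(v5, v0, lam=0.5)
```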
Gated Attention
Per-head sigmoid gate applied after attention output.
parameters: null
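A sketch of the per-head sigmoid gate applied after attention output. A static learned logit per head is assumed; the PR may instead condition the gate on the input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate_attention_output(attn_out, gate_logits):
    # attn_out: (n_heads, seq_len, head_dim)
    # gate_logits: (n_heads,) learned parameters (assumed static per head)
    g = sigmoid(gate_logits)[:, None, None]   # broadcast over seq and dim
    return g * attn_out

attn_out = np.ones((2, 3, 4))
gates = np.array([0.0, 100.0])   # sigmoid -> 0.5 and ~1.0
out = gate_attention_output(attn_out, gates)
```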
BigramHash
Hash-based bigram embedding with 10240 buckets.
parameters: {"buckets":10240}
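A hash-based bigram embedding with 10240 buckets can be sketched as follows. The hash multiplier and the zero-embedding for position 0 are assumptions, not taken from the PR.

```python
import numpy as np

N_BUCKETS = 10240

def bigram_bucket(prev_tok, tok, n_buckets=N_BUCKETS):
    # Hash the (previous, current) token pair into a fixed bucket.
    # The multiplier is an arbitrary odd constant, not from the PR.
    return (prev_tok * 1000003 + tok) % n_buckets

def bigram_embeddings(tokens, table):
    # table: (N_BUCKETS, d) learned embedding matrix.
    # Position 0 has no preceding token, so it gets zeros (assumed).
    out = np.zeros((len(tokens), table.shape[1]))
    for i in range(1, len(tokens)):
        out[i] = table[bigram_bucket(tokens[i - 1], tokens[i])]
    return out

table = np.random.randn(N_BUCKETS, 8)
emb = bigram_embeddings([5, 9, 5, 9], table)
```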
MLP3x
3x expansion MLP instead of the baseline 2x expansion.
parameters: null
KV head count
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
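With 8 query heads and 4 KV heads, each KV head is shared by 2 consecutive query heads. The grouping convention below (contiguous groups) is the usual GQA layout, assumed here:

```python
def kv_head_for(q_head, n_heads=8, n_kv_heads=4):
    # Grouped-query attention: consecutive query heads share one KV head.
    # With 8 query heads and 4 KV heads, each group has 2 query heads.
    group_size = n_heads // n_kv_heads
    return q_head // group_size

mapping = [kv_head_for(h) for h in range(8)]
```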
depth
12-layer model.
parameters: {"layers":12}
Quantization
mixed int6/int8 QAT
bits: 6
scope: MLP and attention int6, embeddings int8
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Weight Averaging
SWA
parameters: {"start_fraction":0.8}
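SWA with start_fraction 0.8 keeps a running mean of checkpoints collected over the last 20% of training. A plain-Python sketch of the incremental update (real code would operate on model state dicts):

```python
def swa_update(avg, weights, n_collected):
    # Incremental running mean over checkpoints collected after the
    # SWA start point (80% of training, per start_fraction=0.8).
    return [(a * n_collected + w) / (n_collected + 1)
            for a, w in zip(avg, weights)]

avg = [1.0, 2.0]          # average after the first collected checkpoint
new = [3.0, 4.0]          # a later checkpoint
avg = swa_update(avg, new, n_collected=1)
# avg is now the mean of the two checkpoints: [2.0, 3.0]
```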
Evaluation
sliding window eval
parameters: {"stride":64}
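Sliding-window evaluation with stride 64 scores each token once while giving it as much left context as possible. A sketch of the standard window-planning loop (the PR's exact loop is not shown):

```python
def sliding_window_spans(n_tokens, context, stride=64):
    # Returns (begin, end, score_from) triples: each window covers
    # [begin, end) but only tokens in [score_from, end) are scored,
    # so every token is scored exactly once.
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_window_spans(n_tokens=10, context=4, stride=2)
```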
Initialization
OrthoInit
Orthogonal weight initialization with muP-style projection scaling.
Regularization
weight decay
parameters: {"weight_decay":0.04}
LR Schedule
warmdown
parameters: null
Other
other
Late QAT with threshold 0.25 using STE int6 fake-quantization during warmdown.
parameters: {"threshold":0.25}
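The STE int6 fake-quantization used during warmdown can be sketched as below. Only the forward path is shown; in training, the backward pass would treat the rounding as identity (straight-through estimator). Per-tensor symmetric scaling is an assumption.

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    # Symmetric fake quantization: round to int6 levels, then
    # dequantize back to float. Gradients would pass straight
    # through (STE); per-tensor scaling is assumed.
    qmax = 2 ** (bits - 1) - 1                  # 31 for int6
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.array([0.0, 0.37, -1.0, 1.0])
wq = fake_quant_int6(w)
```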
Compression
zstd
level: 22

Novel Contributions

  • First submission to combine six independently proven architecture improvements in a single entry
  • Catalytic Residuals
  • Value Residual (ResFormer)
  • Gated Attention
  • BigramHash with 10240 buckets
  • 12-layer model depth
  • 3x MLP expansion