PR #194
Record: 11L Int6 QAT + SmearGate + SWA + SAM: 1.1480 BPB (3-seed mean)
by baudrillardsgh0st
val_bpb
1.1480
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.33 MiB
Training Techniques
Quantization
STE QAT
bits: 6
scope: all weights except the tied embeddings, which are kept in fp16
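A minimal numpy sketch of int6 QAT with a straight-through estimator (STE). Per-tensor symmetric scaling is an assumption; the record does not specify the quantization granularity.

```python
import numpy as np

def fake_quant_int6(w, eps=1e-8):
    """Symmetric per-tensor fake quantization to the int6 range [-32, 31].
    The forward pass sees the quantized weights; under a straight-through
    estimator the backward pass treats this op as identity, so the
    underlying fp weights receive the unmodified gradient."""
    scale = np.abs(w).max() / 31.0 + eps
    codes = np.clip(np.round(w / scale), -32, 31)
    return codes * scale
```

At export time only the integer codes and the scale are stored; during training the full-precision weights keep being updated.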
Architecture
SmearGate
Per-dimension learned gate blending current and previous token embeddings.
parameters: {"dimensions":512}
MLP3x
Expanded MLP hidden size to 3x the model dimension.
parameters: {"multiplier":3}
tied embeddings
Input embeddings and the output projection share one weight matrix, which is kept in fp16 and passed through quantization unchanged.
parameters: null
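A minimal sketch of the tying, assuming the usual scheme where the embedding table doubles as the output projection (class and method names are illustrative):

```python
import numpy as np

class TiedEmbedding:
    """One shared matrix serves as both the input embedding lookup and
    the output projection (logits = h @ E.T). Kept in fp16 and excluded
    from int6 quantization, per the record."""
    def __init__(self, vocab_size, d_model, rng=None):
        rng = rng or np.random.default_rng(0)
        self.E = (rng.standard_normal((vocab_size, d_model)) * 0.02).astype(np.float16)

    def embed(self, token_ids):
        return self.E[token_ids]

    def logits(self, h):
        return h.astype(np.float16) @ self.E.T
```

Tying halves the parameter count of the largest matrices, which matters directly for the artifact size limit.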
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
Optimizer
Muon
weight_decay: 0.038
momentum: 0.99
other_params: {"warmup_momentum_start":0.92,"warmup_steps":1500}
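The listed other_params imply a momentum warmup from 0.92 to 0.99 over the first 1500 steps. A sketch, assuming a linear ramp (the record gives only the endpoints, not the schedule shape):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum warmup for Muon: ramp from `start` to `end` over the
    first `warmup_steps` optimizer steps, then hold `end`.
    The linear shape is an assumption."""
    t = min(step / warmup_steps, 1.0)
    return start + (end - start) * t
```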
Weight Averaging
SWA
parameters: {"every_steps":50,"start_frac":0.5}
Compression
zstd
level: 22
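Per the contributions list, the int6 codes are stored one per int8 byte rather than bit-packed. A sketch of that container format; since only 64 distinct byte values occur, the stream is low-entropy and zstd at level 22 can exploit it:

```python
import numpy as np

def to_int8_container(w):
    """Quantize to int6 codes ([-32, 31]) but store each code in a full
    int8 byte; the resulting byte stream has at most 64 symbols, which
    zstd's entropy coding compresses well."""
    scale = np.abs(w).max() / 31.0 + 1e-8
    codes = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return codes, scale

def from_int8_container(codes, scale):
    return codes.astype(np.float32) * scale
```

The artifact would then be compressed with something like `zstandard.ZstdCompressor(level=22).compress(codes.tobytes())` (using the common `zstandard` Python binding; the submission's exact tooling is not stated).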
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
Initialization
OrthoInit
Orthogonal initialization used to support SmearGate and improve training stability.
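Orthogonal initialization is conventionally done via QR decomposition of a Gaussian matrix; a sketch (the submission's exact init code is not shown):

```python
import numpy as np

def orthogonal_init(rows, cols, gain=1.0, rng=None):
    """Orthogonal init: QR-decompose a Gaussian matrix and sign-correct
    by diag(R) so the result is uniform over the orthogonal group."""
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # fix column signs
    if rows < cols:
        q = q.T
    return gain * q
```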
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
Regularization
weight decay
parameters: {"weight_decay":0.038}
Other
other
Sharpness-Aware Minimization (SAM) applied during training to flatten the loss landscape and improve quantization robustness.
parameters: {"rho":0.05,"frac":0.03}
Novel Contributions
- First introduction of SAM to the competition
- Per-dimension SmearGate with learned sigmoid gating over embedding dimensions
- Int6 QAT with int6 values stored in int8 containers for better zstd compression
- Combination of SWA and SAM to improve post-quantization robustness
- Use of sliding-window evaluation to recover additional BPB
- 11-layer architecture that fits under the artifact size limit with int6 compression