PR #1941
Record: PR #1886 base + per-block MLP output gate (Linear, weight-learnable) — val_bpb 1.06872 (3-seed mean)
by MarioPaerle
val_bpb
1.0687
Architecture
Transformer
Optimizer
—
Artifact Size
~15.9 MB
Training Techniques
Architecture
Gated Attention
Per-block MLP output gate: a learnable linear layer computes a token-wise sigmoid gate that scales the MLP output (see the sketch after the parameter line below).
parameters: {"layers":11,"gate_input_dim":12,"gate_output_dim":1}
Quantization
GPTQ
bits: null
scope: artifact/model
Test-Time Training
LoRA TTT
parameters: {"rank":128,"alpha":144}
Regularization
weight decay
parameters: {"weight_decay":2}
layerwise LN scale
parameters: null
Initialization
resid mix
Gate initialized with weight 0 and bias +5 so the sigmoid output starts near 1 (σ(5) ≈ 0.993) and the gate is effectively an identity pass-through at the start, minimizing initial disruption.
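A quick arithmetic check of the quoted figure: with the weight at zero, the initial gate value is σ(bias) for every token, independent of the input.

```python
import math
print(1 / (1 + math.exp(-5.0)))  # 0.9933..., the gate's value at init
```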
Other
other
Phased test-time training with score-before-update behavior; each chunk is scored under inference_mode() before any LoRA update.
parameters: {"phases":3,"prefix_docs":2000}
Novel Contributions
- Per-block MLP output gate that is input-dependent and weight-learnable
- Token-wise gating of MLP outputs rather than head-wise gating
- Bug fix making the gate weight-learnable in the modern PR #1886 stack
- Do-no-harm gate initialization with weight 0 and bias +5