PR #1941

open

Record: PR #1886 base + per-block MLP output gate (Linear, weight-learnable) — val_bpb 1.06872 (3-seed mean)

by MarioPaerle
val_bpb
1.0687
Architecture
Transformer
Optimizer
Artifact Size
~15.9 MB

Training Techniques

Architecture
Gated Attention
Per-block, weight-learnable linear gate on the MLP output, applied token-wise through a sigmoid.
parameters: {"layers":11,"gate_input_dim":12,"gate_output_dim":1}
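A minimal sketch of what such a gate could look like in PyTorch. `d_model` and the module/argument names are placeholders, not the PR's actual code; the record's parameters suggest a 12-dimensional gate input producing a single scalar per token.

```python
import torch
import torch.nn as nn

class MLPOutputGate(nn.Module):
    """Sketch: per-block, token-wise sigmoid gate on the MLP output.

    One learnable linear layer maps each token's input to a scalar in (0, 1),
    which rescales that token's entire MLP output vector.
    """
    def __init__(self, d_model: int):
        super().__init__()
        # (B, T, d_model) -> (B, T, 1): one gate value per token
        self.gate = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor, mlp_out: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(x))  # token-wise gate in (0, 1)
        return g * mlp_out               # broadcasts over the feature dim
```

Because the gate is a function of the block input `x`, it is input-dependent as well as weight-learnable, unlike a fixed per-block scalar.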
Quantization
GPTQ
bits: null
scope: artifact/model
Test-Time Training
LoRA TTT
parameters: {"rank":128,"alpha":144}
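A hedged sketch of a LoRA adapter as it might be used for test-time training here; the class and initialization are generic LoRA conventions (zero-initialized `B` so the adapter starts as an identity), not the PR's implementation. The rank/alpha values come from the record.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch: low-rank adapter over a frozen base linear layer."""
    def __init__(self, base: nn.Linear, rank: int = 128, alpha: int = 144):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # only the adapters train at test time
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        # B starts at zero, so the wrapped layer initially equals the base layer
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank  # 144 / 128 = 1.125 with the record's values

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```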
Regularization
weight decay
parameters: {"weight_decay":2}
layerwise LN scale
parameters: null
Initialization
resid mix
Gate initialized with weight 0 and bias +5 so sigmoid output starts near identity (~0.993), minimizing initial disruption.
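The initialization above can be verified with a short snippet. The dimensions are taken from the record's gate parameters; with zero weight, the gate output is exactly sigmoid(5) ≈ 0.9933 for every token, regardless of input.

```python
import torch
import torch.nn as nn

# Do-no-harm init (sketch): zero weight, bias +5, so the gate starts ~open.
gate = nn.Linear(12, 1)  # dims from the record's gate parameters
nn.init.zeros_(gate.weight)
nn.init.constant_(gate.bias, 5.0)

x = torch.randn(4, 12)
g = torch.sigmoid(gate(x))
# With zero weight the input contributes nothing, so g == sigmoid(5) ≈ 0.9933
# everywhere: the MLP output passes through almost unchanged at step 0.
```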
Other
other
Phased test-time training with score-before-update behavior; each chunk is scored under inference_mode() before any LoRA update.
parameters: {"phases":3,"prefix_docs":2000}
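The score-before-update behavior could be structured roughly as follows. The loop shape, the `model(chunk)`-returns-a-loss interface, and the function name are assumptions for illustration; only the "score under inference_mode(), then adapt" ordering comes from the record.

```python
import torch

def phased_ttt(model, chunks, opt, phases=3):
    """Sketch: phased test-time training with score-before-update.

    Each chunk is scored under inference_mode() BEFORE any LoRA update,
    so the reported score never reflects adaptation to that same chunk.
    """
    scores = []
    for _ in range(phases):
        for chunk in chunks:
            with torch.inference_mode():
                scores.append(float(model(chunk)))  # score first, no grads
            loss = model(chunk)                     # then adapt on the chunk
            loss.backward()
            opt.step()
            opt.zero_grad()
    return scores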

Novel Contributions

  • Per-block MLP output gate that is input-dependent and weight-learnable
  • Token-wise gating of MLP outputs rather than head-wise gating
  • Bug fix making the gate weight-learnable in the modern PR #1886 stack
  • Do-no-harm gate initialization with weight 0 and bias +5