PR #1941

open

Record: PR #1886 base + per-block MLP output gate (Linear, weight-learnable) — val_bpb 1.06872 (3-seed mean)

by MarioPaerle
val_bpb
1.0687
Architecture
Transformer
Optimizer
Artifact Size
~15.9 MB

Training Techniques

Architecture
Gated Attention
Per-block, weight-learnable linear gate on the MLP output, applied token-wise through a sigmoid.
parameters: {"layers":11,"gate_input_dim":12,"gate_output_dim":1}
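A minimal sketch of what such a gate could look like in PyTorch. `d_model` and the module/argument names are placeholders, not the PR's actual code; the record's parameters suggest a 12-dimensional gate input producing a single scalar per token.

```python
import torch
import torch.nn as nn

class MLPOutputGate(nn.Module):
    """Sketch: per-block, token-wise sigmoid gate on the MLP output.

    One learnable linear layer maps each token's input to a scalar in (0, 1),
    which rescales that token's entire MLP output vector.
    """
    def __init__(self, d_model: int):
        super().__init__()
        # (B, T, d_model) -> (B, T, 1): one gate value per token
        self.gate = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor, mlp_out: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(x))  # token-wise gate in (0, 1)
        return g * mlp_out               # broadcasts over the feature dim
```

Because the gate is a function of the block input `x`, it is input-dependent as well as weight-learnable, unlike a fixed per-block scalar.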
Quantization
GPTQ
bits: null
scope: artifact/model
Test-Time Training
LoRA TTT
parameters: {"rank":128,"alpha":144}
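A hedged sketch of a LoRA adapter as it might be used for test-time training here; the class and initialization are generic LoRA conventions (zero-initialized `B` so the adapter starts as an identity), not the PR's implementation. The rank/alpha values come from the record.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch: low-rank adapter over a frozen base linear layer."""
    def __init__(self, base: nn.Linear, rank: int = 128, alpha: int = 144):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # only the adapters train at test time
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        # B starts at zero, so the wrapped layer initially equals the base layer
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank  # 144 / 128 = 1.125 with the record's values

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```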
Regularization
weight decay
parameters: {"weight_decay":2}
layerwise LN scale
parameters: null
Initialization
resid mix
Gate initialized with weight 0 and bias +5 so sigmoid output starts near identity (~0.993), minimizing initial disruption.
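The initialization above can be verified with a short snippet. The dimensions are taken from the record's gate parameters; with zero weight, the gate output is exactly sigmoid(5) ≈ 0.9933 for every token, regardless of input.

```python
import torch
import torch.nn as nn

# Do-no-harm init (sketch): zero weight, bias +5, so the gate starts ~open.
gate = nn.Linear(12, 1)  # dims from the record's gate parameters
nn.init.zeros_(gate.weight)
nn.init.constant_(gate.bias, 5.0)

x = torch.randn(4, 12)
g = torch.sigmoid(gate(x))
# With zero weight the input contributes nothing, so g == sigmoid(5) ≈ 0.9933
# everywhere: the MLP output passes through almost unchanged at step 0.
```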
Other
other
Phased test-time training with score-before-update behavior; each chunk is scored under inference_mode() before any LoRA update.
parameters: {"phases":3,"prefix_docs":2000}
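The score-before-update behavior could be structured roughly as follows. The loop shape, the `model(chunk)`-returns-a-loss interface, and the function name are assumptions for illustration; only the "score under inference_mode(), then adapt" ordering comes from the record.

```python
import torch

def phased_ttt(model, chunks, opt, phases=3):
    """Sketch: phased test-time training with score-before-update.

    Each chunk is scored under inference_mode() BEFORE any LoRA update,
    so the reported score never reflects adaptation to that same chunk.
    """
    scores = []
    for _ in range(phases):
        for chunk in chunks:
            with torch.inference_mode():
                scores.append(float(model(chunk)))  # score first, no grads
            loss = model(chunk)                     # then adapt on the chunk
            loss.backward()
            opt.step()
            opt.zero_grad()
    return scores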

Novel Contributions

  • Per-block MLP output gate that is input-dependent and weight-learnable
  • Token-wise gating of MLP outputs rather than head-wise gating
  • Bug fix making the gate weight-learnable in the modern PR #1886 stack
  • Do-no-harm gate initialization with weight 0 and bias +5