PR #1880

open

Record: AttnOutGate + SmearGate + Softcap 15 — val_bpb 1.07750 (3-seed mean)

by Meirzhan05

val_bpb

1.0775

Architecture

Transformer

Optimizer

Muon

Artifact Size

~15.99 MB

Training Techniques

Architecture

AttnOutGate

Per-head data-dependent gate on SDPA output before output projection.

parameters: {"heads":8,"width":12,"layers":11}
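
The per-head gate can be illustrated with a small sketch. It is not taken from the PR: it assumes the gate is a sigmoid of a linear projection of the first `width` channels of the layer input, giving one scalar per head, and the names `attn_out_gate` and `gate_w` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def attn_out_gate(x, head_outputs, gate_w, width=12):
    """Scale each head's SDPA output by a data-dependent scalar gate.

    x            : layer input for one token (list of floats)
    head_outputs : one output vector per head (list of lists)
    gate_w       : gate_w[h] holds `width` weights for head h (hypothetical)
    """
    feats = x[:width]  # assumption: the gate reads only the first `width` channels
    gated = []
    for h, out in enumerate(head_outputs):
        g = sigmoid(sum(w * f for w, f in zip(gate_w[h], feats)))
        gated.append([g * v for v in out])
    return gated  # the output projection would be applied after this step
```

With all-zero gate weights every gate equals 0.5, so each head's output is simply halved.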

SmearGate

Forward-1-token residual mixer at the embedding lane with BOS masking to prevent cross-document leakage.

parameters: {"width":12}
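
A minimal sketch of the mixing step, under assumptions the PR does not confirm: each token embedding receives a gated copy of the previous token's embedding, the gate is a sigmoid over the first `width` channels of the current embedding, and the mix is skipped when the current token is BOS so that a new document never sees the previous one. The names `smear_gate`, `gate_w` and `bos_id` are hypothetical.

```python
import math

def smear_gate(embeds, token_ids, gate_w, bos_id, width=12):
    """Mix the previous token's embedding into the current one.

    embeds    : list of embedding vectors, one per position
    token_ids : token id at each position
    gate_w    : `width` gate weights (hypothetical parameterization)
    """
    out = [list(embeds[0])]  # position 0 has no predecessor
    for t in range(1, len(embeds)):
        if token_ids[t] == bos_id:
            # BOS masking: a document start takes nothing from the prior document
            out.append(list(embeds[t]))
            continue
        z = sum(w * f for w, f in zip(gate_w, embeds[t][:width]))
        g = 1.0 / (1.0 + math.exp(-z))
        out.append([cur + g * prev for cur, prev in zip(embeds[t], embeds[t - 1])])
    return out
```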

GQA

Grouped query attention with fewer KV heads than attention heads.

parameters: {"heads":8,"kv_heads":4}
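
With 8 query heads and 4 KV heads, each KV head serves two query heads. The sketch below shows one common mapping (consecutive query heads share a KV head); the function name is hypothetical and the PR may order heads differently.

```python
def kv_head_for_query_head(q_head, heads=8, kv_heads=4):
    """Return the KV head index that a given query head attends with."""
    group_size = heads // kv_heads  # query heads per KV head (2 here)
    return q_head // group_size

mapping = [kv_head_for_query_head(h) for h in range(8)]
```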

depth recurrence

Layers 3-5 are looped multiple times to create virtual layers.

parameters: {"layers":[3,4,5],"loops":3,"virtual_layers":17}
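
The virtual layer count follows from the parameters: 11 physical layers plus two extra passes over 3 layers gives 17. The sketch below builds one possible execution order, assuming the looped block runs as a unit three times in a row; the actual interleaving in the PR is not stated, and `virtual_layer_order` is a hypothetical helper.

```python
def virtual_layer_order(num_layers=11, looped=(3, 4, 5), loops=3):
    """List the physical layer index executed at each virtual layer."""
    order = []
    i = 0
    while i < num_layers:
        if i == looped[0]:
            order.extend(list(looped) * loops)  # repeat the block `loops` times
            i = looped[-1] + 1
        else:
            order.append(i)
            i += 1
    return order

order = virtual_layer_order()
```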

XSA

XSA applied across all layers.

parameters: {"layers":11}

U-Net skip connections

U-Net style skip connections with learnable gates.

parameters: null
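
As a rough sketch of gated U-Net skips: the first half of the stack saves its outputs, and the second half adds them back in reverse order, each scaled by a learnable gate. Scalars stand in for activations here, and the split point, gate placement and the names `unet_forward` and `skip_gates` are assumptions rather than details from the PR.

```python
def unet_forward(x, layers, skip_gates):
    """Run `layers` with U-Net style skips from the first half to the second."""
    n_enc = len(layers) // 2
    skips = []
    for layer in layers[:n_enc]:
        x = layer(x)
        skips.append(x)  # saved for the mirrored decoder layer
    for i, layer in enumerate(layers[n_enc:]):
        if skips:
            # most recent encoder output pairs with the first decoder layer
            x = x + skip_gates[i] * skips.pop()
        x = layer(x)
    return x
```

Setting every gate to zero recovers the plain sequential stack, which is a convenient sanity check.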

LeakyReLU

MLP uses LeakyReLU squared activation.

parameters: {"slope":0.5}
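
One reading of "LeakyReLU squared" is to apply LeakyReLU with the given slope and then square the result elementwise. The sketch below follows that reading; whether the PR keeps the sign of negative inputs is not stated, so this is an assumption.

```python
def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU followed by squaring (assumed form of the activation)."""
    y = x if x >= 0 else slope * x
    return y * y
```

Under this form a negative input of -2.0 maps to 1.0, a quarter of what the positive input 2.0 produces.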

Regularization

logit softcap

parameters: {"cap":15}
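
Logit softcapping is commonly written as cap * tanh(logit / cap), which leaves small logits nearly unchanged and bounds large ones to the interval (-cap, cap). The sketch assumes that standard form; the PR only states the cap value.

```python
import math

def softcap(logit, cap=15.0):
    """Smoothly bound a logit to (-cap, cap) using tanh."""
    return cap * math.tanh(logit / cap)
```

Lowering the cap from 30 to 15 halves the largest magnitude any logit can reach.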

Weight Averaging

EMA

parameters: {"decay":0.9965}
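
An exponential moving average of the weights with this decay can be sketched as follows. The update rule is the standard one; `ema_update` and the flat list representation of weights are illustrative only.

```python
def ema_update(ema, weights, decay=0.9965):
    """Blend current weights into the running average."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]
```

With decay 0.9965, each step moves the average 0.35% of the way toward the current weights.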

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"5_step_NS":true}
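
"5_step_NS" refers to five Newton-Schulz iterations, which Muon uses to approximately orthogonalize the update matrix. The sketch below uses the quintic coefficients from the widely circulated Muon reference implementation; the PR does not state its coefficients, so treat them as an assumption. Plain nested lists stand in for tensors.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

def newton_schulz5(g, steps=5, eps=1e-7):
    """Push the singular values of `g` toward 1 with a quintic iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # assumed coefficients
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]  # Frobenius normalization
    for _ in range(steps):
        xxt = matmul(x, transpose(x))
        xxt2 = matmul(xxt, xxt)
        poly = [[b * p + c * q for p, q in zip(r1, r2)] for r1, r2 in zip(xxt, xxt2)]
        px = matmul(poly, x)
        x = [[a * u + v for u, v in zip(r1, r2)] for r1, r2 in zip(x, px)]
    return x
```

The result is only approximately orthogonal: singular values land in a band around 1 rather than exactly on it.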

AdamW

weight_decay: null

momentum: null

other_params: {"used_for":"embeds/scalars"}

SGD

weight_decay: null

momentum: 0.9

other_params: {"used_for":"TTT"}

Test-Time Training

score-first TTT

parameters: {"learning_rate":0.005,"epochs":3,"chunk_size":32000}
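
"Score-first" is read here as: each chunk is scored with the current weights before the model takes any gradient steps on it, so no token is evaluated by a model that has already trained on it. The loop below sketches that ordering; `score_fn` and `train_fn` are hypothetical callables standing in for the loss computation and the SGD step.

```python
def score_first_ttt(tokens, score_fn, train_fn, chunk_size=32000, epochs=3):
    """Score each chunk, then adapt on it, and return the mean loss per token.

    score_fn(chunk) -> summed loss over the chunk (hypothetical)
    train_fn(chunk) -> one pass of gradient updates on the chunk (hypothetical)
    """
    total_loss, total_tokens = 0.0, 0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        total_loss += score_fn(chunk)  # scored before any update on this chunk
        total_tokens += len(chunk)
        for _ in range(epochs):
            train_fn(chunk)  # later chunks benefit from this adaptation
    return total_loss / total_tokens
```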

Quantization

GPTQ

bits: null

scope: full model

Compression

lzma

level: null

brotli

level: 11

Evaluation

sliding window eval

parameters: null
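
No parameters are given for the sliding window evaluation, so the sketch below only illustrates the general idea: the window advances by a stride, and only the newest tokens in each window are scored, so every token is scored exactly once with as much left context as the window allows. The `window` and `stride` values and the helper name are hypothetical.

```python
def sliding_windows(n_tokens, window, stride):
    """Return (context_start, score_start, end) spans covering all tokens."""
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        ctx_start = max(0, end - window)  # context reaches back up to `window` tokens
        spans.append((ctx_start, pos, end))  # only tokens in [pos, end) are scored
        pos = end
    return spans
```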

LR Schedule

warmdown

parameters: {"warmdown_pct":72}

linear warmup

parameters: {"warmup_steps":20}
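
Taken together, the two schedule entries suggest a multiplier that ramps up over 20 steps, holds at 1, and decays over the final 72% of training. The sketch assumes a linear decay to zero, which the PR does not confirm; `lr_scale` and `total_steps` are hypothetical names.

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_pct=72):
    """Learning-rate multiplier: linear warmup, flat, then linear warmdown."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    warmdown_start = total_steps * (1 - warmdown_pct / 100)
    if step < warmdown_start:
        return 1.0
    # assumed: linear decay to zero across the warmdown phase
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))
```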

Novel Contributions

AttnOutGate per-head gating on attention outputs
SmearGate forward-1-token residual mixing with BOS masking
Lower logit softcap from 30 to 15
Three additive zero-cost modifications combined to reach a new record