PR #1967

open

Record: V21 + N-gram Tilt + LeakyReLU 0.3 — val_bpb 1.05851 (3-seed mean)

val_bpb

1.0585

Architecture

Transformer

Optimizer

—

Artifact Size

15,949,305 bytes

Training Techniques

Quantization

GPTQ-lite

bits: null

scope: mixed-precision weights

Architecture

LeakyReLU

LeakyReLU-squared activation with negative slope 0.3, patched into the MLP activation path.

parameters: {"slope":0.3}
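
A minimal sketch of the activation, assuming "LeakyReLU squared" means the leaky ReLU output is squared elementwise (the submission summary does not spell this out). The function name leaky_relu_squared is hypothetical and not taken from the PR.

```python
def leaky_relu_squared(x: float, slope: float = 0.3) -> float:
    """Leaky ReLU followed by squaring (assumed reading of 'LeakyReLU squared')."""
    # Negative inputs are scaled by `slope` instead of being zeroed out.
    y = x if x >= 0.0 else slope * x
    # Squaring makes the output non-negative for every input.
    return y * y
```

With slope 0.3, a negative pre-activation still contributes (0.3 * x)^2 rather than 0, so units with negative inputs keep a nonzero gradient.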

weight tying

Not mentioned in the submission.

parameters: null

Test-Time Training

score-first TTT

parameters: {"learning_rate":0.75}
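
The score-first ordering can be illustrated with a toy loop: each chunk is scored with the current weights, and only afterwards used for an update. This is a sketch under strong assumptions, not the submission's code: the "model" here is a single context-free logit vector, the update is plain gradient descent, and score_first_ttt is a hypothetical name. Only the learning rate 0.75 comes from the record.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def score_first_ttt(chunks, vocab_size, learning_rate=0.75):
    """Return mean NLL (nats/token) when each chunk is scored, then trained on.

    Toy stand-in for a real model: one logit vector shared by all positions.
    """
    logits = [0.0] * vocab_size
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:
        # 1) Score first: the loss is recorded before any update sees this chunk.
        probs = softmax(logits)
        total_nll += -sum(math.log(probs[t]) for t in chunk)
        total_tokens += len(chunk)
        # 2) Then adapt: one gradient step on the chunk that was just scored.
        #    The mean cross-entropy gradient w.r.t. the logits is probs - freq.
        freq = [0.0] * vocab_size
        for t in chunk:
            freq[t] += 1.0 / len(chunk)
        logits = [l - learning_rate * (p - f)
                  for l, p, f in zip(logits, probs, freq)]
    return total_nll / total_tokens
```

Because scoring precedes the update, the first chunk is always evaluated with unadapted weights, and no token's score depends on an update that has already seen that token.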

Other

other

Closed-form causal n-gram tilt with three experts and closed-form renormalization by the normalizer Z. The static hint table is precomputed outside validate().

parameters: {"experts":3,"token_order":16,"within_doc":true,"word_order":4,"precompute_outside_validate":true}
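
One way a closed-form normalizer can arise is sketched below. It assumes each expert proposes at most one hint token, chosen from preceding tokens only, and adds a fixed boost to that token's log-probability. Under that assumption Z depends only on the model probabilities of the hinted tokens, so no sum over the vocabulary is needed. The boost form, the function name tilted_log_prob, and the argument layout are illustrative assumptions, not the submission's implementation.

```python
import math

def tilted_log_prob(p_target, target, hints, p_hint):
    """log p'(target), where p'(x) is proportional to p(x) * exp(boost(x)).

    p_target: model probability of the target token.
    hints:    list of (token, boost) pairs, one per expert that fired.
    p_hint:   dict mapping every hinted token to its model probability.
    """
    # Experts that agree on a token add their boosts together.
    boost = {}
    for tok, beta in hints:
        boost[tok] = boost.get(tok, 0.0) + beta
    # Closed-form normalizer: only hinted tokens change their mass, so
    # Z = 1 + sum over hinted tokens of p(tok) * (exp(boost) - 1).
    z = 1.0 + sum(p_hint[tok] * (math.exp(b) - 1.0) for tok, b in boost.items())
    return math.log(p_target) + boost.get(target, 0.0) - math.log(z)
```

When a hint matches the target, the score improves by boost - log Z. When it misses, the target pays log Z. With no hints, Z is 1 and the model's log-probability is returned unchanged.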

Sequence Length

sequence_length

train_length: null

eval_length: 2560

Regularization

logit softcap

parameters: {"asymmetric_logit_rescale":true}
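
For orientation, a tanh logit softcap and one possible asymmetric rescale are sketched below. The softcap form cap * tanh(logit / cap) is the common definition. The asymmetric step (separate scale factors for positive and negative logits) is purely an assumption for illustration, since the record only states that asymmetric_logit_rescale is enabled. All names and values here are hypothetical.

```python
import math

def softcap(logit: float, cap: float) -> float:
    """Smoothly bound a logit to the interval [-cap, cap]."""
    return cap * math.tanh(logit / cap)

def asymmetric_rescale(logit: float, pos_scale: float, neg_scale: float) -> float:
    """Scale positive and negative logits by different factors (assumed form)."""
    return logit * (pos_scale if logit >= 0.0 else neg_scale)

def capped_rescaled(logit, cap, pos_scale, neg_scale):
    # Cap first, then rescale each sign separately (the order is an assumption).
    return asymmetric_rescale(softcap(logit, cap), pos_scale, neg_scale)
```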

V21 stack composition with AWQ-Lite mixed-precision GPTQ and Asymmetric Logit Rescale
TTT/QK environment-variable tuning: TTT_LR=0.75, QK_GAIN_INIT=5.25, and TTT_NO_QV_MASK=1
LeakyReLU slope 0.3 patch
Closed-form causal n-gram tilt with three experts and closed-form renormalization
Static n-gram hint table precomputed outside validate(); the validation score is identical
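
The environment knobs listed above could be read along these lines. This is a sketch only: read_knobs is a hypothetical helper, the fallback values are the settings reported for this record rather than the training script's actual defaults, and treating TTT_NO_QV_MASK as a "1"/"0" flag is an assumption.

```python
import os

def read_knobs(env=os.environ):
    """Parse the three tuned knobs from a mapping of environment variables."""
    return {
        # Fallbacks mirror the values reported in this record (assumption).
        "TTT_LR": float(env.get("TTT_LR", "0.75")),
        "QK_GAIN_INIT": float(env.get("QK_GAIN_INIT", "5.25")),
        # Assumed to be a boolean flag encoded as "1" or "0".
        "TTT_NO_QV_MASK": env.get("TTT_NO_QV_MASK", "1") == "1",
    }
```

Passing a plain dict instead of os.environ makes the parsing easy to check without touching the process environment.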