PR #1967

open

Record: V21 + N-gram Tilt + LeakyReLU 0.3 — val_bpb 1.05851 (3-seed mean)

by ndokutovich
val_bpb: 1.0585
Architecture: Transformer
Optimizer: (none listed)
Artifact Size: 15,949,305 bytes

Training Techniques

Quantization
GPTQ-lite
bits: null
scope: mixed-precision weights
Architecture
LeakyReLU
Squared LeakyReLU with negative slope 0.3, patched into the MLP activation path.
parameters: {"slope":0.3}
weight tying
Not mentioned in the submission.
parameters: null
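A minimal sketch of the LeakyReLU entry above, assuming "LeakyReLU squared" means the LeakyReLU output is squared elementwise. The function name `leaky_relu_squared` is hypothetical and does not come from the submission; only the slope value 0.3 is taken from the listed parameters.

```python
def leaky_relu_squared(x: float, slope: float = 0.3) -> float:
    """Apply LeakyReLU with the given negative slope, then square the result.

    Assumption: squaring is applied after the leaky rectification, so
    negative inputs contribute (slope * x) ** 2 instead of being zeroed.
    """
    y = x if x >= 0.0 else slope * x
    return y * y


def mlp_activation(values: list[float], slope: float = 0.3) -> list[float]:
    """Elementwise application over a hidden vector (illustrative only)."""
    return [leaky_relu_squared(v, slope) for v in values]
```

Under this reading, a negative pre-activation still yields a small positive output (scaled by slope squared), unlike a plain squared ReLU, which would map it to zero.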
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.75}
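The ordering implied by "score-first" can be sketched on a toy model, assuming it means each chunk of evaluation tokens is scored with weights that have not yet been updated on that chunk, and only then used for a gradient step. The context-free logit model, the chunking, and the name `score_first_ttt` are illustrative assumptions; the submission's actual model and update rule are not described here. Only the learning rate 0.75 comes from the listed parameters.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def score_first_ttt(tokens: list[int], vocab_size: int,
                    chunk_size: int, lr: float = 0.75) -> float:
    """Return mean NLL (nats/token) where every chunk is scored before
    the toy model takes a gradient step on it."""
    logits = [0.0] * vocab_size  # toy model: one logit per token id
    total_nll = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        probs = softmax(logits)
        # 1) Score first: these weights have never seen this chunk.
        total_nll += sum(-math.log(probs[t]) for t in chunk)
        # 2) Then adapt: the gradient of the chunk's mean cross-entropy
        #    with respect to the logits is probs minus empirical frequencies.
        grad = list(probs)
        for t in chunk:
            grad[t] -= 1.0 / len(chunk)
        logits = [w - lr * g for w, g in zip(logits, grad)]
    return total_nll / len(tokens)
```

Because the score for a chunk is accumulated before the update, later chunks benefit from adaptation while no token is ever evaluated by weights trained on it.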
Other
other
Closed-form causal n-gram tilt with three experts and closed-form renormalization (Z); the static n-gram hint table is precomputed outside validate().
parameters: {"experts":3,"token_order":16,"within_doc":true,"word_order":4,"precompute_outside_validate":true}
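One way a tilt with closed-form renormalization can work is sketched below. This is an assumption about the general technique, not the submission's code: each expert proposes at most one hinted next token from past context only (the causal part), the model probability of a hinted token is multiplied by exp(beta), and the normalizer Z is computed from the hinted tokens alone, with no sum over the vocabulary. The names `tilted_log_prob`, `hints`, and `betas` are hypothetical.

```python
import math
from typing import Optional, Sequence


def tilted_log_prob(log_p: Sequence[float], target: int,
                    hints: Sequence[Optional[int]],
                    betas: Sequence[float]) -> float:
    """Return log q(target), where q(x) is proportional to
    p(x) * exp(sum of beta_k over experts k whose hint equals x).

    log_p must hold normalized log-probabilities over the vocabulary.
    A hint of None means that expert abstains at this position.
    """
    # Accumulate the total boost per distinct hinted token.
    boost: dict[int, float] = {}
    for hint, beta in zip(hints, betas):
        if hint is not None:
            boost[hint] = boost.get(hint, 0.0) + beta
    # Closed form: Z = 1 + sum_h p(h) * (exp(boost_h) - 1).
    # Unhinted tokens keep their mass, so only hinted tokens enter Z.
    z = 1.0 + sum(math.exp(log_p[h]) * math.expm1(b)
                  for h, b in boost.items())
    return log_p[target] + boost.get(target, 0.0) - math.log(z)
```

With three experts, `hints` and `betas` would each have three entries. The tilted distribution still sums to one, so the tilted scores remain valid log-probabilities.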
Sequence Length
sequence_length
train_length: null
eval_length: 2560
Regularization
logit softcap
parameters: {"asymmetric_logit_rescale":true}
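For reference, a common form of logit softcapping is a scaled tanh, sketched below. This shows only the standard symmetric cap; the asymmetric logit rescale flagged in the parameters is not described on this page, so it is not modeled here. The cap value 30.0 is a placeholder assumption, not a value from the submission.

```python
import math


def softcap(logit: float, cap: float = 30.0) -> float:
    """Smoothly squash a logit into the open interval (-cap, cap).

    Near zero the function is close to the identity; large magnitudes
    saturate toward +cap or -cap.
    """
    return cap * math.tanh(logit / cap)


def softcap_all(logits: list[float], cap: float = 30.0) -> list[float]:
    """Apply the soft cap elementwise to a logit vector."""
    return [softcap(v, cap) for v in logits]
```

The cap bounds how confident any single logit can become, which is why it is listed under regularization.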

Novel Contributions

  • V21 stack composition with AWQ-Lite mixed-precision GPTQ and Asymmetric Logit Rescale
  • TTT/QK environment knob tuning including TTT_LR=0.75, QK_GAIN_INIT=5.25, and TTT_NO_QV_MASK=1
  • LeakyReLU slope 0.3 patch
  • Closed-form causal n-gram tilt with three experts and closed-form renormalization
  • Static n-gram hint table precomputed outside validate(), leaving the validation score unchanged