PR #1967 (open)
Record: V21 + N-gram Tilt + LeakyReLU 0.3 — val_bpb 1.05851 (3-seed mean)
by ndokutovich
val_bpb
1.0585
Architecture
Transformer
Optimizer
—
Artifact Size
15,949,305 bytes
Training Techniques
Quantization
GPTQ-lite
bits: null
scope: mixed-precision weights
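The card names GPTQ-lite with mixed-precision weights but leaves the bit width null. As an illustration only — not the submission's actual quantizer — here is a minimal symmetric per-row weight quantizer in the spirit of such schemes; the 8-bit setting and all helper names are assumptions:

```python
import numpy as np

def quantize_rowwise(w: np.ndarray, bits: int) -> tuple[np.ndarray, np.ndarray]:
    """Symmetric per-row quantization: int codes plus one fp scale per row.
    (Illustrative stand-in; the PR's GPTQ-lite details are not in the card.)"""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0  # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, s = quantize_rowwise(w, bits=8)
err = np.abs(w - dequantize(q, s)).max()  # worst-case per-weight error
```

"Mixed precision" would then amount to picking `bits` per layer (or keeping sensitive layers in fp16), which the card does not specify.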
Architecture
LeakyReLU
Squared LeakyReLU (negative slope 0.3) patched into the MLP activation path.
parameters: {"slope":0.3}
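The card gives only slope 0.3. One plausible reading of a squared LeakyReLU is a sign-preserving square of the LeakyReLU output; the function name and the exact squaring rule below are assumptions, not the submission's code:

```python
import numpy as np

def leaky_relu_squared(x: np.ndarray, slope: float = 0.3) -> np.ndarray:
    """Sign-preserving squared LeakyReLU (one possible reading):
    apply LeakyReLU with the given negative slope, then square the
    magnitude while keeping each branch's sign."""
    y = np.where(x > 0, x, slope * x)
    return np.sign(y) * y * y

x = np.array([-2.0, 0.0, 2.0])
y = leaky_relu_squared(x)  # -2 -> -(0.6)^2 = -0.36, 2 -> 4.0
```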
weight tying
Not mentioned in the submission.
parameters: null
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.75}
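Only learning_rate 0.75 appears in the card. Assuming "score-first" means each example is scored before the model adapts on it (so the score is never contaminated by fitting the example being scored), a toy score-then-update loop on a logistic model — the model and data are illustrative, not the submission's:

```python
import numpy as np

def ttt_score_first(xs, ys, w, lr=0.75):
    """Score-first test-time training: record each example's loss BEFORE
    taking an SGD step on it, so scoring precedes adaptation."""
    losses = []
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + np.exp(-float(x @ w)))
        losses.append(-(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9)))
        w = w - lr * (p - y) * x  # logistic-loss gradient step at TTT_LR
    return np.array(losses), w

rng = np.random.default_rng(1)
xs = rng.normal(size=(200, 4))
ys = (xs[:, 0] > 0).astype(float)          # toy separable stream
losses, w_adapted = ttt_score_first(xs, ys, np.zeros(4))
```

With a zero-initialized model the first score is exactly ln 2, and later scores drop as the model adapts within the stream.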
Other
other
Closed-form causal n-gram tilt with three experts and Z renormalization, with the hint table precomputed outside validate().
parameters: {"experts":3,"token_order":16,"within_doc":true,"word_order":4,"precompute_outside_validate":true}
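The exact tilt is not spelled out in the card. A common closed-form reading is a product-of-experts tilt: multiply the model distribution by each n-gram expert distribution raised to a weight, then renormalize once (the "Z" step). A sketch under that assumption, with three uniform placeholder experts standing in for the actual token-order-16 / word-order-4 tables:

```python
import numpy as np

def ngram_tilt(logits, experts, weights):
    """Product-of-experts tilt of softmax(logits) by n-gram distributions,
    followed by a single closed-form renormalization (the Z step)."""
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    for q, lam in zip(experts, weights):
        p = p * np.power(q + 1e-9, lam)   # 1e-9 guards zero-count n-grams
    z = p.sum(axis=-1, keepdims=True)      # closed-form renormalizer Z
    return p / z

vocab = 8
logits = np.log(np.arange(1, vocab + 1, dtype=float))
uniform = np.full(vocab, 1.0 / vocab)
experts = [uniform, uniform, uniform]      # three experts, as in the card
out = ngram_tilt(logits, experts, [0.5, 0.3, 0.2])
```

Uniform experts tilt every token equally, so the output reduces to the plain softmax — a handy sanity check that the Z renormalization is correct.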
Sequence Length
sequence_length
train_length: null
eval_length: 2560
Regularization
logit softcap
parameters: {"asymmetric_logit_rescale":true}
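A logit softcap conventionally means `cap * tanh(logits / cap)`; an "asymmetric" rescale plausibly applies different caps to the positive and negative sides. A sketch under that reading — the cap values 30 and 15 are chosen for illustration, not taken from the submission:

```python
import numpy as np

def asymmetric_softcap(logits, cap_pos=30.0, cap_neg=15.0):
    """Soft-cap logits with tanh, using a different cap on each side
    (one plausible reading of 'asymmetric logit rescale').
    Near zero this is approximately the identity; large logits saturate
    at +cap_pos / -cap_neg."""
    return np.where(
        logits >= 0,
        cap_pos * np.tanh(logits / cap_pos),
        cap_neg * np.tanh(logits / cap_neg),
    )

z = np.array([-100.0, -0.1, 0.1, 100.0])
capped = asymmetric_softcap(z)
```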
Novel Contributions
- V21 stack composition with AWQ-Lite mixed-precision GPTQ and Asymmetric Logit Rescale
- TTT/QK environment knob tuning including TTT_LR=0.75, QK_GAIN_INIT=5.25, and TTT_NO_QV_MASK=1
- LeakyReLU slope 0.3 patch
- Closed-form causal n-gram tilt with three experts and closed-form renormalization
- Static n-gram hint table precomputed outside validate() with identical validation score