PR #2140
open
Record: PR #2014 stack + LeakyReLU 0.3 + strict in-timer n-gram TTT (val_bpb 1.0560)
by simon-marcus
val_bpb
1.0560
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,997,965 bytes
Training Techniques
Architecture
LeakyReLU
Uses a LeakyReLU-square MLP path with a negative slope of 0.3 instead of 0.5.
parameters: {"slope":0.3}
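As a minimal sketch of what the slope change does, assuming "LeakyReLU-square" means a leaky ReLU followed by squaring (the function name `leaky_relu_square` is hypothetical and not taken from the PR):

```python
def leaky_relu_square(x: float, slope: float = 0.3) -> float:
    """Leaky ReLU followed by squaring (assumed form of the activation)."""
    # Negative inputs are scaled by `slope` before squaring.
    y = x if x >= 0.0 else slope * x
    return y * y


if __name__ == "__main__":
    # Compare the new slope (0.3) with the previous one (0.5).
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(x, leaky_relu_square(x, 0.3), leaky_relu_square(x, 0.5))
```

Under this reading, a negative input x contributes slope² · x², so lowering the slope from 0.5 to 0.3 shrinks the negative branch's weight from 0.25 to 0.09 while leaving positive inputs unchanged.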
Test-Time Training
LoRA TTT
parameters: {"rank":80,"learning_rate":0.0001,"chunk_size":64,"short_chunk_size":32,"prefix_docs":2500,"score_first":true}
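The `score_first` flag can be read as: each chunk is scored with the weights as they stood before the model adapted on that chunk. The loop below is a sketch under that assumption. `score_chunk` and `adapt_on_chunk` are hypothetical callbacks standing in for the real loss evaluation and LoRA update, and the toy "model" is just a set of seen tokens.

```python
from typing import Callable, List, Sequence


def split_chunks(tokens: Sequence[int], chunk_size: int) -> List[Sequence[int]]:
    """Split a token sequence into consecutive chunks of at most chunk_size."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def score_first_ttt(
    tokens: Sequence[int],
    chunk_size: int,
    score_chunk: Callable[[Sequence[int]], float],
    adapt_on_chunk: Callable[[Sequence[int]], None],
) -> float:
    """Score each chunk first, then adapt on it, so no chunk is scored
    by weights that have already trained on it."""
    total = 0.0
    for chunk in split_chunks(tokens, chunk_size):
        total += score_chunk(chunk)   # evaluate with pre-update state
        adapt_on_chunk(chunk)         # only afterwards update the adapter
    return total


# Toy stand-in for a model: the "loss" is the number of unseen tokens.
seen = set()


def toy_score(chunk: Sequence[int]) -> float:
    return float(sum(1 for t in chunk if t not in seen))


def toy_adapt(chunk: Sequence[int]) -> None:
    seen.update(chunk)


loss = score_first_ttt([1, 2, 3, 4, 1, 2, 3, 4], 4, toy_score, toy_adapt)
```

In the toy run the first chunk is charged in full because nothing was adapted before it was scored. Only the second chunk benefits from adaptation.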
Other
other
Strict in-timer causal online n-gram tilt applied during TTT evaluation, with hint precomputation inside the measured eval timer.
parameters: {"ngram_tilt_enabled":true,"hint_precompute_outside":0,"token_order":16,"word_order":4}
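One way to picture a causal online n-gram tilt is sketched below. It assumes the tilt adds a fixed bonus to the logit of the token that an n-gram table, built only from already-scored tokens, predicts next. The class `OnlineNgramHinter`, the bonus `beta`, and the single token-level table are illustrative assumptions. The PR's actual hint construction (token order 16, word order 4) is not reproduced here.

```python
import math
from collections import Counter, defaultdict
from typing import List, Optional, Sequence


class OnlineNgramHinter:
    """Hypothetical n-gram table updated only with tokens already scored."""

    def __init__(self, order: int) -> None:
        self.order = order
        self.table = defaultdict(Counter)

    def _context(self, history: Sequence[int]) -> tuple:
        n = self.order - 1
        return tuple(history[-n:]) if n > 0 else ()

    def hint(self, history: Sequence[int]) -> Optional[int]:
        counts = self.table.get(self._context(history))
        if not counts:
            return None
        return counts.most_common(1)[0][0]

    def update(self, history: Sequence[int], token: int) -> None:
        self.table[self._context(history)][token] += 1


def tilted_log_prob(logits: Sequence[float], target: int,
                    hint: Optional[int], beta: float) -> float:
    """Log-probability of target after adding beta to the hinted logit."""
    adj = list(logits)
    if hint is not None:
        adj[hint] += beta
    m = max(adj)
    lse = m + math.log(sum(math.exp(v - m) for v in adj))
    return adj[target] - lse


def score_with_tilt(tokens: Sequence[int], logits: List[Sequence[float]],
                    order: int, beta: float) -> float:
    """Total negative log-likelihood; the hint at position i sees tokens[:i] only."""
    hinter = OnlineNgramHinter(order)
    nll = 0.0
    for i, tok in enumerate(tokens):
        history = tokens[:i]
        nll -= tilted_log_prob(logits[i], tok, hinter.hint(history), beta)
        hinter.update(history, tok)  # table learns the token after scoring it
    return nll


tokens = [0, 1, 0, 1, 0, 1]
uniform = [[0.0, 0.0, 0.0] for _ in tokens]
baseline = score_with_tilt(tokens, uniform, order=2, beta=0.0)
tilted = score_with_tilt(tokens, uniform, order=2, beta=2.0)
```

Because the table is updated only after each token is scored, the hint at a given position never depends on that position's target, which is the causality property the description emphasizes.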
Sequence Length
sequence_length
train_length: 3072
eval_length: 3072
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85}
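A warmdown fraction of 0.85 can be illustrated as follows, assuming the learning rate stays at its peak for the first 15% of training and then decays linearly to zero over the remaining 85%. The linear shape, the step-based (rather than time-based) progress measure, and the name `lr_multiplier` are assumptions, not details confirmed by the PR.

```python
def lr_multiplier(step: int, total_steps: int, warmdown_frac: float = 0.85) -> float:
    """Multiplier on the peak learning rate at a given step."""
    if warmdown_frac <= 0.0:
        return 1.0  # no warmdown phase at all
    warmdown_start = (1.0 - warmdown_frac) * total_steps
    if step < warmdown_start:
        return 1.0  # constant phase before the warmdown begins
    # Linear decay from 1.0 at warmdown_start to 0.0 at total_steps.
    remaining = total_steps - step
    return max(0.0, remaining / (total_steps - warmdown_start))


if __name__ == "__main__":
    for step in (0, 100, 150, 575, 1000):
        print(step, lr_multiplier(step, total_steps=1000))
```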
Regularization
weight decay
parameters: {"weight_decay":0.5}
logit softcap
parameters: {"asym_logit_rescale":1}
Quantization
GPTQ
bits: 8
scope: pergroup
Compression
pergroup
level: null
Novel Contributions
- LeakyReLU-square slope changed from 0.5 to 0.3.
- Strict in-timer online n-gram tilt during TTT evaluation.
- Causal n-gram hint generation performed inside the measured eval timer.
- PR #2014 stack combined with larger TTT chunks to fit hinting and scoring within the 600s eval budget.
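The "in-timer" constraint can be sketched as a single clock that covers both hint precomputation and scoring, so that neither phase runs outside the measured budget. The helper `timed_eval` and its callbacks are hypothetical. Only the 600s figure comes from the description above.

```python
import time
from typing import Callable, Tuple, TypeVar

H = TypeVar("H")
R = TypeVar("R")


def timed_eval(precompute_hints: Callable[[], H],
               score: Callable[[H], R],
               budget_s: float = 600.0) -> Tuple[R, float, bool]:
    """Run hint precomputation and scoring under one timer."""
    start = time.perf_counter()
    hints = precompute_hints()   # counted against the budget
    result = score(hints)        # also counted against the budget
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= budget_s


# Toy usage with stand-in callables.
result, elapsed, within_budget = timed_eval(lambda: [1, 2], lambda h: sum(h))
```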