PR #2140

open

Record: PR #2014 stack + LeakyReLU 0.3 + strict in-timer n-gram TTT (val_bpb 1.0560)

by simon-marcus

val_bpb

1.0560

Architecture

Transformer

Optimizer

AdamW

Artifact Size

15,997,965 bytes

Training Techniques

Architecture

LeakyReLU

Uses LeakyReLU-square MLP path with slope 0.3 instead of 0.5.

parameters: {"slope":0.3}
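A minimal sketch of the activation described above, assuming "LeakyReLU-square" means squaring the output of a standard LeakyReLU. The function name `leaky_relu_square` is illustrative and not taken from the PR.

```python
def leaky_relu_square(x: float, slope: float = 0.3) -> float:
    """Square of LeakyReLU(x). The slope scales negative inputs only.

    Assumption: the MLP path applies LeakyReLU first, then squares the result.
    """
    y = x if x >= 0.0 else slope * x
    return y * y


# Compare the new slope (0.3) with the previous one (0.5) on a few inputs.
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, leaky_relu_square(x, 0.3), leaky_relu_square(x, 0.5))
```

Under this reading, positive inputs are unaffected by the change, while a negative input of -1.0 contributes 0.09 with slope 0.3 rather than 0.25 with slope 0.5.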

Test-Time Training

LoRA TTT

parameters: {"rank":80,"learning_rate":0.0001,"chunk_size":64,"short_chunk_size":32,"prefix_docs":2500,"score_first":true}
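The `score_first` and `chunk_size` parameters suggest a loop in which each chunk is scored before the model adapts on it. The sketch below illustrates only that ordering. It swaps the LoRA adapter and its gradient steps for a toy count-based model (`ToyUnigram`, a hypothetical stand-in), and it does not model `rank`, `learning_rate`, `short_chunk_size`, or `prefix_docs`.

```python
import math
from collections import Counter


def chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def score_first_eval(tokens, chunk_size, score, adapt):
    """Score each chunk with the current state, then adapt on that chunk."""
    total_nll = 0.0
    for chunk in chunks(tokens, chunk_size):
        total_nll += score(chunk)  # the model has not seen this chunk yet
        adapt(chunk)               # only now may the chunk influence the model
    return total_nll


class ToyUnigram:
    """Stand-in for an adaptable model: add-one smoothed unigram counts."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.counts = Counter()
        self.total = 0

    def score(self, chunk):
        """Negative log-likelihood (nats) of the chunk under frozen counts."""
        denom = self.total + self.vocab_size
        return -sum(math.log((self.counts[t] + 1) / denom) for t in chunk)

    def adapt(self, chunk):
        """Update counts; in the PR this role is played by LoRA updates."""
        self.counts.update(chunk)
        self.total += len(chunk)


model = ToyUnigram(vocab_size=2)
print(score_first_eval([0, 0, 0, 0], chunk_size=2, score=model.score, adapt=model.adapt))
```

Because scoring always precedes adaptation, no token is evaluated by a model that has already trained on it.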

Other

other

Strict in-timer causal online n-gram tilt applied during TTT evaluation, with hint precomputation inside the measured eval timer.

parameters: {"ngram_tilt_enabled":true,"hint_precompute_outside":0,"token_order":16,"word_order":4}
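One way to picture a causal online n-gram tilt is sketched below. It is an illustration under stated assumptions, not the PR's implementation: the hint is taken to be the most frequent continuation of the longest previously seen context, the tilt is taken to multiply the hinted token's probability by `exp(beta)` before renormalizing, and `beta`, `OnlineNgramHinter`, and `tilted_log_prob` are hypothetical names. Only the token-level table is shown; the word-level table (`word_order`) is omitted.

```python
import math
from collections import Counter, defaultdict


class OnlineNgramHinter:
    """N-gram table built only from tokens that have already been scored."""

    def __init__(self, order):
        self.order = order  # e.g. 16 for the token-level table
        self.table = defaultdict(Counter)

    def hint(self, history):
        """Most frequent continuation of the longest seen context, or None."""
        for n in range(min(self.order - 1, len(history)), 0, -1):
            ctx = tuple(history[-n:])
            if ctx in self.table:
                return self.table[ctx].most_common(1)[0][0]
        return None

    def update(self, history, token):
        """Record the token under every context length up to order - 1."""
        for n in range(1, min(self.order - 1, len(history)) + 1):
            self.table[tuple(history[-n:])][token] += 1


def tilted_log_prob(log_probs, target, hint, beta):
    """log p'(target), where p'(v) is proportional to p(v) * exp(beta * [v == hint])."""
    if hint is None:
        return log_probs[target]
    p_hint = math.exp(log_probs[hint])
    log_z = math.log(1.0 + p_hint * (math.exp(beta) - 1.0))  # renormalizer
    bonus = beta if target == hint else 0.0
    return log_probs[target] + bonus - log_z


def score_sequence(tokens, model_log_probs, order, beta):
    """Causal loop: hint from the past, score the token, then update the table."""
    hinter = OnlineNgramHinter(order)
    total_nll = 0.0
    for i, tok in enumerate(tokens):
        history = tokens[:i]
        hint = hinter.hint(history)
        total_nll -= tilted_log_prob(model_log_probs(history), tok, hint, beta)
        hinter.update(history, tok)
    return total_nll
```

In this sketch the table is built while scoring rather than ahead of time, so the hinting cost falls inside whatever timer wraps `score_sequence`, which is the property `hint_precompute_outside: 0` appears to describe.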

Sequence Length

sequence_length

train_length: 3072

eval_length: 3072

LR Schedule

warmdown

parameters: {"warmdown_frac":0.85}
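As a rough illustration of `warmdown_frac`, the sketch below holds the learning-rate multiplier at 1.0 and then decays it over the final 85% of training. The linear decay shape and the name `lr_scale` are assumptions; the source states only the fraction.

```python
def lr_scale(step: int, total_steps: int, warmdown_frac: float = 0.85) -> float:
    """Multiplier on the base learning rate.

    Assumption: constant at 1.0 until warmdown begins, then linear decay to 0.
    """
    warmdown_steps = int(round(total_steps * warmdown_frac))
    start = total_steps - warmdown_steps
    if warmdown_steps == 0 or step < start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)


# With 1000 steps, warmdown starts at step 150 and reaches 0 at step 1000.
for step in (0, 150, 575, 1000):
    print(step, lr_scale(step, 1000))
```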

Regularization

weight decay

parameters: {"weight_decay":0.5}

logit softcap

parameters: {"asym_logit_rescale":1}
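For context, a common form of logit softcapping squashes logits through a scaled tanh. The sketch below shows that generic form only. The cap value of 30.0 is a hypothetical placeholder, and the `asym_logit_rescale` option is not modeled because the source does not define it.

```python
import math


def softcap(logit: float, cap: float = 30.0) -> float:
    """Generic tanh softcap: near-identity for small logits, bounded by +/- cap.

    Assumption: cap = 30.0 is a placeholder, not a value from the PR.
    """
    return cap * math.tanh(logit / cap)


for z in (0.1, 10.0, 100.0):
    print(z, softcap(z))
```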

Quantization

GPTQ

bits: 8

scope: pergroup
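To illustrate what a per-group scope means at 8 bits, the sketch below gives each group of weights its own scale. It uses plain symmetric round-to-nearest, not GPTQ's error-compensating procedure, and the group size of 4 is a hypothetical value chosen to keep the example small.

```python
def quantize_group(values, bits=8):
    """Symmetric round-to-nearest quantization of one group with its own scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def quantize_pergroup(weights, group_size, bits=8):
    """Quantize a flat weight list group by group."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


def dequantize_pergroup(groups):
    """Reconstruct approximate weights from (codes, scale) pairs."""
    return [code * scale for codes, scale in groups for code in codes]


weights = [1.0, -0.5, 0.25, 0.0, 100.0, 50.0, -100.0, 0.0]
groups = quantize_pergroup(weights, group_size=4)
print(dequantize_pergroup(groups))
```

Giving each group its own scale keeps the small-magnitude group from inheriting the coarse step size that the large-magnitude group requires.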

Compression

pergroup

level: null

LeakyReLU-square slope changed from 0.5 to 0.3.
Strict in-timer online n-gram tilt during TTT evaluation.
Causal n-gram hint generation performed inside the measured eval timer.
PR #2014 stack combined with larger TTT chunks to fit hinting and scoring within the 600s eval budget.