PR #1965
openRecord candidate: long-context no-QV rank56/prefix3000 TTT — val_bpb 1.05875
by himanshudongre
val_bpb: 1.0587
Architecture: Transformer
Optimizer: —
Artifact Size: 15,980,110 bytes
Training Techniques
- Test-Time Training: score-first TTT
  parameters: {"rank":56,"prefix_docs":3000,"num_phases":3,"mask":"no_qv","local_lr_mult":0.75,"qk_gain_init":5.25}
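The PR does not include the TTT update itself. A minimal sketch, assuming LoRA-style low-rank adapters (rank 4 here standing in for rank 56) attached to projections other than Q and V, refined by gradient steps at test time; the surrogate loss, dimensions, and init below are all illustrative, not the submission's code:

```python
import numpy as np

def ttt_step(W, A, B, x, target, lr=0.05):
    # One test-time gradient step on a squared-error surrogate.
    # Only the low-rank factor B is updated; base weight W stays frozen.
    y = (W + B @ A) @ x
    err = y - target                  # dL/dy for L = 0.5 * ||y - target||^2
    grad_B = np.outer(err, A @ x)     # chain rule: dL/dB = err (A x)^T
    return B - lr * grad_B

# Toy setup; "no_qv" means adapters attach to projections other than
# Q and V (e.g. the MLP / output projections).
d, rank = 8, 4
W = np.eye(d)                         # frozen base weight
A = np.eye(rank, d)                   # fixed toy init (LoRA would draw A randomly)
B = np.zeros((d, rank))               # zero init: adapted layer starts equal to base
x = np.ones(d)
target = 2.0 * np.ones(d)
for _ in range(100):
    B = ttt_step(W, A, B, x, target)
y = (W + B @ A) @ x
print(float(np.abs(y - target).max()))  # error shrinks geometrically toward 0
```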
Sequence Length
- train_length: null
- eval_length: 2560
Quantization
- GPTQ-lite
  bits: null
  scope: block weights
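Neither the bit-width (`bits: null`) nor the GPTQ-lite recipe is recorded. As a sketch of the group-wise scaling that per-group weight quantizers share (full GPTQ additionally compensates rounding error with second-order information, which is omitted here):

```python
import numpy as np

def quantize_pergroup(w, bits=4, group_size=64):
    # Round-to-nearest quantization with one absmax scale per group of
    # `group_size` consecutive weights. bits=4 is an illustrative choice.
    qmax = 2 ** (bits - 1) - 1
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                      # guard all-zero groups
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, scale = quantize_pergroup(w)
w_hat = dequantize(q, scale)
print(float(np.abs(w - w_hat).max()))  # bounded by half a quantization step per group
```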
Architecture
- SmearGate: BOS-fixed SmearGate used in the inherited stack
  parameters: {"window":12}
- Gated Attention: SparseAttnGate / gated attention component in the inherited stack
  parameters: {"scale":0.5}
- Weight tying: tied embeddings; not explicitly stated, but implied by the standard GPT-style setup
  parameters: null
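If tying is indeed present, it is the standard GPT trick: the unembedding projection reuses the token-embedding matrix, halving the embedding parameter count. A minimal sketch with the submission's vocabulary size and an illustrative model width:

```python
import numpy as np

# GPT-style weight tying: logits are computed as h @ W_embed.T, so the
# model stores one (vocab_size x d_model) matrix for both directions.
vocab_size, d_model = 8192, 16   # d_model is illustrative
rng = np.random.default_rng(0)
W_embed = rng.standard_normal((vocab_size, d_model)) * 0.02

def embed(token_ids):
    return W_embed[token_ids]    # (seq, d_model)

def logits(h):
    return h @ W_embed.T         # (seq, vocab_size), tied to the embedding

h = embed(np.array([1, 2, 3]))
print(logits(h).shape)           # (3, 8192)
```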
Regularization
- logit softcap
  parameters: null
- weight decay
  parameters: {"value":0.5}
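The cap value is not recorded (`parameters: null`). The usual form, as popularized by Gemma-style models, squashes logits through a scaled tanh so they stay in a bounded range with finite gradients; the `cap=15.0` below is an illustrative choice:

```python
import numpy as np

def softcap(logits, cap=15.0):
    # Smoothly bounds logits to (-cap, cap): near-identity for small
    # values, saturating for large ones.
    return cap * np.tanh(logits / cap)

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
print(softcap(x))  # small inputs pass through almost unchanged; large ones saturate near ±15
```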
LR Schedule
- warmdown
  parameters: {"warmdown_frac":0.85}
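Assuming `warmdown_frac` is the fraction of training spent in the decay phase (an interpretation, not confirmed by the PR), the schedule holds the learning rate flat for the first 15% of steps and then decays it linearly to zero:

```python
def warmdown_lr(step, total_steps, base_lr=1.0, warmdown_frac=0.85):
    # Constant LR, then a linear "warmdown" to zero over the final
    # warmdown_frac of training. How the fraction is applied is an
    # assumption; some codebases parameterize the flat phase instead.
    warmdown_steps = int(total_steps * warmdown_frac)
    flat_steps = total_steps - warmdown_steps
    if step < flat_steps:
        return base_lr
    frac_done = (step - flat_steps) / warmdown_steps
    return base_lr * (1.0 - frac_done)

total = 1000
print(warmdown_lr(0, total), warmdown_lr(150, total), warmdown_lr(999, total))
```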
Compression
- pergroup
  level: null
Other
- CaseOps lossless tokenizer with validation byte-sidecar accounting
  parameters: {"vocab_size":8192}
Novel Contributions
- Long-context no-QV TTT refinement with rank-56 / prefix-3000 compute reallocation
- 3-seed record-candidate rerun under strict 600s train/eval caps
- Score-first phased TTT budget tradeoff that improves val_bpb on some seeds while remaining reproducible
- CaseOps byte-sidecar validation accounting with full-vocabulary distribution over 8192 tokens