PR #1965
openRecord candidate: long-context no-QV rank56/prefix3000 TTT — val_bpb 1.05875
by himanshudongre
val_bpb: 1.0587
Architecture: Transformer
Optimizer: —
Artifact Size: 15,980,110 bytes
Training Techniques
- Test-Time Training: score-first TTT
  parameters: {"rank":56,"prefix_docs":3000,"num_phases":3,"mask":"no_qv","local_lr_mult":0.75,"qk_gain_init":5.25}
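The PR does not include the TTT update itself. A minimal sketch, assuming LoRA-style low-rank adapters (rank 4 here standing in for rank 56) attached to projections other than Q and V, refined by gradient steps at test time; the surrogate loss, dimensions, and init below are all illustrative, not the submission's code:

```python
import numpy as np

def ttt_step(W, A, B, x, target, lr=0.05):
    # One test-time gradient step on a squared-error surrogate.
    # Only the low-rank factor B is updated; base weight W stays frozen.
    y = (W + B @ A) @ x
    err = y - target                  # dL/dy for L = 0.5 * ||y - target||^2
    grad_B = np.outer(err, A @ x)     # chain rule: dL/dB = err (A x)^T
    return B - lr * grad_B

# Toy setup; "no_qv" means adapters attach to projections other than
# Q and V (e.g. the MLP / output projections).
d, rank = 8, 4
W = np.eye(d)                         # frozen base weight
A = np.eye(rank, d)                   # fixed toy init (LoRA would draw A randomly)
B = np.zeros((d, rank))               # zero init: adapted layer starts equal to base
x = np.ones(d)
target = 2.0 * np.ones(d)
for _ in range(100):
    B = ttt_step(W, A, B, x, target)
y = (W + B @ A) @ x
print(float(np.abs(y - target).max()))  # error shrinks geometrically toward 0
```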
Sequence Length
- train_length: null
- eval_length: 2560
Quantization
- GPTQ-lite
  bits: null
  scope: block weights
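Neither the bit-width (`bits: null`) nor the GPTQ-lite recipe is recorded. As a sketch of the group-wise scaling that per-group weight quantizers share (full GPTQ additionally compensates rounding error with second-order information, which is omitted here):

```python
import numpy as np

def quantize_pergroup(w, bits=4, group_size=64):
    # Round-to-nearest quantization with one absmax scale per group of
    # `group_size` consecutive weights. bits=4 is an illustrative choice.
    qmax = 2 ** (bits - 1) - 1
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                      # guard all-zero groups
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, scale = quantize_pergroup(w)
w_hat = dequantize(q, scale)
print(float(np.abs(w - w_hat).max()))  # bounded by half a quantization step per group
```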
Architecture
- SmearGate: BOS-fixed SmearGate used in the inherited stack
  parameters: {"window":12}
- Gated Attention: SparseAttnGate / gated attention component in the inherited stack
  parameters: {"scale":0.5}
- Weight tying: tied embeddings; not explicitly stated, but implied by the standard GPT-style setup
  parameters: null
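If tying is indeed present, it is the standard GPT trick: the unembedding projection reuses the token-embedding matrix, halving the embedding parameter count. A minimal sketch with the submission's vocabulary size and an illustrative model width:

```python
import numpy as np

# GPT-style weight tying: logits are computed as h @ W_embed.T, so the
# model stores one (vocab_size x d_model) matrix for both directions.
vocab_size, d_model = 8192, 16   # d_model is illustrative
rng = np.random.default_rng(0)
W_embed = rng.standard_normal((vocab_size, d_model)) * 0.02

def embed(token_ids):
    return W_embed[token_ids]    # (seq, d_model)

def logits(h):
    return h @ W_embed.T         # (seq, vocab_size), tied to the embedding

h = embed(np.array([1, 2, 3]))
print(logits(h).shape)           # (3, 8192)
```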
Regularization
- logit softcap
  parameters: null
- weight decay
  parameters: {"value":0.5}
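The cap value is not recorded (`parameters: null`). The usual form, as popularized by Gemma-style models, squashes logits through a scaled tanh so they stay in a bounded range with finite gradients; the `cap=15.0` below is an illustrative choice:

```python
import numpy as np

def softcap(logits, cap=15.0):
    # Smoothly bounds logits to (-cap, cap): near-identity for small
    # values, saturating for large ones.
    return cap * np.tanh(logits / cap)

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
print(softcap(x))  # small inputs pass through almost unchanged; large ones saturate near ±15
```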
LR Schedule
- warmdown
  parameters: {"warmdown_frac":0.85}
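Assuming `warmdown_frac` is the fraction of training spent in the decay phase (an interpretation, not confirmed by the PR), the schedule holds the learning rate flat for the first 15% of steps and then decays it linearly to zero:

```python
def warmdown_lr(step, total_steps, base_lr=1.0, warmdown_frac=0.85):
    # Constant LR, then a linear "warmdown" to zero over the final
    # warmdown_frac of training. How the fraction is applied is an
    # assumption; some codebases parameterize the flat phase instead.
    warmdown_steps = int(total_steps * warmdown_frac)
    flat_steps = total_steps - warmdown_steps
    if step < flat_steps:
        return base_lr
    frac_done = (step - flat_steps) / warmdown_steps
    return base_lr * (1.0 - frac_done)

total = 1000
print(warmdown_lr(0, total), warmdown_lr(150, total), warmdown_lr(999, total))
```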
Compression
- pergroup
  level: null
Other
- CaseOps lossless tokenizer with validation byte-sidecar accounting
  parameters: {"vocab_size":8192}
Novel Contributions
- Long-context no-QV TTT refinement with rank-56 / prefix-3000 compute reallocation
- 3-seed record-candidate rerun under strict 600s train/eval caps
- Score-first phased TTT budget tradeoff that improves val_bpb on some seeds while remaining reproducible
- CaseOps byte-sidecar validation accounting with full-vocabulary distribution over 8192 tokens