PR #2050
open
Add SP8192 CaseOps + 2560 no-Q/V Adaptive Hedge n-gram (1.06083 BPB)
by AidenGeunGeun
val_bpb
1.0608
Architecture
Transformer
Optimizer
—
Artifact Size
15,932,067 bytes
Training Techniques
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.000075}
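As a rough illustration of the score-first idea, the sketch below scores each chunk of a document with the current weights and only afterwards takes a gradient step on that same chunk, so no token is scored by weights that have already seen it. The toy unigram-logit model, the function name `score_first_ttt`, and the chunking scheme are assumptions made for illustration; only the learning rate value comes from the entry above, and the PR's actual model and update rule are not shown here.

```python
import math

LEARNING_RATE = 0.000075  # value listed in the PR metadata


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def score_first_ttt(doc_tokens, vocab_size, chunk_size, lr=LEARNING_RATE):
    """Return the total negative log-likelihood (nats) for one document.

    Hypothetical toy model: a single vector of unigram logits that is
    reset at the start of every document (per-document TTT).
    """
    logits = [0.0] * vocab_size
    total_nll = 0.0
    for start in range(0, len(doc_tokens), chunk_size):
        chunk = doc_tokens[start:start + chunk_size]
        # 1) Score the chunk with the weights as they are right now.
        probs = softmax(logits)
        total_nll += -sum(math.log(probs[t]) for t in chunk)
        # 2) Only then update on the chunk that was just scored.
        #    Gradient of the summed NLL w.r.t. the logits is n * p - counts.
        grad = [len(chunk) * p for p in probs]
        for t in chunk:
            grad[t] -= 1.0
        logits = [w - lr * g for w, g in zip(logits, grad)]
    return total_nll
```

In this sketch the score of the first chunk never depends on `lr`, which is the property that keeps the evaluation causal.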
Sequence Length
sequence_length
train_length: null
eval_length: 2560
Architecture
weight tying
Uses the official SP8192 CaseOps token alphabet and an eval-time normalized token-level causal n-gram Adaptive Hedge scoring overlay.
parameters: null
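To make the overlay idea concrete, here is a minimal sketch of Hedge (exponential-weights) mixing between a base model and a causal n-gram expert. It uses a fixed learning rate `eta`, a bigram expert with add-one smoothing, and a caller-supplied `base_probs_fn`; all of these are assumptions for illustration, and the PR's adaptive variant, n-gram order, and smoothing are not described in this entry. The mixture is a convex combination of two normalized distributions, so it stays normalized over the vocabulary, which is one plausible reading of "normalized" above.

```python
import math
from collections import defaultdict


class BigramExpert:
    """Causal bigram model with add-one smoothing (hypothetical stand-in)."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.counts = defaultdict(dict)
        self.totals = defaultdict(int)

    def probs(self, prev):
        row = self.counts[prev]
        denom = self.totals[prev] + self.vocab_size
        return [(row.get(t, 0) + 1.0) / denom for t in range(self.vocab_size)]

    def update(self, prev, tok):
        row = self.counts[prev]
        row[tok] = row.get(tok, 0) + 1
        self.totals[prev] += 1


def hedge_score(tokens, base_probs_fn, vocab_size, eta=1.0):
    """Return the total log loss (nats) of the Hedge mixture over tokens[1:]."""
    ngram = BigramExpert(vocab_size)
    log_w = [0.0, 0.0]  # log-weights for [base model, n-gram expert]
    total = 0.0
    for i in range(1, len(tokens)):
        prev, tok = tokens[i - 1], tokens[i]
        # Both experts see only the strict prefix tokens[:i].
        experts = [base_probs_fn(tokens[:i]), ngram.probs(prev)]
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        z = sum(w)
        w = [x / z for x in w]
        mix = [sum(wk * e[t] for wk, e in zip(w, experts))
               for t in range(vocab_size)]
        # Score first ...
        total += -math.log(mix[tok])
        # ... then update the expert weights and the n-gram counts.
        for k, e in enumerate(experts):
            log_w[k] += eta * math.log(e[tok])
        ngram.update(prev, tok)
    return total
```

With `eta=1.0` and log loss, this mixture can lose at most `log(2)` nats in total relative to the better of the two experts, which is why such an overlay is a low-risk addition at eval time.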
Other
other
Q/V LoRA disabled during eval-time TTT while K/MLP/O/lm_head remain active.
parameters: null
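The adapter selection described above can be sketched as a simple name filter. The dotted parameter names and the component labels (`q_proj`, `k_proj`, and so on) are hypothetical; the PR's actual module names are not shown in this entry.

```python
# Hypothetical component labels; the real identifiers may differ.
TTT_ACTIVE = {"k_proj", "mlp", "o_proj", "lm_head"}
TTT_DISABLED = {"q_proj", "v_proj"}


def is_ttt_trainable(param_name):
    """True if a LoRA parameter should be updated during eval-time TTT."""
    parts = set(param_name.split("."))
    if parts & TTT_DISABLED:
        return False
    return bool(parts & TTT_ACTIVE)


def split_adapters(param_names):
    """Partition adapter parameter names into (active, frozen) lists."""
    active = [n for n in param_names if is_ttt_trainable(n)]
    frozen = [n for n in param_names if not is_ttt_trainable(n)]
    return active, frozen
```

Matching on whole dotted components rather than substrings avoids accidentally enabling a parameter whose name merely contains one of the labels.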
Novel Contributions
- Seed 42 record-track proof at 1.06082922 BPB
- Three under-600-second seed proofs for reproducibility
- Eval-only follow-up from PR #1915 quantized artifacts
- 2560-token eval-time context with lower per-document TTT learning rate
- Q/V LoRA disabled during eval-time TTT; K/MLP/O/lm_head kept active
- Normalized token-level causal n-gram Adaptive Hedge scoring overlay over SP8192 CaseOps
- N-gram state built from the strict prefix only, with score-before-update per-document TTT
- Final package kept under the 16MB cap
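The size claim can be checked with simple arithmetic against the artifact size listed above. Whether the 16MB cap is meant as decimal (16,000,000 bytes) or binary (16 MiB) is not stated in this entry, so the sketch below checks both readings as an assumption.

```python
ARTIFACT_BYTES = 15_932_067  # artifact size listed in the PR metadata

# Both interpretations of "16MB" are assumptions; the entry does not say which applies.
CAPS = {
    "decimal 16 MB": 16_000_000,
    "binary 16 MiB": 16 * 1024 * 1024,
}


def headroom(artifact_bytes, cap_bytes):
    """Bytes remaining under the cap (negative if the cap is exceeded)."""
    return cap_bytes - artifact_bytes


for label, cap in CAPS.items():
    print(f"{label}: {headroom(ARTIFACT_BYTES, cap):,} bytes of headroom")
```

Under the stricter decimal reading the package still fits, with 67,933 bytes to spare.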