PR #2050

open

Add SP8192 CaseOps + 2560 no-Q/V Adaptive Hedge n-gram (1.06083 BPB)

by AidenGeunGeun
val_bpb
1.0608
Architecture
Transformer
Optimizer
Artifact Size
15,932,067 bytes

Training Techniques

Test-Time Training
score-first TTT
parameters: {"learning_rate":0.000075}
Sequence Length
sequence_length
train_length: null
eval_length: 2560
Architecture
weight tying
Uses the official SP8192 CaseOps token alphabet and an eval-time normalized token-level causal n-gram Adaptive Hedge scoring overlay.
parameters: null
Other
other
Q/V LoRA disabled during eval-time TTT while K/MLP/O/lm_head remain active.
parameters: null
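
The score-first ordering and the restricted adapter set described above can be illustrated with a small sketch. This is not the PR's code: the group names in `ALL_GROUPS`, the function names, and the toy logit-bias "model" are hypothetical stand-ins chosen only to show the ordering, assuming the document is processed in chunks and each chunk is scored before any gradient step that uses it.

```python
import math

# Hypothetical adapter group names. The summary only states that Q/V LoRA is
# disabled during eval-time TTT while K/MLP/O/lm_head remain active.
ALL_GROUPS = ("q", "k", "v", "o", "mlp", "lm_head")
FROZEN_AT_EVAL = {"q", "v"}


def active_groups():
    """Adapter groups that would still receive TTT updates at eval time."""
    return [g for g in ALL_GROUPS if g not in FROZEN_AT_EVAL]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def score_first_ttt(doc_tokens, vocab_size, chunk_len, lr):
    """Total NLL (nats) of one document under score-before-update TTT.

    Toy stand-in model: a single logit-bias vector, reset per document.
    """
    bias = [0.0] * vocab_size
    total_nll = 0.0
    for start in range(0, len(doc_tokens), chunk_len):
        chunk = doc_tokens[start:start + chunk_len]
        # 1) Score the chunk with the parameters as they are right now.
        probs = softmax(bias)
        total_nll += sum(-math.log(probs[t]) for t in chunk)
        # 2) Only afterwards take a gradient step on the chunk just scored.
        grad = [len(chunk) * p for p in probs]  # d(sum NLL)/d(bias)
        for t in chunk:
            grad[t] -= 1.0
        bias = [b - lr * g for b, g in zip(bias, grad)]
    return total_nll
```

Because the loss for a chunk is recorded before the update, no token is ever scored by parameters that have already seen it; setting `lr=0.0` recovers plain evaluation without adaptation.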

Novel Contributions

  • Seed42 record-track proof at 1.06082922 BPB
  • Three under-600-second seed proofs for reproducibility
  • Eval-only follow-up from PR #1915 quantized artifacts
  • 2560-token eval-time context with lower per-document TTT learning rate
  • Q/V LoRA disabled during eval-time TTT; K/MLP/O/lm_head kept active
  • Normalized token-level causal n-gram Adaptive Hedge scoring overlay over SP8192 CaseOps
  • Strict-prefix n-gram state only with score-before-update per-document TTT
  • Final package kept under the 16MB cap
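
As a rough illustration of the n-gram overlay bullets above, the sketch below mixes a model's per-token probabilities with a causal n-gram expert using Hedge (exponential weights) under log loss. Several things here are assumptions rather than details from the PR: the expert is an add-one-smoothed bigram (the PR's n-gram order is not stated in this summary), the learning rate `eta` is fixed rather than adaptive, and all names (`CausalBigram`, `hedge_score`, `bits_per_byte`) are hypothetical.

```python
import math
from collections import defaultdict


class CausalBigram:
    """Add-one-smoothed bigram expert built from the strict prefix only."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def prob(self, prev, tok):
        # Normalized over the vocabulary: sums to 1 across all tok values.
        return (self.counts[prev][tok] + 1) / (self.totals[prev] + self.vocab_size)

    def update(self, prev, tok):
        self.counts[prev][tok] += 1
        self.totals[prev] += 1


def hedge_score(tokens, model_probs, vocab_size, eta=1.0, bos=-1):
    """Total NLL (nats) of a Hedge mixture of the model and the n-gram expert.

    model_probs[i] is the model's probability of tokens[i] given tokens[:i].
    """
    ngram = CausalBigram(vocab_size)
    log_w = [0.0, 0.0]  # log-weights for [model, n-gram], start uniform
    total_nll = 0.0
    prev = bos
    for tok, p_model in zip(tokens, model_probs):
        expert_p = [p_model, ngram.prob(prev, tok)]
        # Normalize the current weights (computed from past tokens only).
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        z = sum(w)
        p_mix = sum((wi / z) * pi for wi, pi in zip(w, expert_p))
        total_nll += -math.log(p_mix)
        # Score first, then update expert weights and n-gram counts.
        log_w = [lw + eta * math.log(pi) for lw, pi in zip(log_w, expert_p)]
        ngram.update(prev, tok)
        prev = tok
    return total_nll


def bits_per_byte(total_nll_nats, n_bytes):
    """Convert a summed NLL in nats to bits per byte of the original text."""
    return total_nll_nats / (math.log(2) * n_bytes)
```

The mixture is a convex combination of two normalized distributions, so it is itself a valid distribution over the token alphabet. With `eta=1.0` and two experts, the mixture's total NLL exceeds that of the better expert by at most ln 2 nats, which is why an overlay of this kind can only cost a bounded amount when the n-gram expert is unhelpful.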