PR #2050

open

Add SP8192 CaseOps + 2560 no-Q/V Adaptive Hedge n-gram (1.06083 BPB)

by AidenGeunGeunView on GitHub

val_bpb

1.0608

Architecture

Transformer

Optimizer

—

Artifact Size

15,932,067 bytes

Training Techniques

Test-Time Training

score-first TTT

parameters: {"learning_rate":0.000075}

Sequence Length

sequence_length

train_length: null

eval_length: 2560

Architecture

weight tying

Uses the official SP8192 CaseOps token alphabet and an eval-time normalized token-level causal n-gram Adaptive Hedge scoring overlay.

parameters: null

Other

other

Q/V LoRA disabled during eval-time TTT while K/MLP/O/lm_head remain active.

parameters: null

Seed42 record-track proof at 1.06082922 BPB
Three under-600-second seed proofs for reproducibility
Eval-only follow-up from PR #1915 quantized artifacts
2560-token eval-time context with lower per-document TTT learning rate
Q/V LoRA disabled during eval-time TTT; K/MLP/O/lm_head kept active
Normalized token-level causal n-gram Adaptive Hedge scoring overlay over SP8192 CaseOps
Strict-prefix n-gram state only with score-before-update per-document TTT
Final package kept under the 16MB cap