PR #1482

open

Record: SP8192 + Pre-Quant TTT (QK 5.25, 8ep, freeze-1) — val_bpb 1.0787 (3-seed mean)

by aamodbhatt
val_bpb
1.0787
Architecture
Transformer
Optimizer
Artifact Size
16,000,000 bytes

Training Techniques

Quantization
GPTQ
bits: null
scope: model weights
Architecture
depth recurrence
Uses the SP8192 recurrence pipeline as part of the model stack.
parameters: null
Test-Time Training
full TTT
parameters: {"epochs":8,"learning_rate":0.00045,"freeze_blocks":1}
Evaluation
sliding window eval
parameters: {"stride":64}
Regularization
weight decay
parameters: null
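
A minimal sketch of how the test-time training parameters listed above might be used to pick which weights get updated. The summary does not say which block is frozen or how parameters are named, so two things here are assumptions: that "freeze_blocks": 1 means the first block stays fixed, and that parameters follow a hypothetical `blocks.<index>.` naming scheme. `trainable_names` is an illustrative helper, not a function from the PR.

```python
import json

# Parameters copied verbatim from the Test-Time Training entry.
TTT_PARAMS = json.loads('{"epochs":8,"learning_rate":0.00045,"freeze_blocks":1}')


def trainable_names(param_names, freeze_blocks, prefix="blocks."):
    """Return the parameter names that test-time training would update.

    Assumption: freezing N blocks means the first N blocks stay fixed.
    The "blocks.<i>." naming scheme is hypothetical.
    """
    # The trailing dot keeps "blocks.1." from also matching "blocks.10.".
    frozen = tuple(f"{prefix}{i}." for i in range(freeze_blocks))
    return [name for name in param_names if not name.startswith(frozen)]


# Hypothetical parameter names, for illustration only.
names = [
    "embed.weight",
    "blocks.0.attn.qkv.weight",
    "blocks.1.attn.qkv.weight",
    "blocks.10.mlp.weight",
]
to_update = trainable_names(names, TTT_PARAMS["freeze_blocks"])
```

Under these assumptions, only `blocks.0.*` is excluded, and the remaining weights would be trained for `TTT_PARAMS["epochs"]` passes at `TTT_PARAMS["learning_rate"]`.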

Novel Contributions

  • SP8192 pre-quant TTT lane with tuned QK gain initialization
  • Test-time training with 8 epochs, learning rate 0.00045, and freezing 1 block
  • 3-seed confirmation of improved sliding-window validation bpb (val_bpb 1.0787, 3-seed mean)
  • Use of sliding window evaluation with stride 64
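
A minimal sketch of sliding-window evaluation with stride 64, as one common way to implement it. Each window re-reads earlier tokens as context but scores only the tokens that no earlier window has scored, so every token counts exactly once. The context length of 128 in the example is an assumption (the summary gives only the stride), and `sliding_windows` and `bits_per_byte` are illustrative names rather than functions from the PR.

```python
import math


def sliding_windows(n_tokens, context_len, stride=64):
    """Plan evaluation windows as (start, end, score_from) tuples.

    The model would see tokens [start, end) and be scored only on
    [score_from, end), so each token is scored exactly once.
    """
    if not 0 < stride <= context_len:
        raise ValueError("stride must be in (0, context_len]")
    spans = []
    scored_upto = 0  # first token index not yet scored
    start = 0
    while scored_upto < n_tokens:
        end = min(start + context_len, n_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)


# Assumed context length of 128 tokens; stride 64 as listed in the summary.
spans = sliding_windows(n_tokens=200, context_len=128, stride=64)
```

After the first window, every full window scores only its last 64 tokens, so most scored tokens have at least `context_len - stride` tokens of preceding context. The trade-off is more forward passes than non-overlapping evaluation.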