PR #1518

open

Record: Wider Loop + Per-Pass Embeddings + Tap-In V6 + Legal TTT (1.078825 3-seed mean)

by abaybektursun
val_bpb
1.0788
Architecture
Transformer
Optimizer
Artifact Size
15,977,457 bytes

Training Techniques

Architecture
depth recurrence
Wider loop recurrence with 3 passes through 3 loop blocks instead of 4 passes through 2.
parameters: {"LOOP_START":3,"LOOP_END":5,"NUM_LOOPS":2,"passes":3,"loop_blocks":3}
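
As a rough illustration of the recurrence described above, the sketch below expands a loop configuration into the order in which blocks would run. It assumes that LOOP_END is inclusive (blocks 3, 4 and 5 form the loop) and uses a hypothetical total of 8 blocks; neither is stated in the PR. The names `layer_schedule` and `loop_executions` are illustrative, and the sketch uses `passes` directly without interpreting `NUM_LOOPS`.

```python
def layer_schedule(num_layers, loop_start, loop_end, passes):
    """Return the block indices in execution order.

    Assumption: loop_end is inclusive, and the looped blocks are
    replayed `passes` times in a row before the remaining blocks run.
    """
    if not (0 <= loop_start <= loop_end < num_layers):
        raise ValueError("loop range must lie inside the block stack")
    if passes < 1:
        raise ValueError("passes must be at least 1")
    prefix = list(range(loop_start))
    loop = list(range(loop_start, loop_end + 1))
    suffix = list(range(loop_end + 1, num_layers))
    return prefix + loop * passes + suffix


def loop_executions(loop_start, loop_end, passes):
    """Count how many times a looped block is executed in total."""
    return (loop_end - loop_start + 1) * passes


# num_layers=8 is a hypothetical depth chosen only for the example.
schedule = layer_schedule(num_layers=8, loop_start=3, loop_end=5, passes=3)
print(schedule)
print(loop_executions(3, 5, 3))  # 3 passes through 3 blocks
print(loop_executions(3, 4, 4))  # 4 passes through 2 blocks (hypothetical block range)
```

Under these assumptions the wider loop runs 9 looped block executions compared with 8 for the narrower one, while using fewer passes.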
loop embeddings
Per-pass learned loop embeddings, zero-initialized and fired at the start of each pass.
parameters: {"num_embeddings":3,"dimension":512,"init":"zero"}
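
A minimal sketch of how per-pass loop embeddings might be applied, assuming "fired" means the embedding for the current pass is added to every position of the hidden state before the looped blocks run. That interpretation, the plain-list representation, and the names `make_loop_embeddings`, `add_pass_embedding` and `run_loop` are assumptions for illustration, not the PR's code.

```python
def make_loop_embeddings(num_embeddings=3, dimension=512):
    """Zero-initialized table with one embedding vector per pass."""
    return [[0.0] * dimension for _ in range(num_embeddings)]


def add_pass_embedding(hidden, table, pass_index):
    """Add the embedding for this pass to every position of `hidden`.

    `hidden` is a list of positions, each a list of `dimension` floats.
    """
    emb = table[pass_index]
    return [[h + e for h, e in zip(row, emb)] for row in hidden]


def run_loop(hidden, table, blocks):
    """Run one pass per embedding: add the embedding, then apply the blocks."""
    for pass_index in range(len(table)):
        hidden = add_pass_embedding(hidden, table, pass_index)
        for block in blocks:
            hidden = block(hidden)
    return hidden


# With zero init the embeddings change nothing until training moves them.
table = make_loop_embeddings(num_embeddings=3, dimension=4)
identity = lambda h: h  # stand-in for a real transformer block
hidden = [[1.0, 2.0, 3.0, 4.0]]
print(run_loop(hidden, table, [identity, identity, identity]))
```

Zero initialization means the model starts out behaving exactly as it would without the embeddings, so the per-pass signal only appears once training moves the table away from zero.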
Regularization
Hessian clipping
parameters: {"lambda":0}
Evaluation
Tap-In V6 cross-window
parameters: {"bigram_idf_rule":true,"cross_window":true}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"freeze_blocks":0,"epochs":3,"chunk_tokens":32768}
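
The score-first ordering can be sketched as follows: each chunk is scored with the current weights before the model is allowed to train on it, so no token is evaluated by a model that has already seen it. This is a generic outline under that assumption; `score_fn` and `train_fn` are hypothetical callables standing in for the real forward pass and optimizer step, and the defaults simply mirror the parameters listed above.

```python
def chunked(tokens, chunk_tokens):
    """Yield consecutive slices of at most `chunk_tokens` tokens."""
    for start in range(0, len(tokens), chunk_tokens):
        yield tokens[start:start + chunk_tokens]


def score_first_ttt(tokens, model, score_fn, train_fn,
                    chunk_tokens=32768, epochs=3, learning_rate=0.005):
    """Evaluate with test-time training, scoring each chunk before adapting.

    score_fn(model, chunk) -> summed loss over the chunk (hypothetical)
    train_fn(model, chunk, learning_rate) -> None, updates model in place
    Returns the mean loss per token over the whole stream.
    """
    total_loss = 0.0
    total_tokens = 0
    for chunk in chunked(tokens, chunk_tokens):
        # 1) Score first: the model has not trained on this chunk yet.
        total_loss += score_fn(model, chunk)
        total_tokens += len(chunk)
        # 2) Then adapt on the chunk that was just scored.
        for _ in range(epochs):
            train_fn(model, chunk, learning_rate)
    return total_loss / max(total_tokens, 1)
```

With `freeze_blocks` set to 0, as in the listed parameters, `train_fn` would presumably update every block; the sketch leaves that choice to the caller.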
Quantization
int6
bits: 6
scope: model
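
For readers unfamiliar with 6-bit weights, the sketch below shows one common way to quantize to int6: symmetric rounding with a single scale per tensor. The PR does not say which scheme it uses, so the symmetric range of -31 to 31, the per-tensor scale, and the function names are all assumptions.

```python
def quantize_int6(weights):
    """Symmetric per-tensor quantization to 6-bit signed codes.

    Assumption: codes are restricted to [-31, 31] so the range is
    symmetric around zero; the scale maps the largest magnitude to 31.
    """
    qmax = 31
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int6(codes, scale):
    """Map 6-bit codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


weights = [-1.0, -0.4, 0.0, 0.25, 1.0]
codes, scale = quantize_int6(weights)
restored = dequantize_int6(codes, scale)
print(codes)
print(restored)
```

Each restored weight differs from the original by at most half a quantization step, which is the trade made for storing 6 bits per weight.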

Novel Contributions

  • Wider depth recurrence with more loop block executions (3 passes × 3 blocks = 9, versus 4 passes × 2 blocks = 8)
  • Per-pass learned loop embeddings
  • Pinning the Hessian clip lambda to 0 after the default value failed
  • Tap-In V6 cross-window evaluation with bigram-IDF matching
  • Legal score-first test-time training stacked on Tap-In V6