PR #1398

open

Non-Record: SP1024 + Depth Recurrence + Adaptive Markov Curriculum + Legal TTT — val_bpb 1.1047

by Mertyandimata
val_bpb
1.1047
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,888,861 bytes

Training Techniques

Architecture
depth recurrence
Applies the same block recurrently across depth, increasing effective depth without adding parameters.
parameters: null
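The PR does not include its implementation, but depth recurrence can be sketched minimally: one shared transformer layer applied several times, so effective depth grows at no parameter cost (class and hyperparameter names below are illustrative, not the submission's code).

```python
import torch
import torch.nn as nn

class DepthRecurrentEncoder(nn.Module):
    """One shared transformer layer applied n_loops times: effective depth
    grows while the parameter count stays that of a single layer."""

    def __init__(self, d_model=64, n_heads=4, n_loops=4):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.n_loops = n_loops

    def forward(self, x):
        for _ in range(self.n_loops):
            x = self.layer(x)  # same weights reused at every depth step
        return x

model = DepthRecurrentEncoder()
out = model(torch.randn(2, 8, 64))  # shape preserved: (2, 8, 64)
```

This is attractive under a 16MB artifact budget: looping a layer four times buys depth that would otherwise cost four layers' worth of weights.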
Other
other
Adaptive Markov curriculum using bigram-surprise-weighted loss scaling to prioritize learnable token sequences.
parameters: {"bigram_prior":true,"loss_scaling_max":1.15}
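The parameters reveal only a bigram prior and a loss-scaling cap of 1.15; the exact weighting rule is not in the PR. One plausible reading, sketched here as an assumption, is to scale each token's cross-entropy by its normalized surprise under the bigram prior, clamped at `loss_scaling_max`:

```python
import torch

def bigram_surprise_weights(tokens, bigram_logp, scaling_max=1.15):
    """Hypothetical per-token loss weights from a bigram prior.
    tokens: (B, T) token ids; bigram_logp: (V, V) log P(next | prev).
    Surprise is normalized to mean 1.0, then clamped at scaling_max."""
    prev, nxt = tokens[:, :-1], tokens[:, 1:]
    surprise = -bigram_logp[prev, nxt]    # -log p(next | prev) under the prior
    weights = surprise / surprise.mean()  # average weight ~= 1.0
    return weights.clamp(max=scaling_max)

# Toy check: under a flat bigram prior every token is equally surprising,
# so all weights come out exactly 1.0.
logp = torch.full((4, 4), -torch.log(torch.tensor(4.0)))
tokens = torch.tensor([[0, 1, 2, 3]])
w = bigram_surprise_weights(tokens, logp)  # -> all ones, max <= 1.15
```

The resulting weights would multiply the per-token loss before reduction; the 1.15 cap keeps the curriculum from over-emphasizing any single bigram.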
Weight Averaging
EMA + SWA
parameters: {"blend":"30/70"}
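The "30/70" blend presumably mixes the EMA and SWA parameter averages into a single checkpoint rather than picking one. A minimal sketch, reading the split as 30% EMA / 70% SWA (the function name and which side gets 30% are assumptions):

```python
import torch

def blend_checkpoints(ema_state, swa_state, ema_frac=0.30):
    """Blend two weight-averaged state dicts:
    ema_frac * EMA + (1 - ema_frac) * SWA, per tensor."""
    return {
        k: ema_frac * ema_state[k] + (1.0 - ema_frac) * swa_state[k]
        for k in ema_state
    }

ema = {"w": torch.tensor([1.0, 1.0])}
swa = {"w": torch.tensor([0.0, 2.0])}
blended = blend_checkpoints(ema, swa)  # -> {"w": tensor([0.30, 1.70])}
```

Blending hedges between EMA's recency bias and SWA's flat-minimum behavior at zero extra training cost.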
Evaluation
sliding window eval
parameters: null
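Sliding-window evaluation scores a long sequence in overlapping chunks so that (almost) every token is predicted with left context, while each position contributes to the loss exactly once. A sketch with assumed window/stride values (the submission's are not stated):

```python
def sliding_window_nll(score_fn, tokens, window=8, stride=4):
    """Average per-token NLL over a long sequence using overlapping windows.
    score_fn(chunk) must return len(chunk) - 1 NLLs, one per predicted token.
    Overlap only supplies context; already-scored positions are skipped."""
    total, count, scored_upto = 0.0, 0, 1  # positions [1, scored_upto) done
    for start in range(0, len(tokens), stride):
        chunk = tokens[start:start + window]
        if len(chunk) < 2:
            break
        nlls = score_fn(chunk)             # predictions for chunk[1:]
        first_pos = start + 1              # absolute position of nlls[0]
        skip = max(scored_upto - first_pos, 0)
        total += sum(nlls[skip:])
        count += len(nlls) - skip
        scored_upto = start + len(chunk)
    return total / count

# With a dummy scorer returning 1.0 per predicted token, the mean NLL is 1.0.
tokens = list(range(10))
avg = sliding_window_nll(lambda c: [1.0] * (len(c) - 1), tokens)  # -> 1.0
```

Compared with scoring disjoint chunks, this avoids the artificially high loss at each chunk boundary, which matters for a bits-per-byte metric like val_bpb.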
Test-Time Training
TTT
parameters: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
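With warmdown_steps = 3500 from the entry above, a warmdown schedule typically holds the base learning rate flat and then decays linearly to zero over the final 3,500 steps. The flat-then-linear shape and the total step count below are assumptions for illustration:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR until the warmdown phase, then linear decay to 0
    over the final warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

# Illustrative run of 5,000 steps: warmdown begins at step 1,500.
warmdown_lr(0, 5000, 1.0)     # flat phase: 1.0
warmdown_lr(3250, 5000, 1.0)  # halfway through warmdown: 0.5
warmdown_lr(5000, 5000, 1.0)  # fully decayed: 0.0
```

This trapezoid-without-warmup shape is common in speedrun-style training, where the flat phase does the bulk of the learning and the warmdown consolidates it.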
Regularization
weight decay
parameters: {"value":0.04}

Novel Contributions

  • Adaptive Markov curriculum with bigram-surprise-weighted loss scaling
  • Auto-QMax budget search to fill the 16MB artifact budget
  • EMA + SWA blend instead of choosing a single averaging method
  • Depth recurrence combined with legal test-time training