PR #1397

closed

Non-Record: SP1024 + Depth Recurrence + Adaptive Markov Curriculum + Auto-QMax GPTQ + TTT — val_bpb 1.1047

by Mertyandimata
val_bpb
1.1047
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,888,861 bytes

Training Techniques

Architecture
depth recurrence
Uses recurrent depth-style computation in the model.
parameters: null
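Depth recurrence reuses one block's weights across several sequential applications instead of stacking distinct layers. A minimal sketch of the control flow; the recurrence count `n_recur` is a hypothetical parameter (the PR lists none):

```python
def depth_recurrent_forward(block, x, n_recur=4):
    # Depth recurrence: apply the SAME block (shared weights) n_recur
    # times, trading parameter count for repeated compute.
    # n_recur=4 is an illustrative choice, not from the PR.
    for _ in range(n_recur):
        x = block(x)
    return x
```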
Quantization
GPTQ
bits: null
scope: all
Weight Averaging
EMA + SWA
parameters: {"blend":"30/70"}
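The 30/70 blend can be computed per parameter. A minimal sketch assuming the ratio means 30% EMA / 70% SWA (the PR does not say which side gets which weight), with plain floats standing in for weight tensors:

```python
def blend_ema_swa(ema_params, swa_params, ema_weight=0.30, swa_weight=0.70):
    # Weighted average of two parameter snapshots, entry by entry.
    # The 30/70 split is from the PR; the EMA-vs-SWA assignment is assumed.
    return {name: ema_weight * ema_params[name] + swa_weight * swa_params[name]
            for name in ema_params}
```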
Evaluation
sliding window eval
parameters: null
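Sliding-window evaluation typically scores a long sequence in overlapping windows, counting loss only on each window's new tokens so every scored token keeps near-full left context. A sketch under those assumptions; `nll_of(chunk, k)` is a hypothetical callback returning the summed negative log-likelihood of `chunk[k:]` given `chunk[:k]`, and the window/stride values are illustrative:

```python
import math

def sliding_window_bpb(nll_of, tokens, window=1024, stride=512):
    # Score `tokens` in windows of `window`, advancing by `stride`;
    # only the last (end - start) tokens of each window contribute loss.
    total_nll, n_scored = 0.0, 0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        ctx_start = max(0, end - window)
        chunk = tokens[ctx_start:end]
        k = start - ctx_start  # tokens before k are context only
        total_nll += nll_of(chunk, k)
        n_scored += end - start
    return total_nll / n_scored / math.log(2)  # nats -> bits per byte
```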
Test-Time Training
full TTT
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"warmdown_steps":3500,"layers":10,"dim":512,"heads":8,"kv_heads":4}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
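A warmdown schedule is commonly a constant learning rate followed by a linear decay to zero over the final steps. A sketch assuming that shape; the PR records only `warmdown_steps=3500`:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    # Constant LR, then linear decay to zero over the last `warmdown_steps`.
    # The constant-then-linear shape is an assumption; only
    # warmdown_steps=3500 is stated in the PR.
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```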
Regularization
weight decay
parameters: {"value":0.04}
Other
other
Adaptive Markov curriculum using bigram-surprise-weighted loss scaling to prioritize learnable token sequences.
parameters: {"max_gradient_scale":1.15}
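One plausible reading of bigram-surprise-weighted loss scaling: compute each token's surprise under a smoothed bigram model and map it to a per-token loss weight capped at `max_gradient_scale=1.15` (the only parameter the PR states). The mapping slope and smoothing below are hypothetical:

```python
import math

def bigram_surprise_weights(tokens, bigram_counts, unigram_counts,
                            max_gradient_scale=1.15, vocab_size=256):
    # Per-token loss weights from bigram surprise. Higher surprise
    # (lower bigram probability) is up-weighted, capped at 1.15.
    # The 0.01 slope and add-one smoothing are illustrative assumptions.
    weights = []
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigram_counts.get((prev, cur), 0) + 1) / \
            (unigram_counts.get(prev, 0) + vocab_size)
        surprise = -math.log(p)
        weights.append(min(1.0 + 0.01 * surprise, max_gradient_scale))
    return weights
```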
other
Auto-QMax budget search via binary search over clip range to maximize use of the 16MB artifact budget.
parameters: null
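The Auto-QMax search can be sketched as a plain binary search for the largest clip value whose serialized artifact still fits the 16MB budget. `size_of` is an assumed callback returning the artifact size in bytes for a given clip value, taken to be monotonically nondecreasing in the clip; the search bounds are hypothetical:

```python
def auto_qmax_search(size_of, lo=0.0, hi=8.0,
                     budget=16 * 1024 * 1024, iters=30):
    # Binary search over the quantization clip range for the largest
    # value whose artifact size stays within the 16MB budget.
    # Assumes size_of(clip) is nondecreasing in clip.
    best = lo
    for _ in range(iters):
        mid = (lo + hi) / 2
        if size_of(mid) <= budget:
            best, lo = mid, mid  # fits: try a larger clip
        else:
            hi = mid             # too big: shrink the clip
    return best
```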

Novel Contributions

  • Adaptive Markov curriculum with bigram-surprise-weighted loss scaling
  • Auto-QMax budget search to fill the 16MB artifact budget
  • EMA + SWA blend instead of choosing a single averaging method
  • Sliding window evaluation
  • Test-time training