PR #1397

closed

Non-Record: SP1024 + Depth Recurrence + Adaptive Markov Curriculum + Auto-QMax GPTQ + TTT — val_bpb 1.1047

by Mertyandimata
val_bpb
1.1047
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,888,861 bytes

Training Techniques

Architecture
depth recurrence
Uses recurrent depth-style computation in the model.
parameters: null
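Depth recurrence reuses one block's weights across several sequential applications instead of stacking distinct layers. A minimal sketch of the control flow; the recurrence count `n_recur` is a hypothetical parameter (the PR lists none):

```python
def depth_recurrent_forward(block, x, n_recur=4):
    # Depth recurrence: apply the SAME block (shared weights) n_recur
    # times, trading parameter count for repeated compute.
    # n_recur=4 is an illustrative choice, not from the PR.
    for _ in range(n_recur):
        x = block(x)
    return x
```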
Quantization
GPTQ
bits: null
scope: all
Weight Averaging
EMA + SWA
parameters: {"blend":"30/70"}
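The 30/70 blend can be computed per parameter. A minimal sketch assuming the ratio means 30% EMA / 70% SWA (the PR does not say which side gets which weight), with plain floats standing in for weight tensors:

```python
def blend_ema_swa(ema_params, swa_params, ema_weight=0.30, swa_weight=0.70):
    # Weighted average of two parameter snapshots, entry by entry.
    # The 30/70 split is from the PR; the EMA-vs-SWA assignment is assumed.
    return {name: ema_weight * ema_params[name] + swa_weight * swa_params[name]
            for name in ema_params}
```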
Evaluation
sliding window eval
parameters: null
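Sliding-window evaluation typically scores a long sequence in overlapping windows, counting loss only on each window's new tokens so every scored token keeps near-full left context. A sketch under those assumptions; `nll_of(chunk, k)` is a hypothetical callback returning the summed negative log-likelihood of `chunk[k:]` given `chunk[:k]`, and the window/stride values are illustrative:

```python
import math

def sliding_window_bpb(nll_of, tokens, window=1024, stride=512):
    # Score `tokens` in windows of `window`, advancing by `stride`;
    # only the last (end - start) tokens of each window contribute loss.
    total_nll, n_scored = 0.0, 0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        ctx_start = max(0, end - window)
        chunk = tokens[ctx_start:end]
        k = start - ctx_start  # tokens before k are context only
        total_nll += nll_of(chunk, k)
        n_scored += end - start
    return total_nll / n_scored / math.log(2)  # nats -> bits per byte
```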
Test-Time Training
full TTT
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"warmdown_steps":3500,"layers":10,"dim":512,"heads":8,"kv_heads":4}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
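A warmdown schedule is commonly a constant learning rate followed by a linear decay to zero over the final steps. A sketch assuming that shape; the PR records only `warmdown_steps=3500`:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    # Constant LR, then linear decay to zero over the last `warmdown_steps`.
    # The constant-then-linear shape is an assumption; only
    # warmdown_steps=3500 is stated in the PR.
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```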
Regularization
weight decay
parameters: {"value":0.04}
Other
other
Adaptive Markov curriculum using bigram-surprise-weighted loss scaling to prioritize learnable token sequences.
parameters: {"max_gradient_scale":1.15}
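One plausible reading of bigram-surprise-weighted loss scaling: compute each token's surprise under a smoothed bigram model and map it to a per-token loss weight capped at `max_gradient_scale=1.15` (the only parameter the PR states). The mapping slope and smoothing below are hypothetical:

```python
import math

def bigram_surprise_weights(tokens, bigram_counts, unigram_counts,
                            max_gradient_scale=1.15, vocab_size=256):
    # Per-token loss weights from bigram surprise. Higher surprise
    # (lower bigram probability) is up-weighted, capped at 1.15.
    # The 0.01 slope and add-one smoothing are illustrative assumptions.
    weights = []
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigram_counts.get((prev, cur), 0) + 1) / \
            (unigram_counts.get(prev, 0) + vocab_size)
        surprise = -math.log(p)
        weights.append(min(1.0 + 0.01 * surprise, max_gradient_scale))
    return weights
```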
other
Auto-QMax budget search via binary search over clip range to maximize use of the 16MB artifact budget.
parameters: null
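The Auto-QMax search can be sketched as a plain binary search for the largest clip value whose serialized artifact still fits the 16MB budget. `size_of` is an assumed callback returning the artifact size in bytes for a given clip value, taken to be monotonically nondecreasing in the clip; the search bounds are hypothetical:

```python
def auto_qmax_search(size_of, lo=0.0, hi=8.0,
                     budget=16 * 1024 * 1024, iters=30):
    # Binary search over the quantization clip range for the largest
    # value whose artifact size stays within the 16MB budget.
    # Assumes size_of(clip) is nondecreasing in clip.
    best = lo
    for _ in range(iters):
        mid = (lo + hi) / 2
        if size_of(mid) <= budget:
            best, lo = mid, mid  # fits: try a larger clip
        else:
            hi = mid             # too big: shrink the clip
    return best
```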

Novel Contributions

  • Adaptive Markov curriculum with bigram-surprise-weighted loss scaling
  • Auto-QMax budget search to fill the 16MB artifact budget
  • EMA + SWA blend instead of choosing a single averaging method
  • Sliding window evaluation
  • Test-time training