PR #1639
Adaptive Test-Time Training (TTT) with continuous LR-scaling.
by kunwar-vikrant
val_bpb
1.0832
Architecture
Transformer
Optimizer
SGD
Artifact Size
15.28 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: matrices
int8
bits: 8
scope: embeddings
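The int8 embedding pass above can be illustrated with a minimal round-trip sketch. This is symmetric per-tensor quantization only; the submission's GPTQ 6-bit pass over matrices is a more involved, error-compensating scheme, and the function names here are illustrative, not from the submission.

```python
# Hedged sketch of symmetric int8 quantization (w ≈ scale * q),
# as might be applied to the embedding table. Per-tensor scaling
# is an assumption; real pipelines often scale per-row or per-channel.

def quantize_int8(weights):
    """Quantize a list of floats to int8 with one shared scale."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the int8 codes."""
    return [qi * scale for qi in q]

q, s = quantize_int8([0.5, -1.27, 0.003])
w_hat = dequantize(q, s)
```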
Architecture
depth recurrence
3-layer depth recurrence, with looping activated partway through training
parameters: {"loops":"3-5","activated_at":"35% training"}
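The recurrence scheme above (a shared block looped 3-5 times, enabled from 35% of training) can be sketched as follows. The function and parameter names are illustrative assumptions, not the submission's code.

```python
# Hedged sketch of depth recurrence: a shared block is applied
# `loops` times, but only after a fraction of training has elapsed;
# before that the block runs once.

def forward_recurrent(x, block, step, total_steps,
                      activate_frac=0.35, loops=3):
    n = loops if step / total_steps >= activate_frac else 1
    for _ in range(n):
        x = block(x)
    return x

# Toy block (x -> x + 1) so the loop count is visible in the output.
inc = lambda x: x + 1
```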
Partial RoPE
Partial rotary positional embeddings
parameters: {"layers":"16/64"}
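The per-pair rotation at the heart of RoPE can be sketched as below; per the listing, only a subset of layers (16 of 64) would apply it, e.g. gated by `layer_idx < 16`. That gating rule and the names here are assumptions.

```python
import math

# Hedged sketch of the standard RoPE rotation for one (even, odd)
# feature pair at position `pos`. "Partial" here means only some
# layers apply it, per the listing's 16/64 parameter.

def rope_pair(x0, x1, pos, theta):
    """Rotate the pair (x0, x1) by angle pos * theta."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return x0 * c - x1 * s, x0 * s + x1 * c
```

The rotation is norm-preserving, which is why it can be applied to queries and keys without rescaling attention logits.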
XSA
XSA used in all layers
parameters: null
LeakyReLU
Leaky ReLU activation in the MLP
parameters: {"slope":0.5}
U-Net skip connections
Parallel residual / skip-style connections in later layers
parameters: {"layers":"L7+"}
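The skip-connection entry above can be sketched as saving early-layer activations and adding them back into later layers (L7+ per the listing). The LIFO pairing of early and late layers is an assumption borrowed from U-Net-style transformer stacks.

```python
# Hedged sketch of U-Net-style skips in a layer stack: outputs of
# layers before `skip_start` are pushed onto a stack, and layers from
# `skip_start` onward pop one and add it to their input.

def forward_with_skips(x, layers, skip_start=7):
    saved = []
    for i, layer in enumerate(layers):
        if i < skip_start:
            x = layer(x)
            saved.append(x)      # remember early activation
        else:
            if saved:
                x = x + saved.pop()  # add back a saved activation
            x = layer(x)
    return x
```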
Weight Averaging
EMA
parameters: {"decay":0.9965}
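The EMA weight averaging listed above (decay 0.9965) follows the standard update; a minimal sketch:

```python
# Hedged sketch of EMA weight averaging: the averaged weights track
# the live training weights with exponential smoothing. Treating the
# weights as flat lists is a simplification.

def ema_update(avg, current, decay=0.9965):
    return [decay * a + (1.0 - decay) * c for a, c in zip(avg, current)]

avg = [0.0, 0.0]
for _ in range(3):
    avg = ema_update(avg, [1.0, 2.0])
```

After k updates against a constant target c starting from zero, the average equals c * (1 - decay^k).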
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"variant":"MuonEq-R"}
LR Schedule
warmdown
parameters: {"warmdown_frac":0.72}
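A common reading of a warmdown schedule with `warmdown_frac` 0.72 is: hold the LR flat for the first 28% of steps, then decay linearly to zero over the final 72%. The exact decay shape in the submission may differ; this is a hedged sketch.

```python
# Hedged sketch of a warmdown LR schedule: constant LR, then linear
# decay to zero over the last `warmdown_frac` of training.

def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.72):
    start = int(total_steps * (1.0 - warmdown_frac))
    if step < start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - start)
```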
Regularization
layerwise LN scale
parameters: null
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"base_epochs":3,"adaptive_epochs":true,"max_epochs":5,"min_epochs":1,"ema_alpha":0.3}
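The adaptive-epochs piece of the parameters above (base_epochs 3, clamped to [1, 5], ema_alpha 0.3) can be sketched as comparing each chunk's NLL to a running EMA of chunk NLLs. The scaling rule (epochs proportional to the NLL ratio) is an illustrative assumption.

```python
# Hedged sketch of difficulty-aware epoch allocation for TTT:
# chunks with above-average NLL (relative to a running EMA) get more
# test-time training epochs, clamped to [min_epochs, max_epochs].

def allocate_epochs(chunk_nll, ema_nll, base_epochs=3,
                    min_epochs=1, max_epochs=5):
    ratio = chunk_nll / ema_nll      # >1 means harder than average
    epochs = round(base_epochs * ratio)
    return max(min_epochs, min(max_epochs, epochs))

def update_ema(ema_nll, chunk_nll, alpha=0.3):
    return alpha * chunk_nll + (1.0 - alpha) * ema_nll
```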
Evaluation
sliding window eval
parameters: {"window":32768}
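Sliding-window evaluation with a 32768-token window can be sketched as scoring a long sequence in fixed windows. Real implementations usually overlap windows and score only the tail of each so every token sees long context; this non-overlapping version is the simplest variant and an assumption about the setup here.

```python
# Hedged sketch of windowed evaluation: split a long token sequence
# into fixed-size windows (32768 in the listing) to be scored
# independently.

def windows(seq, window=32768):
    return [seq[i:i + window] for i in range(0, len(seq), window)]
```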
Compression
brotli
level: 11
Novel Contributions
- Adaptive test-time training with per-chunk difficulty-aware epoch allocation
- Use of chunk NLL relative to a running EMA mean to estimate chunk difficulty
- Discovery that chunk-level NLL variance is too narrow for integer epoch allocation to matter
- Proposal of continuous LR-scaling to avoid rounding dead-zones in adaptive TTT
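The last bullet's proposal can be sketched as scaling the TTT learning rate continuously with relative difficulty, so that small NLL deviations still change the effective update size instead of rounding away to the same integer epoch count. The clamping range and names below are illustrative assumptions.

```python
# Hedged sketch of continuous LR-scaling for adaptive TTT: the
# per-chunk learning rate is the base LR (0.005 in the listing)
# scaled by the chunk's NLL relative to the running EMA, clamped to
# an assumed [0.5x, 2x] range.

def scaled_lr(chunk_nll, ema_nll, base_lr=0.005, lo=0.5, hi=2.0):
    ratio = chunk_nll / ema_nll
    return base_lr * max(lo, min(hi, ratio))
```

Unlike integer epoch allocation, this has no rounding dead-zone: a chunk 5% harder than average gets a 5% larger LR rather than the same epoch count.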