PR #1837

open

[non-record / wishlist] E2E TTT — full-model SGD per chunk, val_bpb 1.07063, demonstrates "healing property"

by X-Abhishek-X
val_bpb: 1.0706
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15,961,787 B

Training Techniques

Test-Time Training: full TTT
  parameters: {"learning_rate":0.000005,"momentum":0.9,"grad_clip":1,"param_subset":"all","chunk_size":48,"distributed_lockstep":true}
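The per-chunk, full-model TTT update with these hyperparameters can be sketched as follows. This is a minimal illustration, not the PR's code: the "model" is a hypothetical single parameter with a user-supplied `loss_and_grad`, standing in for a full transformer where every parameter gets the same score-first, clip, momentum-SGD treatment.

```python
# Sketch of score-first, per-chunk test-time training (hypothetical toy model).
# Hyperparameters mirror the PR's TTT config; the 1-parameter "model" is an
# illustrative stand-in for updating all transformer parameters.
LR, MOMENTUM, GRAD_CLIP, CHUNK = 5e-6, 0.9, 1.0, 48

def ttt_stream(tokens, w, loss_and_grad):
    """Score each chunk with the CURRENT weights, then adapt on that chunk."""
    velocity, losses = 0.0, []
    for i in range(0, len(tokens), CHUNK):
        chunk = tokens[i:i + CHUNK]
        loss, grad = loss_and_grad(w, chunk)          # score first (no leak)
        losses.append(loss)
        grad = max(-GRAD_CLIP, min(GRAD_CLIP, grad))  # gradient clipping
        velocity = MOMENTUM * velocity + grad         # momentum accumulation
        w -= LR * velocity                            # SGD step
    return w, losses
```

Because each chunk is scored before the optimizer step that consumes it, the reported loss never sees weights trained on that chunk, which is the single-pass, no-leak property the PR's unit tests check.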
Optimizer: SGD
  weight_decay: null
  momentum: 0.9
  other_params: {"grad_clip":1}
Quantization:
  GPTQ (bits: 6, scope: full model)
  SpinQuant (bits: null, scope: full model)
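For intuition about what the 6-bit setting means, here is a plain round-to-nearest symmetric uniform quantizer. This is deliberately simpler than either method listed above: GPTQ uses Hessian-aware column-by-column error compensation and SpinQuant uses learned rotations, so this sketch only illustrates the bit-width, not the algorithms.

```python
# Illustrative round-to-nearest symmetric uniform quantize/dequantize.
# NOT GPTQ or SpinQuant; this only shows the precision a 6-bit grid allows.
def quantize_dequantize(weights, bits=6):
    qmax = 2 ** (bits - 1) - 1                       # 31 levels each side for 6-bit signed
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) * scale for w in weights]
```

With 6 bits the per-weight rounding error is bounded by half a grid step, which is the degradation the TTT "healing property" then recovers from.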
Evaluation: sliding window eval
  parameters: {"chunk_size":48,"context_length":2048}
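The sliding-window evaluation can be sketched as below: each 48-token chunk is scored with up to 2,048 tokens of preceding context, and summed negative log-likelihood is converted from nats to bits per byte (val_bpb). The `nll_fn` here is a hypothetical stand-in for the model's scoring call.

```python
import math

# Sketch of chunked sliding-window bpb evaluation. nll_fn(context, chunk)
# is a hypothetical stand-in returning the model's summed NLL in nats
# over the chunk, conditioned on the sliding context.
CHUNK, CONTEXT = 48, 2048

def eval_bpb(tokens, n_bytes, nll_fn):
    total_nats = 0.0
    for i in range(0, len(tokens), CHUNK):
        ctx = tokens[max(0, i - CONTEXT):i]          # sliding context window
        total_nats += nll_fn(ctx, tokens[i:i + CHUNK])
    return total_nats / (math.log(2) * n_bytes)      # nats -> bits per byte
```

Dividing by the raw byte count (rather than token count) is what makes bpb comparable across tokenizers.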
Sequence Length:
  train_length: null
  eval_length: 2048
Weight Averaging: EMA
  parameters: {"decay":0.9965}
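The EMA update with the listed decay is the standard per-parameter exponential moving average; a minimal sketch, assuming weights stored as a name-to-value mapping:

```python
# Standard EMA weight average with the PR's decay of 0.9965:
# shadow <- decay * shadow + (1 - decay) * current, per parameter.
DECAY = 0.9965

def ema_update(shadow, current, decay=DECAY):
    return {k: decay * shadow[k] + (1 - decay) * current[k] for k in shadow}
```

At decay 0.9965 the shadow weights average over roughly the last 1 / (1 - 0.9965) ≈ 286 optimizer steps, smoothing the noisy per-chunk SGD updates.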
Regularization: weight decay
  parameters: {"weight_decay":0.095}

Novel Contributions

  • End-to-end full-model test-time training that adapts all parameters per chunk instead of using LoRA adapters
  • Distributed lockstep gradient synchronization with all-reduce mean before each optimizer step so all ranks remain byte-identical
  • Empirical "healing property" observation: recovery from severe post-quantization degradation back to near pre-quantization performance
  • Param-subset throttling framework for ablations over all parameters, normalization scales, or control-scale tensors
  • Score-first TTT legality proof and unit tests for causal, single-pass, no-leak evaluation
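The lockstep synchronization in the second bullet can be sketched as below. A real multi-GPU run would use a collective such as `torch.distributed.all_reduce`; here the ranks are simulated as plain lists to show the invariant: after the mean-reduce, every rank holds identical gradients, so identical SGD steps keep all ranks byte-identical.

```python
# Simulated all-reduce mean across ranks (illustrative; a real run would
# use torch.distributed). Each rank's local gradient vector is replaced
# by the element-wise mean over all ranks before the optimizer step.
def allreduce_mean(per_rank_grads):
    """per_rank_grads: one gradient vector per rank; returns synced copies."""
    world = len(per_rank_grads)
    mean = [sum(g[i] for g in per_rank_grads) / world
            for i in range(len(per_rank_grads[0]))]
    return [list(mean) for _ in per_rank_grads]   # every rank gets the same grad
```

Averaging before (not after) the optimizer step matters: momentum buffers are updated from the already-synchronized gradient, so the buffers themselves also stay identical across ranks.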