PR #1837

open

[non-record / wishlist] E2E TTT — full-model SGD per chunk, val_bpb 1.07063, demonstrates "healing property"

by X-Abhishek-X
val_bpb: 1.0706
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15,961,787 B

Training Techniques

Test-Time Training: full TTT
  parameters: {"learning_rate":0.000005,"momentum":0.9,"grad_clip":1,"param_subset":"all","chunk_size":48,"distributed_lockstep":true}
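The per-chunk, full-model TTT update with these hyperparameters can be sketched as follows. This is a minimal illustration, not the PR's code: the "model" is a hypothetical single parameter with a user-supplied `loss_and_grad`, standing in for a full transformer where every parameter gets the same score-first, clip, momentum-SGD treatment.

```python
# Sketch of score-first, per-chunk test-time training (hypothetical toy model).
# Hyperparameters mirror the PR's TTT config; the 1-parameter "model" is an
# illustrative stand-in for updating all transformer parameters.
LR, MOMENTUM, GRAD_CLIP, CHUNK = 5e-6, 0.9, 1.0, 48

def ttt_stream(tokens, w, loss_and_grad):
    """Score each chunk with the CURRENT weights, then adapt on that chunk."""
    velocity, losses = 0.0, []
    for i in range(0, len(tokens), CHUNK):
        chunk = tokens[i:i + CHUNK]
        loss, grad = loss_and_grad(w, chunk)          # score first (no leak)
        losses.append(loss)
        grad = max(-GRAD_CLIP, min(GRAD_CLIP, grad))  # gradient clipping
        velocity = MOMENTUM * velocity + grad         # momentum accumulation
        w -= LR * velocity                            # SGD step
    return w, losses
```

Because each chunk is scored before the optimizer step that consumes it, the reported loss never sees weights trained on that chunk, which is the single-pass, no-leak property the PR's unit tests check.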
Optimizer: SGD
  weight_decay: null
  momentum: 0.9
  other_params: {"grad_clip":1}
Quantization:
  GPTQ (bits: 6, scope: full model)
  SpinQuant (bits: null, scope: full model)
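For intuition about what the 6-bit setting means, here is a plain round-to-nearest symmetric uniform quantizer. This is deliberately simpler than either method listed above: GPTQ uses Hessian-aware column-by-column error compensation and SpinQuant uses learned rotations, so this sketch only illustrates the bit-width, not the algorithms.

```python
# Illustrative round-to-nearest symmetric uniform quantize/dequantize.
# NOT GPTQ or SpinQuant; this only shows the precision a 6-bit grid allows.
def quantize_dequantize(weights, bits=6):
    qmax = 2 ** (bits - 1) - 1                       # 31 levels each side for 6-bit signed
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) * scale for w in weights]
```

With 6 bits the per-weight rounding error is bounded by half a grid step, which is the degradation the TTT "healing property" then recovers from.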
Evaluation: sliding window eval
  parameters: {"chunk_size":48,"context_length":2048}
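The sliding-window evaluation can be sketched as below: each 48-token chunk is scored with up to 2,048 tokens of preceding context, and summed negative log-likelihood is converted from nats to bits per byte (val_bpb). The `nll_fn` here is a hypothetical stand-in for the model's scoring call.

```python
import math

# Sketch of chunked sliding-window bpb evaluation. nll_fn(context, chunk)
# is a hypothetical stand-in returning the model's summed NLL in nats
# over the chunk, conditioned on the sliding context.
CHUNK, CONTEXT = 48, 2048

def eval_bpb(tokens, n_bytes, nll_fn):
    total_nats = 0.0
    for i in range(0, len(tokens), CHUNK):
        ctx = tokens[max(0, i - CONTEXT):i]          # sliding context window
        total_nats += nll_fn(ctx, tokens[i:i + CHUNK])
    return total_nats / (math.log(2) * n_bytes)      # nats -> bits per byte
```

Dividing by the raw byte count (rather than token count) is what makes bpb comparable across tokenizers.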
Sequence Length:
  train_length: null
  eval_length: 2048
Weight Averaging: EMA
  parameters: {"decay":0.9965}
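The EMA update with the listed decay is the standard per-parameter exponential moving average; a minimal sketch, assuming weights stored as a name-to-value mapping:

```python
# Standard EMA weight average with the PR's decay of 0.9965:
# shadow <- decay * shadow + (1 - decay) * current, per parameter.
DECAY = 0.9965

def ema_update(shadow, current, decay=DECAY):
    return {k: decay * shadow[k] + (1 - decay) * current[k] for k in shadow}
```

At decay 0.9965 the shadow weights average over roughly the last 1 / (1 - 0.9965) ≈ 286 optimizer steps, smoothing the noisy per-chunk SGD updates.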
Regularization: weight decay
  parameters: {"weight_decay":0.095}

Novel Contributions

  • End-to-end full-model test-time training that adapts all parameters per chunk instead of using LoRA adapters
  • Distributed lockstep gradient synchronization with all-reduce mean before each optimizer step so all ranks remain byte-identical
  • Empirical "healing property" observation: recovery from severe post-quantization degradation back to near pre-quantization performance
  • Param-subset throttling framework for ablations over all parameters, normalization scales, or control-scale tensors
  • Score-first TTT legality proof and unit tests for causal, single-pass, no-leak evaluation
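The lockstep synchronization in the second bullet can be sketched as below. A real multi-GPU run would use a collective such as `torch.distributed.all_reduce`; here the ranks are simulated as plain lists to show the invariant: after the mean-reduce, every rank holds identical gradients, so identical SGD steps keep all ranks byte-identical.

```python
# Simulated all-reduce mean across ranks (illustrative; a real run would
# use torch.distributed). Each rank's local gradient vector is replaced
# by the element-wise mean over all ranks before the optimizer step.
def allreduce_mean(per_rank_grads):
    """per_rank_grads: one gradient vector per rank; returns synced copies."""
    world = len(per_rank_grads)
    mean = [sum(g[i] for g in per_rank_grads) / world
            for i in range(len(per_rank_grads[0]))]
    return [list(mean) for _ in per_rank_grads]   # every rank gets the same grad
```

Averaging before (not after) the optimizer step matters: momentum buffers are updated from the already-synchronized gradient, so the buffers themselves also stay identical across ranks.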