PR #1735
openRecord: SP8192 + Parallel Pre-Quant TTT — val_bpb 1.0429 (3-seed mean)
by AjAnubolu
val_bpb: 1.0429
Architecture: Transformer
Optimizer: AdamW
Artifact Size: ~15.99 MB
Training Techniques
Test-Time Training: full TTT
parameters: {"enabled":true,"epochs":21,"parallel_gpus":8,"pre_quant":true}
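The core idea of pre-quant TTT is that the weights are adapted on the test data itself, for the stated 21 epochs, *before* quantization, so the quantizer sees the adapted weights. A minimal single-process sketch, with a single scalar fit standing in for the real transformer and plain gradient steps standing in for AdamW (both hypothetical simplifications):

```python
def ttt_pre_quant(weight, test_tokens, epochs=21, lr=0.1):
    """Sketch of pre-quant test-time training: adapt the weight on
    the test sequence before any quantization happens. The 'model'
    here is one scalar fit to the token mean, a stand-in for the
    real network; the optimizer is plain gradient descent, a
    stand-in for AdamW."""
    for _ in range(epochs):
        # mean squared error against the test data; one gradient step
        grad = sum(2 * (weight - t) for t in test_tokens) / len(test_tokens)
        weight -= lr * grad
    return weight

tokens = [0.2, 0.4, 0.6, 0.8]
adapted = ttt_pre_quant(0.0, tokens)
# after 21 epochs the weight sits near the data mean (0.5)
```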
Optimizer: AdamW
weight_decay: 0
momentum: null
other_params: {"pre_quant_ttt":true}
LR Schedule: cosine decay
parameters: {"scope":"epoch-level","t_max":21,"eta_min_ratio":0.1}
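With `t_max=21` and `eta_min_ratio=0.1`, the epoch-level schedule is half a cosine from the base LR down to 10% of it over the 21 TTT epochs. A small sketch (the base LR value is not stated in the record, so `base_lr=1.0` is a placeholder):

```python
import math

def cosine_lr(epoch, base_lr, t_max=21, eta_min_ratio=0.1):
    """Epoch-level cosine decay: LR follows half a cosine from
    base_lr down to eta_min_ratio * base_lr over t_max epochs."""
    eta_min = eta_min_ratio * base_lr
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

schedule = [cosine_lr(e, base_lr=1.0) for e in range(22)]
# schedule[0] == 1.0, schedule[21] == 0.1, monotonically decreasing
```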
Quantization: GPTQ
bits: 6
scope: full model
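At 6 bits there are 64 codes per weight. The sketch below shows only the symmetric round-to-nearest grid that such a quantizer maps onto; real GPTQ additionally chooses roundings column by column using second-order (Hessian) information, which is omitted here:

```python
def quant6(weights):
    """Round-to-nearest onto a symmetric 6-bit grid (codes -31..31).
    This is only the quantization grid; GPTQ's Hessian-aware
    error-compensation step is not shown."""
    scale = max(abs(w) for w in weights) / 31
    codes = [max(-31, min(31, round(w / scale))) for w in weights]
    dequant = [c * scale for c in codes]
    return codes, dequant, scale

w = [0.5, -1.0, 0.31, 0.02]
codes, deq, scale = quant6(w)
# per-weight error is bounded by half a quantization step (scale / 2)
```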
Compression: lzma
level: null
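The final packaging step lzma-compresses the quantized artifact to get under the size limit. A sketch using Python's standard `lzma` module; since the record leaves the level as null, `preset=9` below is an assumed value:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """Compress the packed weight bytes with lzma. The preset is a
    guess; the record does not specify a level."""
    return lzma.compress(raw, preset=9)

blob = bytes(range(256)) * 64          # stand-in for packed int6 weights
packed = compress_artifact(blob)
restored = lzma.decompress(packed)     # lossless round-trip
```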
Evaluation: sliding window eval
parameters: {"stride":64,"single_pass":true}
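With `stride=64` and `single_pass=true`, the natural reading is that windows advance by 64 tokens and only the final 64 positions of each window are scored, so every token is scored exactly once. A sketch of that evaluation plan (the context window length of 256 is a placeholder; the record only gives the stride):

```python
def sliding_windows(n_tokens, window=256, stride=64):
    """Plan for sliding-window evaluation: each entry is
    (context_start, score_start, score_end). Windows advance by
    `stride` and only the last `stride` positions are scored, so
    the pass over the data touches each token's loss once."""
    plan = []
    start = 0
    while start < n_tokens:
        ctx_start = max(0, start - (window - stride))
        plan.append((ctx_start, start, min(start + stride, n_tokens)))
        start += stride
    return plan

plan = sliding_windows(300)
scored = sum(end - s for _, s, end in plan)
# scored == 300: every token is scored exactly once (single pass)
```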
Weight Averaging: EMA
parameters: null
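EMA weight averaging keeps a slow-moving copy of the parameters that is updated after each step. The record gives no parameters, so the decay of 0.999 below is an assumed, typical value:

```python
def ema_update(avg, new, decay=0.999):
    """One EMA step over a flat parameter list: the running average
    moves a small fraction (1 - decay) toward the current weights.
    decay=0.999 is assumed; the record states no value."""
    return [decay * a + (1 - decay) * n for a, n in zip(avg, new)]

avg = [0.0, 1.0]
avg = ema_update(avg, [1.0, 1.0])
```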
Architecture: 3-layer depth recurrence with virtual layers
parameters: {"layers":3,"virtual_layers":17}
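One plausible reading of "3-layer depth recurrence with 17 virtual layers" is that the 3 physical blocks are reused in a cycle until 17 virtual layers have run, so the parameter count stays at 3 layers while the effective depth is 17. The cyclic unrolling order below is an assumption; the record only gives the two counts:

```python
def recurrence_schedule(physical=3, virtual=17):
    """Map each virtual layer index to the physical block that
    executes it, cycling through the shared blocks (assumed order)."""
    return [v % physical for v in range(virtual)]

sched = recurrence_schedule()
# 17 virtual layers, each served by one of the 3 shared blocks
```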
Weight Tying: tied embeddings
parameters: null
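Tied embeddings mean the input embedding table and the output projection share one matrix, so the artifact stores it once. A minimal sketch of the sharing, using a plain nested list as a stand-in for the weight tensor:

```python
def make_tied_lm(vocab=5, dim=3):
    """Tied embeddings: 'embed' and 'unembed' reference the same
    object, not a copy, so any update to one is seen by the other
    and only one matrix exists in the artifact."""
    table = [[0.0] * dim for _ in range(vocab)]
    return {"embed": table, "unembed": table}

lm = make_tied_lm()
lm["embed"][0][0] = 1.0   # visible through the tied output projection
```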
Novel Contributions
- 8-GPU parallel pre-quant AdamW TTT using federated averaging across ranks
- Epoch-level cosine learning-rate schedule across 21 TTT epochs
- torch.compile acceleration for TTT forward pass
- Fixed predictor with no eval-time adaptation, SLOT, or n-gram cache
- GPTQ int6 artifact under the 16 MB limit
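The cross-rank step of the 8-GPU parallel TTT can be sketched as FedAvg-style averaging: each rank adapts its own copy of the weights on its shard, then all copies are averaged into one set. The distributed plumbing (`all_reduce`, process groups) is omitted here:

```python
def federated_average(rank_weights):
    """Average per-rank parameter copies element-wise (FedAvg-style).
    Each inner list is one rank's flat parameter vector; the real
    run would do this with a distributed all_reduce."""
    n = len(rank_weights)
    return [sum(params) / n for params in zip(*rank_weights)]

# 8 ranks, each holding a 2-parameter copy (toy values)
ranks = [[float(r), float(r) * 2] for r in range(8)]
avg = federated_average(ranks)
# avg == [3.5, 7.0], the element-wise mean across the 8 ranks
```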