PR #384 (open)

Non-record: Meta-TTT + Cache/OGD Eval Stacking + Tokenizer Ablation

by anantdgoel
val_bpb: 1.2882
Architecture: Transformer
Optimizer: Adam
Artifact Size: 13.2 MB

Training Techniques

Quantization
  int6 (bits: 6, scope: all)
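
The entry only records a bit width and scope, so as a rough illustration, here is a minimal symmetric per-tensor 6-bit quantize/dequantize sketch in PyTorch. The helper names and the per-tensor scaling choice are assumptions, not the PR's actual scheme.

```python
import torch

def quantize_int6(w: torch.Tensor):
    """Symmetric per-tensor 6-bit quantization (hypothetical helper, not the PR's code)."""
    qmax = 2 ** (6 - 1) - 1                       # 31 for signed 6-bit
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)  # int8 container
    return q, scale

def dequantize_int6(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(256, 256)
q, s = quantize_int6(w)
print((dequantize_int6(q, s) - w).abs().max())    # reconstruction error
```
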
Architecture
  SmearGate: architectural gating modification used in the model (parameters: null)
  BigramHash: bigram hashing feature/module used in the model (parameters: {"buckets": 4096})
Initialization
  OrthoInit: orthogonal initialization
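
OrthoInit is listed without parameters; a minimal sketch of applying orthogonal initialization to a model's 2-D weights (which layers the PR actually targets is not specified):

```python
import torch.nn as nn

def apply_ortho_init(model: nn.Module) -> None:
    # orthogonal init for every Linear weight, zero biases (layer selection is an assumption)
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.orthogonal_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
```
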
Weight Averaging
  SWA (parameters: {"enabled": true})
Regularization
  weight decay (parameters: {"muon": 0.04, "adam": 0.04})
  gradient clipping (parameters: {"norm": 0.3})
Evaluation
  sliding window eval (parameters: {"stride": 128})
Test-Time Training
  full TTT (parameters: {"learning_rate": 0.002, "momentum": 0.9, "epochs": 2})
Other
  Cache/OGD eval stacking: eval-time unigram cache mixture combined with online gradient descent on a vocab bias vector (parameters: {"cache_lambda": 0.02, "cache_decay": 0.995, "ogd_lr": 0.1}); see the sketch below.
  Meta-TTT: MAML-style meta-test-time training during training to optimize the initialization for later TTT adaptation (parameters: {"meta_loss_weight": 0.5, "inner_lr": 0.03, "start_frac": 0.5, "every": 4}); see the sketch below.
  Tokenizer ablation: tokenizer optimization using SentencePiece BPE with modified splitting settings and a longer max token length (parameters: {"split_digits": false, "split_by_unicode_script": false, "split_by_number": false, "max_sentencepiece_length": 64, "vocab_size": 8192}); see the sketch below.

Novel Contributions

  • MAML-style Meta-TTT during training to optimize initialization for test-time adaptation
  • Eval-time stacking of a unigram cache mixture and online gradient descent on a vocab bias vector on top of SGD-based TTT
  • Tokenizer optimization ablation using modified SentencePiece BPE settings
  • Controlled ablations comparing meta-TTT, cache+OGD stacking, and tokenizer changes