PR #384
Non-record: Meta-TTT + Cache/OGD Eval Stacking + Tokenizer Ablation
by anantdgoel
val_bpb
1.2882
Architecture
Transformer
Optimizer
Adam
Artifact Size
13.2 MB
Training Techniques
Quantization
int6
bits: 6
scope: all
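The PR lists only the quantization hyperparameters (6 bits, all weights); a minimal sketch of symmetric per-tensor fake quantization consistent with those settings, with all names and the rounding scheme being assumptions:

```python
import numpy as np

def fake_quantize(w, bits=6):
    """Symmetric per-tensor fake quantization to `bits`-bit integers (sketch)."""
    qmax = 2 ** (bits - 1) - 1           # 31 positive levels for int6
    scale = np.max(np.abs(w)) / qmax
    if scale == 0.0:
        return w.copy()
    # round to the integer grid, then dequantize back to floats
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.array([0.5, -0.25, 0.1])
w_q = fake_quantize(w)
```

The maximum absolute error of this scheme is half the scale, i.e. `max|w| / (2 * 31)` for int6.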
Architecture
SmearGate
Architectural gating modification used in the model.
parameters: null
BigramHash
Bigram hashing feature/module used in the model.
parameters: {"buckets":4096}
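The PR specifies only the bucket count (4096); a sketch of how bigram hashing typically feeds a learned embedding table, with the hash function and padding choice being assumptions:

```python
import numpy as np

def bigram_hash_buckets(token_ids, buckets=4096, mult=1000003):
    """Hash consecutive token-id pairs into a fixed number of buckets (sketch).

    The bucket index at position t depends on tokens (t-1, t), so an
    embedding table of shape (buckets, d) can contribute bigram features.
    """
    ids = np.asarray(token_ids, dtype=np.int64)
    prev = np.concatenate(([0], ids[:-1]))   # shift right; 0 pads position 0
    return (prev * mult + ids) % buckets

buckets = bigram_hash_buckets([5, 17, 17, 9])
```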
Initialization
OrthoInit
Orthogonal initialization.
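Orthogonal initialization is standard; a self-contained sketch via QR decomposition of a Gaussian matrix (the gain and RNG seed are placeholders):

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, rng=None):
    """Orthogonal initialization: QR-decompose a Gaussian matrix (sketch)."""
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal(shape)
    q, r = np.linalg.qr(a)
    d = np.sign(np.diag(r))
    d[d == 0] = 1.0
    return gain * q * d        # fix QR sign ambiguity so columns are deterministic
```

For an (m, n) matrix with m >= n, the resulting columns are orthonormal.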
Weight Averaging
SWA
parameters: {"enabled":true}
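The PR only flags SWA as enabled; the core of stochastic weight averaging is a running mean over checkpoints, sketched here (the checkpoint schedule is not specified in the PR):

```python
import numpy as np

def swa_update(avg, w, n):
    """Fold checkpoint weights w into running average avg after n prior models."""
    return avg + (w - avg) / (n + 1)

avg = np.zeros(2)
for n, w in enumerate([np.array([1.0, 2.0]), np.array([3.0, 4.0])]):
    avg = swa_update(avg, w, n)
# avg is now the mean of the two checkpoints: [2.0, 3.0]
```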
Regularization
weight decay
parameters: {"muon":0.04,"adam":0.04}
gradient clipping
parameters: {"norm":0.3}
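The two regularizers above combine straightforwardly in an update step; a sketch using the PR's values (weight decay 0.04 for both the Muon and Adam parameter groups, global gradient norm clipped to 0.3), with the learning rate and the decoupled-decay formulation being assumptions:

```python
import numpy as np

def clip_and_decay(w, g, lr=0.01, weight_decay=0.04, clip_norm=0.3):
    """Global-norm gradient clipping plus decoupled weight decay (sketch)."""
    norm = float(np.sqrt(np.sum(g * g)))
    if norm > clip_norm:
        g = g * (clip_norm / norm)        # rescale so ||g|| == clip_norm
    w = w * (1.0 - lr * weight_decay)     # decoupled (AdamW-style) decay
    return w - lr * g, g

w_new, g_clipped = clip_and_decay(np.array([1.0]), np.array([1.0]))
```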
Evaluation
sliding window eval
parameters: {"stride":128}
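The PR gives only the stride (128); a sketch of the usual sliding-window evaluation loop, where after the first window each window scores only its newest `stride` tokens so that scored tokens keep long left context (`nll_fn` is a hypothetical model call):

```python
def sliding_window_eval(nll_fn, tokens, window=512, stride=128):
    """Mean per-token loss with a sliding context window (sketch).

    nll_fn(ctx) returns per-position negative log-likelihoods for
    predicting ctx[1:] from ctx[:-1].
    """
    total, count, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        losses = nll_fn(tokens[begin:end])    # length end - begin - 1
        new = losses[-(end - prev_end):]      # only the not-yet-scored tail
        total += sum(new)
        count += len(new)
        prev_end = end
        if end == len(tokens):
            break
    return total / count

# dummy model assigning constant loss 1.0 per predicted token
loss = sliding_window_eval(lambda ctx: [1.0] * (len(ctx) - 1),
                           list(range(300)), window=256, stride=128)
```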
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"momentum":0.9,"epochs":2}
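Using the PR's hyperparameters (lr 0.002, momentum 0.9, 2 epochs), full test-time training reduces to SGD with momentum on the evaluation text; a sketch where `grad_fn` stands in for the (hypothetical) LM-loss gradient closure:

```python
import numpy as np

def full_ttt(w, grad_fn, lr=0.002, momentum=0.9, epochs=2):
    """SGD-with-momentum test-time training (sketch with PR hyperparameters).

    grad_fn(w) returns the gradient of the LM loss on the evaluation text;
    the adapted weights are then used to score that same text.
    """
    v = np.zeros_like(w)
    for _ in range(epochs):
        g = grad_fn(w)
        v = momentum * v + g
        w = w - lr * v
    return w

# toy example: quadratic loss 0.5*||w||^2, whose gradient is w itself
w_adapted = full_ttt(np.array([1.0, -1.0]), lambda w: w)
```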
Other
other
Eval-time unigram cache mixture combined with online gradient descent on a vocab bias vector.
parameters: {"cache_lambda":0.02,"cache_decay":0.995,"ogd_lr":0.1}
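A sketch of one eval step of the cache+OGD idea described above, using the PR's hyperparameters; the function names, the cross-entropy gradient used for the bias update, and the uniform fallback for an empty cache are all assumptions:

```python
import numpy as np

def cache_ogd_step(logits, cache, bias, token,
                   cache_lambda=0.02, cache_decay=0.995, ogd_lr=0.1):
    """Mix a decayed unigram cache into the model distribution and update a
    per-vocab logit bias by online gradient descent (sketch).

    logits: model logits at the current position, shape (V,)
    cache:  decayed unigram counts over tokens seen so far, shape (V,)
    bias:   per-vocab logit offset learned online, shape (V,)
    token:  ground-truth next token id
    Returns (probability assigned to `token`, updated cache, updated bias).
    """
    z = logits + bias
    p_model = np.exp(z - z.max())
    p_model /= p_model.sum()
    total = cache.sum()
    p_cache = cache / total if total > 0 else np.full_like(cache, 1.0 / len(cache))
    p_mix = (1 - cache_lambda) * p_model + cache_lambda * p_cache
    # OGD on cross-entropy w.r.t. bias: gradient is p_model - onehot(token)
    grad = p_model.copy()
    grad[token] -= 1.0
    bias = bias - ogd_lr * grad
    # decay the cache and record the observed token
    cache = cache * cache_decay
    cache[token] += 1.0
    return p_mix[token], cache, bias

p, cache, bias = cache_ogd_step(np.zeros(4), np.zeros(4), np.zeros(4), token=2)
```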
other
MAML-style meta-test-time training during training to optimize initialization for later TTT adaptation.
parameters: {"meta_loss_weight":0.5,"inner_lr":0.03,"start_frac":0.5,"every":4}
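The PR applies the meta objective from 50% of training onward, every 4 steps, with inner lr 0.03 and weight 0.5. A first-order MAML-style sketch of that objective; the support/query split and the single inner step are assumptions about the exact setup:

```python
import numpy as np

def meta_ttt_loss(w, inner_grad, outer_loss, inner_lr=0.03, meta_loss_weight=0.5):
    """First-order MAML-style meta-TTT objective (sketch).

    Take one inner SGD step on a 'support' loss, evaluate a 'query' loss at
    the adapted weights, and blend it with the base loss so the trained
    initialization stays easy to adapt at test time.
    """
    w_adapted = w - inner_lr * inner_grad(w)    # simulated TTT step
    base = outer_loss(w)
    meta = outer_loss(w_adapted)
    return (1 - meta_loss_weight) * base + meta_loss_weight * meta

# toy example: loss 0.5*||w||^2, gradient w; the inner step lowers the blend
loss = meta_ttt_loss(np.array([1.0]), lambda w: w,
                     lambda w: 0.5 * float(np.sum(w * w)))
```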
other
Tokenizer optimization using SentencePiece BPE with modified splitting settings and a longer maximum piece length.
parameters: {"split_digits":false,"split_by_unicode_script":false,"split_by_number":false,"max_sentencepiece_length":64,"vocab_size":8192}
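The listed settings correspond to SentencePiece trainer flags; a sketch of the training invocation under those settings (the corpus path and model prefix are placeholders):

```
spm_train --input=corpus.txt --model_prefix=tok --model_type=bpe \
  --vocab_size=8192 \
  --split_digits=false \
  --split_by_unicode_script=false \
  --split_by_number=false \
  --max_sentencepiece_length=64
```

Disabling script/number splitting and raising `max_sentencepiece_length` (default 16) lets BPE learn longer multi-word pieces.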
Novel Contributions
- MAML-style Meta-TTT during training to optimize initialization for test-time adaptation
- Eval-time stacking of a unigram cache mixture and online gradient descent on a vocab bias vector on top of SGD-based TTT
- Tokenizer optimization ablation using modified SentencePiece BPE settings
- Controlled ablations comparing meta-TTT, cache+OGD stacking, and tokenizer changes