PR #183

open

Non-record: Cache LM + LoRA TTT (negative result on cache, positive on TTT)

by anantdgoel
val_bpb
1.2529
Architecture
Transformer
Optimizer
Adam
Artifact Size
15.14 MB

Training Techniques

Quantization
int8
bits: 8
scope: model artifact
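The int8 quantization of the model artifact can be sketched as symmetric per-tensor rounding; the PR does not record the exact scheme, so the scale choice below (max-abs over 127) is an assumption:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization (illustrative scheme)."""
    scale = np.abs(w).max() / 127.0 if w.size else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
# reconstruction error is bounded by half a quantization step
assert np.abs(w - w_hat).max() <= s / 2 + 1e-6
```

Since the scope is the model artifact, the int8 tensors (plus scales) are what get serialized and counted toward the 15.14 MB.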
Architecture
tied embeddings
Input and output embeddings are tied.
parameters: null
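Tied embeddings mean one shared matrix serves as both the input lookup table and the output projection, halving the embedding parameter count. A minimal sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 100, 16
E = rng.normal(size=(vocab, d)).astype(np.float32)  # the single shared matrix

def embed(token_ids):
    # input embedding: row lookup into E
    return E[token_ids]

def logits(hidden):
    # output projection reuses E (tied), so no separate lm_head matrix
    return hidden @ E.T

h = embed(np.array([3, 7]))
out = logits(h)
assert out.shape == (2, vocab)
```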
Optimizer
Adam
weight_decay: null
momentum: null
other_params: {"lr":0.01,"betas":[0.9,0.95]}
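The recorded hyperparameters (lr=0.01, betas=(0.9, 0.95)) drive the standard Adam update below; eps and the absence of weight decay are assumptions, since the PR records weight_decay as null and does not list eps:

```python
import numpy as np

lr, b1, b2, eps = 0.01, 0.9, 0.95, 1e-8  # eps assumed; lr and betas from the PR

def adam_step(p, g, m, v, t):
    """One Adam update: p = params, g = gradient, (m, v) = moment estimates, t = step."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)   # bias correction
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps)
    return p, m, v

p, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
g = np.array([1.0, -1.0, 0.5])
p, m, v = adam_step(p, g, m, v, t=1)
```

On the first step the bias-corrected update reduces to roughly `-lr * sign(g)`, so each parameter moves by about 0.01.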
Compression
zlib
level: null
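With the level recorded as null, zlib's default compression level is assumed. The round trip is lossless, so it shrinks the serialized artifact without affecting val_bpb:

```python
import zlib

# Stand-in bytes for the serialized (quantized) model artifact.
payload = b"weights " * 1000

compressed = zlib.compress(payload)            # default level, since none is recorded
assert zlib.decompress(compressed) == payload  # lossless round trip
assert len(compressed) < len(payload)          # repetitive data compresses well
```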
Evaluation
sliding window eval
parameters: {"stride":128}
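A sketch of sliding-window evaluation with the recorded stride of 128 and eval_length of 1024. The convention that only the newly exposed tokens of each window are scored (so every token is evaluated exactly once, with up to 1024 tokens of context) is an assumption, since the PR records only the parameters:

```python
eval_len, stride = 1024, 128  # values recorded in the PR

def windows(n_tokens):
    """Yield (start, end, score_from) triples: the model conditions on
    tokens[start:end] but only tokens[score_from:end] contribute to the loss."""
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - eval_len)   # cap the context at eval_len tokens
        spans.append((start, end, pos))
        pos = end
    return spans

spans = windows(2000)
# every token in the stream is scored exactly once
assert sum(end - score_from for _, end, score_from in spans) == 2000
```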
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01,"chunk_size":256,"batch_size":64}
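A minimal sketch of the rank-8 adapter math, assuming standard LoRA: a frozen base weight plus a low-rank update, with the B factor zero-initialized so each document starts from the base model. Dimensions are illustrative; the adapted matrices in the PR are Q, V, and lm_head:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 8  # rank 8 as recorded; hidden size d is illustrative

W = rng.normal(size=(d, d)).astype(np.float32)              # frozen base weight
A = rng.normal(scale=0.01, size=(r, d)).astype(np.float32)  # trainable LoRA factor
B = np.zeros((d, r), dtype=np.float32)                      # zero-init: adapter starts as a no-op

def forward(x):
    # Score-first: each chunk is scored with the current adapter state
    # before any TTT gradient step updates A and B on that chunk.
    return x @ (W + B @ A).T

x = rng.normal(size=(4, d)).astype(np.float32)
assert np.allclose(forward(x), x @ W.T)  # B = 0, so the first chunk sees the base model
```

Per the recorded parameters, TTT would process 256-token chunks in batches of 64 at learning rate 0.01, with the adapter reset at each document boundary.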
Other
unigram cache LM
Unigram cache language model interpolation during evaluation, using decayed per-document token frequency counts.
parameters: {"lambda":0.02,"decay":0.98}
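The cache LM interpolation can be sketched as follows, using the recorded lambda=0.02 and decay=0.98. The uniform fallback before any counts exist and the `model_probs` input (a stand-in for the base model's next-token probabilities) are assumptions for the sake of a self-contained example:

```python
import math
from collections import defaultdict

lam, decay = 0.02, 0.98  # parameters recorded in the PR

def interpolated_nll(tokens, model_probs, vocab_size):
    """Total NLL with the model probability interpolated against a decayed
    per-document unigram cache: p = (1 - lam) * p_model + lam * p_cache.
    model_probs[i] stands in for the base model's probability of tokens[i]."""
    counts = defaultdict(float)   # decayed frequency of each token seen so far
    total = 0.0                   # decayed count of all tokens seen so far
    nll = 0.0
    for tok, p_model in zip(tokens, model_probs):
        # before the cache has seen anything, fall back to a uniform prior
        p_cache = counts[tok] / total if total > 0 else 1.0 / vocab_size
        nll -= math.log((1 - lam) * p_model + lam * p_cache)
        # decay old counts (naive O(vocab) form for clarity), then record tok
        for k in counts:
            counts[k] *= decay
        total = total * decay + 1.0
        counts[tok] += 1.0
    return nll
```

On repeated tokens the cache raises the interpolated probability, which is the intended effect; the PR nonetheless reports that this hurts val_bpb on FineWeb.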
Sequence Length
sequence_length
train_length: null
eval_length: 1024

Novel Contributions

  • Per-document LoRA test-time training on Q/V/lm_head with score-first updates
  • Unigram cache language model interpolation during evaluation
  • Negative result showing the unigram cache LM hurts on FineWeb
  • Combination of LoRA TTT and cache LM in eval-time pipeline