PR #242 (closed)

Crystal Curriculum — TF-IDF curriculum learning by Bee Bytez

by jamesrziggy
val_bpb: 1.2988
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.8 MB

Training Techniques

Architecture
Transformer depth
Increased the model from 9 to 10 transformer layers.
parameters: {"layers":10}
Optimizer
Muon
weight_decay: 0.02
momentum: not specified
other_params: not specified
Other
TF-IDF curriculum
TF-IDF-based curriculum learning via a Crystallizer module that draws 4 candidate batches per step and keeps the one with the highest unigram information density.
parameters: {"oversample":4,"warmup_frac":0.7}
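The summary names the Crystallizer but does not show its scoring code. A minimal sketch of the oversample-and-select step, using mean unigram surprisal as a stand-in for the PR's TF-IDF density score (the function names and smoothing here are hypothetical):

```python
import math
from collections import Counter

def info_density(batch_tokens, corpus_counts, corpus_total):
    """Mean per-token surprisal (-log2 p) under corpus unigram
    statistics; a rough stand-in for the PR's TF-IDF score, which
    this summary does not spell out."""
    bits = 0.0
    for tok in batch_tokens:
        # Add-0.5 smoothing so unseen tokens get a finite probability.
        p = (corpus_counts.get(tok, 0) + 0.5) / (corpus_total + 0.5)
        bits += -math.log2(p)
    return bits / len(batch_tokens)

def crystallize(sample_batch, corpus_counts, corpus_total, oversample=4):
    """Draw `oversample` candidate batches and return the densest one,
    matching the PR's oversample=4 setting."""
    candidates = [sample_batch() for _ in range(oversample)]
    return max(candidates,
               key=lambda b: info_density(b, corpus_counts, corpus_total))
```

Batches dominated by rare tokens score higher and are preferred early in training.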
LR Schedule
warmdown
parameters: {"active_frac":0.7,"decay_to_uniform":true}
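The warmdown entry lists only active_frac=0.7. A sketch assuming the common constant-then-linear-decay shape (the exact curve is an assumption):

```python
def lr_warmdown(step, max_steps, base_lr=1.0, active_frac=0.7):
    """Hold the learning rate constant for the first active_frac of
    training, then decay linearly to zero over the remainder."""
    frac = step / max_steps
    if frac < active_frac:
        return base_lr
    return base_lr * (1.0 - frac) / (1.0 - active_frac)
```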
Regularization
weight decay
parameters: {"weight_decay":0.02}
Test-Time Training
LoRA TTT
parameters: not specified
Compression
zlib
level: not specified
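The artifact is zlib-compressed, but no level is given. A minimal round-trip sketch; level 6 is an assumption (zlib's usual speed/ratio default), and the function names are hypothetical:

```python
import zlib

def pack_artifact(raw: bytes, level: int = 6) -> bytes:
    # The PR does not state a compression level; 6 is assumed here.
    return zlib.compress(raw, level)

def unpack_artifact(blob: bytes) -> bytes:
    return zlib.decompress(blob)
```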

Novel Contributions

  • TF-IDF-based curriculum learning using a Crystallizer module
  • Oversampling candidate batches and selecting the densest batch by unigram information density
  • Curriculum schedule that decays from dense-data sampling to uniform sampling
  • Increased transformer depth from 9 to 10 layers
  • Use of the Muon optimizer with a weight decay of 0.02
  • Application of data distillation / information-density scoring to language model pre-training
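The contribution list describes a curriculum that decays from dense-data sampling to uniform sampling. A sketch of the mixing probability, tied to the listed warmup_frac=0.7 (the linear decay shape is an assumption; the summary only states that sampling decays to uniform):

```python
def curriculum_mix_prob(step, max_steps, warmup_frac=0.7):
    """Probability of using the Crystallizer-selected (dense) batch
    rather than a uniformly sampled one: held at 1.0 through the
    first warmup_frac of training, then decayed linearly to 0."""
    frac = step / max_steps
    if frac <= warmup_frac:
        return 1.0
    return max(0.0, (1.0 - frac) / (1.0 - warmup_frac))
```

At each step, a draw below this probability would take the dense batch; otherwise a uniform batch is used, so late training matches the base data distribution.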