PR #242 (closed)
Crystal Curriculum — TF-IDF curriculum learning
by Bee Bytez (GitHub: jamesrziggy)
val_bpb
1.2988
Architecture
Transformer
Optimizer
Muon
Artifact Size
13.8MB
Training Techniques
Architecture
Transformer depth
Increased the model from 9 to 10 transformer layers.
parameters: {"layers":10}
Optimizer
Muon
weight_decay: 0.02
momentum: null
other_params: null
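The PR records Muon with weight decay 0.02 but leaves momentum and other parameters null. For context, Muon replaces a raw momentum update with an approximately orthogonalized one via a Newton-Schulz iteration. A minimal NumPy sketch follows; the quintic coefficients match the public reference implementation, while the momentum value and step structure here are assumptions, not settings taken from this PR:

```python
import numpy as np

def newton_schulz5(G, steps=5):
    """Approximately orthogonalize G with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the reference Muon impl
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize by Frobenius norm
    if G.shape[0] > G.shape[1]:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if G.shape[0] > G.shape[1]:
        X = X.T
    return X

def muon_update(param, grad, buf, lr=0.02, momentum=0.95, weight_decay=0.02):
    """One Muon step: momentum accumulation, orthogonalized update, decoupled weight decay.
    momentum=0.95 is an assumed value; the PR lists it as null."""
    buf = momentum * buf + grad
    update = newton_schulz5(buf)
    param = param * (1.0 - lr * weight_decay) - lr * update
    return param, buf
```

The orthogonalization step is what distinguishes Muon from plain momentum SGD; weight decay is applied in decoupled (AdamW-style) form.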
Other
other
TF-IDF-based curriculum learning via a Crystallizer module that draws 4 candidate batches per step and selects the one with the highest unigram information density.
parameters: {"oversample":4,"warmup_frac":0.7}
LR Schedule
warmdown
parameters: {"active_frac":0.7,"decay_to_uniform":true}
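A warmdown schedule with `active_frac: 0.7` is conventionally a constant learning rate for the first 70% of training followed by a linear decay to zero. A minimal sketch under that interpretation; the PR gives only the parameter name, so the exact shape is an assumption:

```python
def warmdown_lr(step, total_steps, base_lr, active_frac=0.7):
    """Hold base_lr for the first active_frac of training, then decay linearly to 0."""
    active_steps = int(active_frac * total_steps)
    if step < active_steps:
        return base_lr
    frac = (step - active_steps) / max(1, total_steps - active_steps)
    return base_lr * (1.0 - frac)
```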
Regularization
weight decay
parameters: {"weight_decay":0.02}
Test-Time Training
LoRA TTT
parameters: null
Compression
zlib
level: null
Novel Contributions
- TF-IDF-based curriculum learning using a Crystallizer module
- Oversampling candidate batches and selecting the densest batch by unigram information density
- Curriculum schedule that decays from dense-data sampling to uniform sampling
- Increased transformer depth from 9 to 10 layers
- Use of Muon optimizer with weight decay 0.02
- Application of data distillation / information-density scoring to language model pre-training
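The decay-to-uniform contribution above can be sketched as a mixing schedule: the curriculum stays fully active for a fraction of training, then its selection probability decays linearly until sampling is uniform. Interpreting the PR's `warmup_frac: 0.7` as the fully-active fraction is an assumption, as is the linear decay shape:

```python
import random

def curriculum_prob(step, total_steps, active_frac=0.7):
    """Curriculum strength: 1.0 during the active phase, then linear decay to 0 (uniform)."""
    cut = int(active_frac * total_steps)
    if step < cut:
        return 1.0
    return max(0.0, 1.0 - (step - cut) / max(1, total_steps - cut))

def next_batch(step, total_steps, crystallized_batch, uniform_batch, rng=random.random):
    """With probability curriculum_prob, take the density-selected batch; else sample uniformly."""
    p = curriculum_prob(step, total_steps)
    return crystallized_batch if rng() < p else uniform_batch
```

Near the end of training the model therefore sees the unbiased data distribution, which avoids a train/eval mismatch from density-skewed sampling.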