PR #242 (closed)

Crystal Curriculum — TF-IDF curriculum learning by Bee Bytez

by jamesrziggy
val_bpb: 1.2988
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.8 MB

Training Techniques

Architecture
Transformer depth
Increased the model from 9 to 10 transformer layers.
parameters: {"layers":10}
Optimizer
Muon
weight_decay: 0.02
momentum: not specified
other_params: not specified
Other
TF-IDF curriculum
TF-IDF-based curriculum learning via a Crystallizer module that draws 4 candidate batches per step and keeps the one with the highest unigram information density.
parameters: {"oversample":4,"warmup_frac":0.7}
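The summary names the Crystallizer but does not show its scoring code. A minimal sketch of the oversample-and-select step, using mean unigram surprisal as a stand-in for the PR's TF-IDF density score (the function names and smoothing here are hypothetical):

```python
import math
from collections import Counter

def info_density(batch_tokens, corpus_counts, corpus_total):
    """Mean per-token surprisal (-log2 p) under corpus unigram
    statistics; a rough stand-in for the PR's TF-IDF score, which
    this summary does not spell out."""
    bits = 0.0
    for tok in batch_tokens:
        # Add-0.5 smoothing so unseen tokens get a finite probability.
        p = (corpus_counts.get(tok, 0) + 0.5) / (corpus_total + 0.5)
        bits += -math.log2(p)
    return bits / len(batch_tokens)

def crystallize(sample_batch, corpus_counts, corpus_total, oversample=4):
    """Draw `oversample` candidate batches and return the densest one,
    matching the PR's oversample=4 setting."""
    candidates = [sample_batch() for _ in range(oversample)]
    return max(candidates,
               key=lambda b: info_density(b, corpus_counts, corpus_total))
```

Batches dominated by rare tokens score higher and are preferred early in training.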
LR Schedule
warmdown
parameters: {"active_frac":0.7,"decay_to_uniform":true}
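The warmdown entry lists only active_frac=0.7. A sketch assuming the common constant-then-linear-decay shape (the exact curve is an assumption):

```python
def lr_warmdown(step, max_steps, base_lr=1.0, active_frac=0.7):
    """Hold the learning rate constant for the first active_frac of
    training, then decay linearly to zero over the remainder."""
    frac = step / max_steps
    if frac < active_frac:
        return base_lr
    return base_lr * (1.0 - frac) / (1.0 - active_frac)
```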
Regularization
weight decay
parameters: {"weight_decay":0.02}
Test-Time Training
LoRA TTT
parameters: not specified
Compression
zlib
level: not specified
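The artifact is zlib-compressed, but no level is given. A minimal round-trip sketch; level 6 is an assumption (zlib's usual speed/ratio default), and the function names are hypothetical:

```python
import zlib

def pack_artifact(raw: bytes, level: int = 6) -> bytes:
    # The PR does not state a compression level; 6 is assumed here.
    return zlib.compress(raw, level)

def unpack_artifact(blob: bytes) -> bytes:
    return zlib.decompress(blob)
```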

Novel Contributions

  • TF-IDF-based curriculum learning using a Crystallizer module
  • Oversampling candidate batches and selecting the densest batch by unigram information density
  • Curriculum schedule that decays from dense-data sampling to uniform sampling
  • Increased transformer depth from 9 to 10 layers
  • Use of the Muon optimizer with a weight decay of 0.02
  • Application of data distillation / information-density scoring to language model pre-training
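The contribution list describes a curriculum that decays from dense-data sampling to uniform sampling. A sketch of the mixing probability, tied to the listed warmup_frac=0.7 (the linear decay shape is an assumption; the summary only states that sampling decays to uniform):

```python
def curriculum_mix_prob(step, max_steps, warmup_frac=0.7):
    """Probability of using the Crystallizer-selected (dense) batch
    rather than a uniformly sampled one: held at 1.0 through the
    first warmup_frac of training, then decayed linearly to 0."""
    frac = step / max_steps
    if frac <= warmup_frac:
        return 1.0
    return max(0.0, (1.0 - frac) / (1.0 - warmup_frac))
```

At each step, a draw below this probability would take the dense batch; otherwise a uniform batch is used, so late training matches the base data distribution.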