PR #650
open · -0.0041 BPB by Reordering Training Data (Curriculum Learning)
by abaybektursun
val_bpb
1.1187
Architecture
Transformer
Optimizer
—
Artifact Size
~15.9 MB
Training Techniques
- Test-Time Training (Legal TTT); parameters: null
- Other: Reordering training data shards by model perplexity (hardest-first) to improve training efficiency and final val_bpb without changing model architecture or hyperparameters.
  parameters: {"shard_order_env_var": "SHARD_ORDER", "ranking_model": "6-layer, 512d model trained 500 steps on shard 0", "ranking_metric": "cross-entropy loss", "ordering": "descending loss (hardest first)"}
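The ranking step described by these parameters can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function names and the per-shard loss values are hypothetical, standing in for cross-entropy losses measured by the small 6-layer ranking model.

```python
def rank_shards(shard_losses):
    """Order shard indices by descending cross-entropy loss (hardest first)."""
    return sorted(shard_losses, key=shard_losses.get, reverse=True)

def shard_order_env(order):
    """Serialize the ordering for the SHARD_ORDER environment variable."""
    return ",".join(str(i) for i in order)

# Illustrative per-shard losses from the small ranking model (made-up numbers).
losses = {0: 3.12, 1: 3.47, 2: 2.98, 3: 3.55}
order = rank_shards(losses)    # [3, 1, 0, 2]
print(shard_order_env(order))  # prints "3,1,0,2"
```

The resulting string would then be exported as `SHARD_ORDER` before launching training.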
Novel Contributions
- Demonstrated that reordering training shards by model difficulty (perplexity) reduces validation BPB by about 0.0033 on average, without any model or hyperparameter changes
- Showed that both hardest-first and easiest-first shard orderings outperform the default sequential ordering, indicating the default order contains harmful structure
- Introduced a simple method to rank shards by training a small model briefly on one shard and scoring all shards by cross-entropy loss
- Highlighted that token frequency statistics fail to capture shard difficulty differences that the model's perplexity reveals
- Proposed that adaptive or iterative re-ranking of shards during training could further improve results
- Provided a minimal code change to implement shard reordering via an environment variable
- Validated improvements across three random seeds with consistent gains
- Raised the hypothesis that the improvement is due to breaking accidental structure in sequential shard ordering rather than curriculum learning
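The "minimal code change" referenced above can be sketched as a small hook in the dataloader that reorders the shard file list according to the environment variable. The variable name `SHARD_ORDER` matches the PR's parameters; the function and file names are illustrative assumptions, not the PR's actual implementation.

```python
import os

def reorder_shards(shard_paths, env=os.environ):
    """Reorder shard paths per the SHARD_ORDER env var ("3,1,0,2" style).

    If the variable is unset or empty, fall back to the default
    sequential ordering.
    """
    order = env.get("SHARD_ORDER")
    if not order:
        return list(shard_paths)
    indices = [int(i) for i in order.split(",")]
    return [shard_paths[i] for i in indices]

# Hypothetical shard file names for illustration.
shards = ["shard_000.bin", "shard_001.bin", "shard_002.bin", "shard_003.bin"]
print(reorder_shards(shards, env={"SHARD_ORDER": "3,1,0,2"}))
# prints ['shard_003.bin', 'shard_001.bin', 'shard_000.bin', 'shard_002.bin']
```

Keeping the change behind an environment variable means the default sequential behavior is untouched when the variable is unset, which makes A/B comparisons across seeds straightforward.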