PR #650 (open)

-0.0041 BPB by Reordering Training Data (Curriculum Learning)

by abaybektursun
val_bpb: 1.1187
Architecture: Transformer
Optimizer:
Artifact Size: ~15.9 MB

Training Techniques

  • Test-Time Training (Legal TTT); parameters: null
  • Other: reordering training data shards by model perplexity (hardest first) to improve training efficiency and final val_bpb without changing model architecture or hyperparameters; parameters: {"shard_order_env_var": "SHARD_ORDER", "ranking_model": "6-layer, 512d model trained 500 steps on shard 0", "ranking_metric": "cross-entropy loss", "ordering": "descending loss (hardest first)"}
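The ranking procedure described by these parameters can be sketched as follows. This is a minimal illustration, not the PR's actual code: `score_fn` is a hypothetical callable standing in for evaluating the small ranking model (6-layer, 512d, trained 500 steps on shard 0) on each shard and returning its mean cross-entropy loss.

```python
def rank_shards(shard_paths, score_fn):
    """Order shards hardest-first by a difficulty score.

    score_fn(path) -> float is assumed to return the mean cross-entropy
    loss of the small ranking model on that shard; higher loss means the
    shard is harder for the model.
    """
    scored = [(score_fn(path), path) for path in shard_paths]
    # Sort by descending loss: hardest shards come first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [path for _, path in scored]
```

For example, with per-shard losses {shard0: 1.0, shard1: 3.0, shard2: 2.0}, the ranking is shard1, shard2, shard0.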

Novel Contributions

  • Demonstrated that reordering training shards by model difficulty (perplexity) reduces validation BPB by about 0.0033 on average, with no model or hyperparameter changes
  • Showed that both hardest-first and easiest-first shard orderings outperform the default sequential ordering, indicating the default order contains harmful structure
  • Introduced a simple method to rank shards by training a small model briefly on one shard and scoring all shards by cross-entropy loss
  • Highlighted that token frequency statistics fail to capture shard difficulty differences that the model's perplexity reveals
  • Proposed that adaptive or iterative re-ranking of shards during training could further improve results
  • Provided a minimal code change to implement shard reordering via an environment variable
  • Validated improvements across three random seeds with consistent gains
  • Raised the hypothesis that the improvement is due to breaking accidental structure in sequential shard ordering rather than curriculum learning
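The "minimal code change" driven by the `SHARD_ORDER` environment variable could look something like the sketch below. This is a hedged illustration of the mechanism, not the PR's diff; the assumption that `SHARD_ORDER` holds comma-separated shard indices is mine.

```python
import os

def maybe_reorder(shards):
    """Reorder training shards per the SHARD_ORDER env var.

    SHARD_ORDER is assumed to be a comma-separated list of shard
    indices, e.g. "7,3,0". When it is unset or empty, the default
    sequential shard order is kept unchanged.
    """
    spec = os.environ.get("SHARD_ORDER")
    if not spec:
        return shards
    order = [int(i) for i in spec.split(",")]
    return [shards[i] for i in order]
```

Keeping the default path untouched when the variable is unset means baseline runs are unaffected, which matches the claim that no architecture or hyperparameter changes are needed.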