PR #650 (open)

-0.0041 BPB by Reordering Training Data (Curriculum Learning)

by abaybektursun
val_bpb: 1.1187
Architecture: Transformer
Optimizer:
Artifact Size: ~15.9 MB

Training Techniques

  • Test-Time Training (Legal TTT); parameters: null
  • Other: reordering training data shards by model perplexity (hardest first) to improve training efficiency and final val_bpb without changing model architecture or hyperparameters; parameters: {"shard_order_env_var": "SHARD_ORDER", "ranking_model": "6-layer, 512d model trained 500 steps on shard 0", "ranking_metric": "cross-entropy loss", "ordering": "descending loss (hardest first)"}
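The ranking procedure described by these parameters can be sketched as follows. This is a minimal illustration, not the PR's actual code: `score_fn` is a hypothetical callable standing in for evaluating the small ranking model (6-layer, 512d, trained 500 steps on shard 0) on each shard and returning its mean cross-entropy loss.

```python
def rank_shards(shard_paths, score_fn):
    """Order shards hardest-first by a difficulty score.

    score_fn(path) -> float is assumed to return the mean cross-entropy
    loss of the small ranking model on that shard; higher loss means the
    shard is harder for the model.
    """
    scored = [(score_fn(path), path) for path in shard_paths]
    # Sort by descending loss: hardest shards come first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [path for _, path in scored]
```

For example, with per-shard losses {shard0: 1.0, shard1: 3.0, shard2: 2.0}, the ranking is shard1, shard2, shard0.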

Novel Contributions

  • Demonstrated that reordering training shards by model difficulty (perplexity) reduces validation BPB by about 0.0033 on average, with no model or hyperparameter changes
  • Showed that both hardest-first and easiest-first shard orderings outperform the default sequential ordering, indicating the default order contains harmful structure
  • Introduced a simple method to rank shards by training a small model briefly on one shard and scoring all shards by cross-entropy loss
  • Highlighted that token frequency statistics fail to capture shard difficulty differences that the model's perplexity reveals
  • Proposed that adaptive or iterative re-ranking of shards during training could further improve results
  • Provided a minimal code change to implement shard reordering via an environment variable
  • Validated improvements across three random seeds with consistent gains
  • Raised the hypothesis that the improvement is due to breaking accidental structure in sequential shard ordering rather than curriculum learning
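The "minimal code change" driven by the `SHARD_ORDER` environment variable could look something like the sketch below. This is a hedged illustration of the mechanism, not the PR's diff; the assumption that `SHARD_ORDER` holds comma-separated shard indices is mine.

```python
import os

def maybe_reorder(shards):
    """Reorder training shards per the SHARD_ORDER env var.

    SHARD_ORDER is assumed to be a comma-separated list of shard
    indices, e.g. "7,3,0". When it is unset or empty, the default
    sequential shard order is kept unchanged.
    """
    spec = os.environ.get("SHARD_ORDER")
    if not spec:
        return shards
    order = [int(i) for i in spec.split(",")]
    return [shards[i] for i in order]
```

Keeping the default path untouched when the variable is unset means baseline runs are unaffected, which matches the claim that no architecture or hyperparameter changes are needed.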