PR #772
Non-record: Data ordering & selection — negative result on FineWeb
by abaybektursun
val_bpb
1.3055
Architecture
—
Optimizer
—
Artifact Size
—
Training Techniques
Other
other
Shard-level data selection and curriculum-style reordering based on similarity to validation data, including n-gram cosine similarity, Jensen-Shannon divergence, Moore-Lewis cross-entropy difference, domain classifier, val-trained bigram LM cross-entropy, conditional bigram embedding cosine, Wasserstein distance, and importance weighting.
parameters: {"stage":"shard-level selection","num_shards":80,"methods_tested":8}
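A minimal sketch of the scorer the PR found most stable: cross-entropy of a shard under a bigram LM fit on validation text (add-alpha smoothing and the toy strings are assumptions for illustration; the real run operated on tokenized FineWeb shards).

```python
from collections import Counter
import math

def bigram_counts(tokens):
    """Count unigrams and bigrams in a token sequence."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    return uni, bi

def cross_entropy(shard_tokens, val_uni, val_bi, vocab_size, alpha=1.0):
    """Mean negative log-probability of the shard's bigrams under an
    add-alpha-smoothed bigram LM fit on validation tokens.
    Lower = more val-like."""
    total, n = 0.0, 0
    for a, b in zip(shard_tokens, shard_tokens[1:]):
        p = (val_bi[(a, b)] + alpha) / (val_uni[a] + alpha * vocab_size)
        total -= math.log(p)
        n += 1
    return total / max(n, 1)

# Toy check: a val-like shard scores lower than an off-distribution one.
val = "the cat sat on the mat".split()
uni, bi = bigram_counts(val)
shard_a = "the cat sat".split()
shard_b = "quantum flux capacitor".split()
assert cross_entropy(shard_a, uni, bi, vocab_size=6) < \
       cross_entropy(shard_b, uni, bi, vocab_size=6)
```

Per the PR's finding, ranking shards by any such score barely matters, since the shards are statistically near-identical.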
other
Chunk-level selection of training data using bigram LM and neural proxy scoring to keep the top 12% of chunks.
parameters: {"chunk_size_tokens":32768,"total_chunks":244080,"selection_fraction":0.12}
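The chunk-level step reduces to keeping the top-scoring fraction of fixed-size chunks. A hedged sketch (the scores below are placeholders for the bigram/neural proxy scores; the run used 244,080 chunks of 32,768 tokens and kept 12%):

```python
def select_top_chunks(chunk_scores, fraction=0.12):
    """Return the indices of the highest-scoring chunks,
    keeping `fraction` of the total, in original order."""
    k = max(1, int(len(chunk_scores) * fraction))
    order = sorted(range(len(chunk_scores)),
                   key=lambda i: chunk_scores[i], reverse=True)
    return sorted(order[:k])

scores = [0.1, 0.9, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4, 0.0]
assert select_top_chunks(scores, 0.3) == [1, 3, 5]
```

The negative result above suggests the selected 12% is too homogeneous: training loss drops, but validation BPB worsens.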
other
Curriculum learning via hardest-first reordering of training data based on perplexity under a partially trained model.
parameters: {"seeds":[1337,42,2025],"hardware":"8xH100"}
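The reordering itself is a single descending sort by difficulty. A sketch, assuming perplexity under a partially trained proxy model as the difficulty signal (the values here are made up):

```python
def hardest_first(chunks, perplexities):
    """Reorder training chunks so the highest-perplexity
    (hardest) ones are seen first."""
    paired = sorted(zip(perplexities, chunks), key=lambda p: -p[0])
    return [c for _, c in paired]

chunks = ["easy", "medium", "hard"]
ppl = [12.0, 35.5, 90.1]
assert hardest_first(chunks, ppl) == ["hard", "medium", "easy"]
```

Across the three seeds listed above, this ordering produced only noise-level changes in val_bpb.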
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"stride":64}
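Score-first TTT means adapting a copy of the model on each evaluation chunk for a few steps before scoring it. A toy sketch with a scalar "model" and squared loss standing in for the LM (the real run used lr=0.002 for 3 epochs on 32,768-token chunks with stride 64; everything else here is a simplifying assumption):

```python
def ttt_score(chunks, w0, lr=0.002, epochs=3):
    """For each chunk (a list of numbers standing in for targets),
    clone the base weight, take gradient steps on that chunk only,
    then report the adapted loss. The base model is never mutated."""
    scores = []
    for chunk in chunks:
        w = w0  # fresh copy of the base model per chunk
        for _ in range(epochs):
            for x in chunk:
                w -= lr * 2 * (w - x)  # gradient of (w - x)^2
        scores.append(sum((w - x) ** 2 for x in chunk) / len(chunk))
    return scores

# Adaptation lowers the per-chunk loss relative to the unadapted base.
assert ttt_score([[1.0] * 10], 0.0)[0] < ttt_score([[1.0] * 10], 0.0, epochs=0)[0]
```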
Novel Contributions
- Compared 8 shard-scoring methods for FineWeb data selection and found val-trained bigram cross-entropy to be the most stable scorer.
- Showed that shard-level ordering/selection has negligible effect because the 80 FineWeb shards are statistically nearly identical.
- Demonstrated that chunk-level selection worsens validation BPB despite lowering training loss, suggesting diversity is more important than selecting easy text.
- Evaluated hardest-first curriculum learning and found only noise-level changes with no reliable improvement.
- Provided a negative-result analysis arguing that FineWeb is already filtered and that data selection/curriculum methods do not help under cosine LR decay.