PR #2144

open

Non record: Progressive context growth precursor to PR 2014, 12 hours on RTX 4090, val_bpb 0.9697 pre-quant

by simonbissonnette
val_bpb
0.9698
Architecture
Transformer
Optimizer
SGD
Artifact Size
135,431,355 bytes

Training Techniques

Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
Architecture
GQA
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
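As a rough illustration of what the `heads` and `kv_heads` values imply, the sketch below maps each query head to the KV head it would share. It assumes the common convention of contiguous groups; the PR summary does not state the actual layout, and `kv_head_for` is a hypothetical helper name.

```python
def kv_head_for(query_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Return the KV head index shared by a given query head.

    Assumption: query heads are grouped contiguously, so with 8 heads and
    4 KV heads, query heads (0, 1) share KV head 0, (2, 3) share KV head 1, etc.
    """
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    if not 0 <= query_head < heads:
        raise ValueError("query_head out of range")
    group_size = heads // kv_heads  # query heads per KV head
    return query_head // group_size


# Mapping for the configuration listed above (heads=8, kv_heads=4).
mapping = [kv_head_for(h) for h in range(8)]
```

With these parameters each KV head serves two query heads, which halves the key/value projection and cache size relative to full multi-head attention.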
weight tying
No explicit evidence of weight tying was provided.
parameters: null
Test-Time Training
full TTT
parameters: {"epochs":8,"chunk_tokens":32768,"learning_rate":0.005}
LR Schedule
custom
parameters: {"schedule":"1.000@0.000,1.000@0.400,0.500@0.400,0.300@0.500,0.180@0.600,0.110@0.700,0.090@0.800,0.070@1.000"}
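The summary does not document how this schedule string is interpreted. One plausible reading, offered only as an assumption, is that each `value@fraction` pair is a breakpoint at a fraction of total training progress, that values are interpolated linearly between breakpoints, and that the repeated `0.400` fraction marks an abrupt drop from 1.000 to 0.500. The helper names `parse_breakpoints` and `lr_cap_at` are hypothetical.

```python
from typing import List, Tuple

LR_CAP_SPEC = (
    "1.000@0.000,1.000@0.400,0.500@0.400,0.300@0.500,"
    "0.180@0.600,0.110@0.700,0.090@0.800,0.070@1.000"
)


def parse_breakpoints(spec: str) -> List[Tuple[float, float]]:
    """Parse 'value@fraction' items into (fraction, value) pairs, in order."""
    points = []
    for item in spec.split(","):
        value, fraction = item.split("@")
        points.append((float(fraction), float(value)))
    return points


def lr_cap_at(progress: float, spec: str = LR_CAP_SPEC) -> float:
    """Return the assumed LR cap at a training progress in [0, 1].

    Assumption: piecewise-linear interpolation between breakpoints; at a
    repeated fraction (a step), the later value applies.
    """
    points = parse_breakpoints(spec)
    if progress <= points[0][0]:
        return points[0][1]
    for (f0, v0), (f1, v1) in zip(points, points[1:]):
        # Zero-width segments (steps) never match, so the later value wins.
        if f0 <= progress < f1:
            t = (progress - f0) / (f1 - f0)
            return v0 + t * (v1 - v0)
    return points[-1][1]
```

Under this reading the cap stays at 1.000 for the first 40% of training, halves at the 40% mark, and then decays toward 0.070 by the end of the run.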
Other
other
Progressive context growth schedule from 1024 to 8192 tokens during training.
parameters: {"schedule":"1024@0.200,2048@0.750,4096@0.850,8192@1.000"}
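A minimal sketch of how such a schedule could be evaluated follows. It assumes each `length@fraction` entry means "use this context length until that fraction of training progress", which fits the final entry ending at 1.000 but is not confirmed by the summary. `context_length_at` is a hypothetical helper name.

```python
from typing import List, Tuple

CONTEXT_SPEC = "1024@0.200,2048@0.750,4096@0.850,8192@1.000"


def parse_context_schedule(spec: str) -> List[Tuple[float, int]]:
    """Parse 'length@fraction' items into (end_fraction, length) pairs."""
    stages = []
    for item in spec.split(","):
        length, fraction = item.split("@")
        stages.append((float(fraction), int(length)))
    return stages


def context_length_at(progress: float, spec: str = CONTEXT_SPEC) -> int:
    """Return the assumed training context length at a progress in [0, 1].

    Assumption: each stage is active up to and including its end fraction.
    """
    stages = parse_context_schedule(spec)
    for end_fraction, length in stages:
        if progress <= end_fraction:
            return length
    return stages[-1][1]  # past the last breakpoint, keep the final length


# Context length sampled at a few points in the run.
samples = [context_length_at(p) for p in (0.1, 0.5, 0.8, 0.9)]
```

Read this way, the run would spend the first 20% of training at 1024 tokens, 55% at 2048, 10% at 4096, and the final 15% at the full 8192-token context.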
Optimizer
SGD
weight_decay: null
momentum: null
other_params: {"ttt_learning_rate":0.005}
Weight Averaging
EMA
parameters: {"enabled":false}

Novel Contributions

  • Progressive context growth up to 8k context length
  • Custom mid-run learning-rate cap schedule
  • No-EMA training configuration
  • Full test-time training (TTT) run for 8 epochs
  • Castor pretraining recipe using a FineWeb/FineWeb2/FineWeb-Edu mixture with optional CommitPack shards