PR #2144

open

Non record: Progressive context growth precursor to PR 2014, 12 hours on RTX 4090, val_bpb 0.9697 pre-quant

by simonbissonnette

val_bpb

0.9698

Architecture

Transformer

Optimizer

SGD

Artifact Size

135,431,355 bytes

Training Techniques

Sequence Length

sequence_length

train_length: 8192

eval_length: 8192

Architecture

GQA

Uses grouped-query attention with fewer KV heads than attention heads.

parameters: {"heads":8,"kv_heads":4}
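With 8 attention heads and 4 KV heads, each KV head is shared by 2 query heads. The sketch below illustrates that mapping only. The contiguous grouping (heads 0 and 1 share KV head 0, and so on) is a common convention and is an assumption here, since the PR summary does not state the grouping order. The function name is hypothetical.

```python
def kv_head_for_query_head(query_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Return the KV head index used by a query head.

    Assumes contiguous grouping, which the PR summary does not confirm.
    """
    if heads % kv_heads != 0:
        raise ValueError("heads must be divisible by kv_heads")
    if not 0 <= query_head < heads:
        raise ValueError("query_head out of range")
    group_size = heads // kv_heads  # 8 // 4 = 2 query heads per KV head
    return query_head // group_size


# Mapping for the configuration listed above: heads=8, kv_heads=4.
gqa_mapping = [kv_head_for_query_head(h) for h in range(8)]
```

Under this assumption the K and V projections (and any KV cache) hold 4 heads instead of 8, which halves their size relative to standard multi-head attention.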

weight tying

No explicit evidence of weight tying was provided.

parameters: null

Test-Time Training

full TTT

parameters: {"epochs":8,"chunk_tokens":32768,"learning_rate":0.005}
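One way to read these parameters is as a chunked adaptation loop: the evaluation stream is cut into chunks of 32768 tokens, and the model takes 8 passes of SGD at learning rate 0.005 over each chunk. The sketch below is a framework-agnostic outline under that reading. The score-before-adapt ordering and the callback names `score_fn` and `sgd_step_fn` are assumptions for illustration, not details taken from the PR.

```python
from typing import Callable, Iterator, Sequence


def iter_chunks(tokens: Sequence[int], chunk_tokens: int = 32768) -> Iterator[Sequence[int]]:
    """Yield consecutive chunks of at most chunk_tokens tokens."""
    for start in range(0, len(tokens), chunk_tokens):
        yield tokens[start:start + chunk_tokens]


def run_ttt(
    tokens: Sequence[int],
    score_fn: Callable[[Sequence[int]], float],        # hypothetical: returns summed loss on a chunk
    sgd_step_fn: Callable[[Sequence[int], float], None],  # hypothetical: one SGD pass over a chunk
    epochs: int = 8,
    chunk_tokens: int = 32768,
    learning_rate: float = 0.005,
) -> float:
    """Score each chunk, then adapt on it for `epochs` passes (assumed ordering)."""
    total_loss = 0.0
    for chunk in iter_chunks(tokens, chunk_tokens):
        # Score first so the reported loss never uses weights trained on this chunk.
        total_loss += score_fn(chunk)
        for _ in range(epochs):
            sgd_step_fn(chunk, learning_rate)
    return total_loss
```

Scoring before adapting keeps the evaluation causal with respect to the chunk being measured; whether the PR follows this ordering is not stated in the summary.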

LR Schedule

custom

parameters: {"schedule":"1.000@0.000,1.000@0.400,0.500@0.400,0.300@0.500,0.180@0.600,0.110@0.700,0.090@0.800,0.070@1.000"}
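The schedule string appears to list `value@progress` pairs, where progress is the fraction of training completed. The following sketch parses it and evaluates the cap at a given progress. It assumes linear interpolation between consecutive points and treats the repeated 0.400 entry as an instantaneous drop from 1.000 to 0.500; the PR summary does not say whether the values are interpolated or held as steps, so both choices are assumptions, as are the function names.

```python
from typing import List, Tuple

LR_CAP_SPEC = (
    "1.000@0.000,1.000@0.400,0.500@0.400,0.300@0.500,"
    "0.180@0.600,0.110@0.700,0.090@0.800,0.070@1.000"
)


def parse_lr_cap_schedule(spec: str) -> List[Tuple[float, float]]:
    """Parse 'value@progress' items into (progress, value) pairs, keeping order."""
    points = []
    for item in spec.split(","):
        value, progress = item.split("@")
        points.append((float(progress), float(value)))
    return points


def lr_cap(points: List[Tuple[float, float]], progress: float) -> float:
    """Cap at a training progress in [0, 1], linearly interpolated (assumed)."""
    if progress <= points[0][0]:
        return points[0][1]
    if progress >= points[-1][0]:
        return points[-1][1]
    for (p0, v0), (p1, v1) in zip(points, points[1:]):
        # Zero-width segments (the duplicate 0.400) never match, which
        # produces the jump from 1.000 to 0.500 at progress 0.4.
        if p0 <= progress < p1:
            t = (progress - p0) / (p1 - p0)
            return v0 + t * (v1 - v0)
    return points[-1][1]


lr_cap_points = parse_lr_cap_schedule(LR_CAP_SPEC)
```

Read this way, the cap stays at 1.000 for the first 40% of training, drops to 0.500, and then decays to 0.070 by the end of the run.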

Other

other

Progressive context growth schedule from 1024 to 8192 tokens during training.

parameters: {"schedule":"1024@0.200,2048@0.750,4096@0.850,8192@1.000"}
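A plausible reading of this string is `length@end_progress`: each context length applies until training progress reaches the listed fraction. The sketch below implements that reading as a step function. The interpretation (in particular that 1024 covers the first 20% of training and 8192 the last 15%) and the function names are assumptions, since the summary gives only the raw string.

```python
from typing import List, Tuple

CONTEXT_SPEC = "1024@0.200,2048@0.750,4096@0.850,8192@1.000"


def parse_context_schedule(spec: str) -> List[Tuple[int, float]]:
    """Parse 'length@end_progress' items into (length, end_progress) pairs."""
    stages = []
    for item in spec.split(","):
        length, end = item.split("@")
        stages.append((int(length), float(end)))
    return stages


def context_length(stages: List[Tuple[int, float]], progress: float) -> int:
    """Context length at a training progress in [0, 1] (assumed step function)."""
    for length, end in stages:
        if progress < end:
            return length
    # At or beyond the final boundary, stay at the last (longest) length.
    return stages[-1][0]


context_stages = parse_context_schedule(CONTEXT_SPEC)
```

Under this reading, more than half of the run (20% to 75% of progress) is spent at 2048 tokens, with the 4096 and 8192 stages confined to the final quarter.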

Optimizer

SGD

weight_decay: null

momentum: null

other_params: {"ttt_learning_rate":0.005}

Weight Averaging

EMA

parameters: {"enabled":false}

Progressive context growth up to 8k context length
Custom mid-run learning-rate cap schedule
No-EMA training configuration
Test-time training (TTT) enabled, with 8 TTT epochs
Castor pretraining recipe using a FineWeb/FineWeb2/FineWeb-Edu mixture with optional CommitPack shards