PR #196

Status: open

Add non-record submission: 8xH100 FineWeb baseline + TTT eval (val_bpb 1.3825)

by sicauzxl
val_bpb: 1.3825
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,818,566 bytes

Training Techniques

Quantization: int8
  bits: 8, scope: all
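The int8 entry above (8 bits, all tensors in scope) is consistent with symmetric per-tensor quantization, which is what makes the small artifact size possible. A minimal sketch, assuming round-to-nearest symmetric quantization; `quantize_int8` and `dequantize_int8` are illustrative names, not the submission's code:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]."""
    scale = np.max(np.abs(w)) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale reproduces it exactly
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover an approximate float tensor from the int8 codes and the scale."""
    return q.astype(np.float32) * scale

w = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)  # close to w, within one quantization step
```

Storing `q` instead of `w` is the 4x size reduction that lets a ~15.8M-parameter-byte artifact fit under the 16,000,000-byte cap.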
Optimizer: Muon
  weight_decay: null, momentum: 0.95
  other_params: {"matrix_lr":0.06,"scalar_lr":0.06,"muon_momentum_warmup_start":0.85,"muon_momentum_warmup_steps":100}
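The `muon_momentum_warmup_*` parameters suggest the Muon momentum is ramped from 0.85 up to its final 0.95 over the first 100 steps. A hedged reading of that schedule as a linear ramp (the function name and exact interpolation are assumptions):

```python
def muon_momentum(step, start=0.85, end=0.95, warmup_steps=100):
    """Linearly ramp optimizer momentum from `start` to `end` over `warmup_steps`."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * (step / warmup_steps)
```

Each optimizer step would query `muon_momentum(step)` before applying its momentum update.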
Test-Time Training: TTT
  parameters: {"run_ttt_eval":1}
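`run_ttt_eval=1` only flags that test-time training is enabled in the evaluation path; the PR does not spell out the mechanism. One common TTT pattern is to take a few gradient steps on the evaluation stream itself before scoring. A toy sketch on a linear least-squares model (everything here is illustrative, not the submission's method):

```python
import numpy as np

def ttt_eval(w, x_eval, y_eval, lr=0.05, steps=10):
    """Toy test-time training: adapt weights on the eval data, then score it."""
    w = w.copy()
    for _ in range(steps):
        grad = 2 * x_eval.T @ (x_eval @ w - y_eval) / len(y_eval)
        w -= lr * grad  # gradient step on the eval stream itself
    return np.mean((x_eval @ w - y_eval) ** 2)  # MSE after adaptation

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))
y = x @ np.array([1.0, -2.0, 0.5, 0.0])
w0 = np.zeros(4)
loss_no_ttt = np.mean((x @ w0 - y) ** 2)
loss_ttt = ttt_eval(w0, x, y)  # lower than loss_no_ttt
```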
Architecture
  KV head count: grouped-query-style attention with fewer KV heads than query heads.
    parameters: {"num_heads":12,"num_kv_heads":4}
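With 12 query heads and 4 KV heads, each K/V head is shared by 3 query heads, shrinking the KV projection (and KV cache) to a third. A shapes-only sketch of the head sharing; the function name is mine:

```python
import numpy as np

def expand_kv_heads(kv, num_heads, num_kv_heads):
    """Repeat each KV head so every query head has a matching K/V slice.

    kv: (num_kv_heads, seq_len, head_dim) -> (num_heads, seq_len, head_dim)
    """
    group_size = num_heads // num_kv_heads  # 12 // 4 = 3 query heads per KV head
    return np.repeat(kv, group_size, axis=0)

k = np.zeros((4, 128, 64))          # 4 KV heads
k_full = expand_kv_heads(k, 12, 4)  # broadcast to all 12 query heads
```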
  MLP3x: MLP hidden width expanded to 3x the model dimension.
    parameters: {"mlp_mult":3}
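`mlp_mult=3` sets the MLP hidden width to 3x the model dimension (versus the common 4x). A shapes-only sketch; `d_model=768` and the ReLU are assumptions, not values from the PR:

```python
import numpy as np

def mlp3x_forward(x, w_in, w_out):
    """Two-layer MLP with hidden width = mlp_mult * d_model."""
    h = np.maximum(x @ w_in, 0.0)  # ReLU stand-in for the actual activation
    return h @ w_out

d_model, mlp_mult = 768, 3  # d_model is an assumption, not stated in the PR
w_in = np.zeros((d_model, mlp_mult * d_model))   # 768 -> 2304
w_out = np.zeros((mlp_mult * d_model, d_model))  # 2304 -> 768
y = mlp3x_forward(np.zeros((1, d_model)), w_in, w_out)
```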
Initialization: q_gain init
  Initializes q_gain to 3.5.
Sequence Length
  train_length: 1024, eval_length: null
Regularization: gradient clipping
  parameters: {"grad_clip_norm":0.5}
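`grad_clip_norm=0.5` caps the global gradient norm before each optimizer step. The standard recipe, sketched here with an assumed helper name:

```python
import numpy as np

def clip_grad_norm(grads, max_norm=0.5):
    """Scale all gradients down so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g * g) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 0.0]), np.array([4.0])]  # global norm = 5.0
clipped, norm = clip_grad_norm(grads, max_norm=0.5)
```

Gradients whose global norm is already below 0.5 pass through unchanged (`scale` stays 1.0).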
Other
  Trains on validation data as part of the setup.
    parameters: {"train_on_val":1}
  Quantization-aware training with delayed start.
    parameters: {"qat_enable":1,"qat_start_frac":0.1}
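`qat_start_frac=0.1` suggests fake quantization is switched on only after the first 10% of training, letting early optimization run in full precision. A hedged sketch of that gating, using a simple symmetric int8 stand-in for the quantizer (names and details are mine):

```python
import numpy as np

def maybe_fake_quant(w, step, total_steps, qat_enable=1, qat_start_frac=0.1):
    """Apply int8 fake quantization only after qat_start_frac of training."""
    if not qat_enable or step < qat_start_frac * total_steps:
        return w  # early training stays in full precision
    scale = np.max(np.abs(w)) / 127.0
    if scale == 0.0:
        scale = 1.0
    return np.clip(np.round(w / scale), -127, 127) * scale

w = np.array([0.3, -0.7, 1.1])
early = maybe_fake_quant(w, step=50, total_steps=1000)  # before 10%: unchanged
late = maybe_fake_quant(w, step=500, total_steps=1000)  # after 10%: quantized
```

In a real QAT loop the quantized values would feed the forward pass while gradients flow to the full-precision weights (straight-through estimator); only the gating logic is the point here.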

Novel Contributions

  • Non-record 8xH100 FineWeb baseline submission evaluated through the official train_gpt.py val_bpb path
  • Quantization-aware training with int8 artifact accounting under the 16,000,000-byte cap
  • TTT-enabled evaluation on the official FineWeb validation logic
  • Documentation of a strong compliant baseline configuration and its reported val_bpb