PR #196

Status: open

Add non-record submission: 8xH100 FineWeb baseline + TTT eval (val_bpb 1.3825)

by sicauzxl
val_bpb: 1.3825
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,818,566 bytes

Training Techniques

Quantization: int8
  bits: 8, scope: all
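The int8 entry above (8 bits, all tensors in scope) is consistent with symmetric per-tensor quantization, which is what makes the small artifact size possible. A minimal sketch, assuming round-to-nearest symmetric quantization; `quantize_int8` and `dequantize_int8` are illustrative names, not the submission's code:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]."""
    scale = np.max(np.abs(w)) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale reproduces it exactly
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover an approximate float tensor from the int8 codes and the scale."""
    return q.astype(np.float32) * scale

w = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)  # close to w, within one quantization step
```

Storing `q` instead of `w` is the 4x size reduction that lets a ~15.8M-parameter-byte artifact fit under the 16,000,000-byte cap.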
Optimizer: Muon
  weight_decay: null, momentum: 0.95
  other_params: {"matrix_lr":0.06,"scalar_lr":0.06,"muon_momentum_warmup_start":0.85,"muon_momentum_warmup_steps":100}
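The `muon_momentum_warmup_*` parameters suggest the Muon momentum is ramped from 0.85 up to its final 0.95 over the first 100 steps. A hedged reading of that schedule as a linear ramp (the function name and exact interpolation are assumptions):

```python
def muon_momentum(step, start=0.85, end=0.95, warmup_steps=100):
    """Linearly ramp optimizer momentum from `start` to `end` over `warmup_steps`."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * (step / warmup_steps)
```

Each optimizer step would query `muon_momentum(step)` before applying its momentum update.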
Test-Time Training: TTT
  parameters: {"run_ttt_eval":1}
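`run_ttt_eval=1` only flags that test-time training is enabled in the evaluation path; the PR does not spell out the mechanism. One common TTT pattern is to take a few gradient steps on the evaluation stream itself before scoring. A toy sketch on a linear least-squares model (everything here is illustrative, not the submission's method):

```python
import numpy as np

def ttt_eval(w, x_eval, y_eval, lr=0.05, steps=10):
    """Toy test-time training: adapt weights on the eval data, then score it."""
    w = w.copy()
    for _ in range(steps):
        grad = 2 * x_eval.T @ (x_eval @ w - y_eval) / len(y_eval)
        w -= lr * grad  # gradient step on the eval stream itself
    return np.mean((x_eval @ w - y_eval) ** 2)  # MSE after adaptation

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))
y = x @ np.array([1.0, -2.0, 0.5, 0.0])
w0 = np.zeros(4)
loss_no_ttt = np.mean((x @ w0 - y) ** 2)
loss_ttt = ttt_eval(w0, x, y)  # lower than loss_no_ttt
```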
Architecture
  KV head count: grouped-query-style attention with fewer KV heads than query heads.
    parameters: {"num_heads":12,"num_kv_heads":4}
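With 12 query heads and 4 KV heads, each K/V head is shared by 3 query heads, shrinking the KV projection (and KV cache) to a third. A shapes-only sketch of the head sharing; the function name is mine:

```python
import numpy as np

def expand_kv_heads(kv, num_heads, num_kv_heads):
    """Repeat each KV head so every query head has a matching K/V slice.

    kv: (num_kv_heads, seq_len, head_dim) -> (num_heads, seq_len, head_dim)
    """
    group_size = num_heads // num_kv_heads  # 12 // 4 = 3 query heads per KV head
    return np.repeat(kv, group_size, axis=0)

k = np.zeros((4, 128, 64))          # 4 KV heads
k_full = expand_kv_heads(k, 12, 4)  # broadcast to all 12 query heads
```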
  MLP3x: MLP hidden width expanded to 3x the model dimension.
    parameters: {"mlp_mult":3}
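`mlp_mult=3` sets the MLP hidden width to 3x the model dimension (versus the common 4x). A shapes-only sketch; `d_model=768` and the ReLU are assumptions, not values from the PR:

```python
import numpy as np

def mlp3x_forward(x, w_in, w_out):
    """Two-layer MLP with hidden width = mlp_mult * d_model."""
    h = np.maximum(x @ w_in, 0.0)  # ReLU stand-in for the actual activation
    return h @ w_out

d_model, mlp_mult = 768, 3  # d_model is an assumption, not stated in the PR
w_in = np.zeros((d_model, mlp_mult * d_model))   # 768 -> 2304
w_out = np.zeros((mlp_mult * d_model, d_model))  # 2304 -> 768
y = mlp3x_forward(np.zeros((1, d_model)), w_in, w_out)
```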
Initialization: q_gain init
  Initializes q_gain to 3.5.
Sequence Length
  train_length: 1024, eval_length: null
Regularization: gradient clipping
  parameters: {"grad_clip_norm":0.5}
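`grad_clip_norm=0.5` caps the global gradient norm before each optimizer step. The standard recipe, sketched here with an assumed helper name:

```python
import numpy as np

def clip_grad_norm(grads, max_norm=0.5):
    """Scale all gradients down so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g * g) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 0.0]), np.array([4.0])]  # global norm = 5.0
clipped, norm = clip_grad_norm(grads, max_norm=0.5)
```

Gradients whose global norm is already below 0.5 pass through unchanged (`scale` stays 1.0).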
Other
  Trains on validation data as part of the setup.
    parameters: {"train_on_val":1}
  Quantization-aware training with delayed start.
    parameters: {"qat_enable":1,"qat_start_frac":0.1}
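`qat_start_frac=0.1` suggests fake quantization is switched on only after the first 10% of training, letting early optimization run in full precision. A hedged sketch of that gating, using a simple symmetric int8 stand-in for the quantizer (names and details are mine):

```python
import numpy as np

def maybe_fake_quant(w, step, total_steps, qat_enable=1, qat_start_frac=0.1):
    """Apply int8 fake quantization only after qat_start_frac of training."""
    if not qat_enable or step < qat_start_frac * total_steps:
        return w  # early training stays in full precision
    scale = np.max(np.abs(w)) / 127.0
    if scale == 0.0:
        scale = 1.0
    return np.clip(np.round(w / scale), -127, 127) * scale

w = np.array([0.3, -0.7, 1.1])
early = maybe_fake_quant(w, step=50, total_steps=1000)  # before 10%: unchanged
late = maybe_fake_quant(w, step=500, total_steps=1000)  # after 10%: quantized
```

In a real QAT loop the quantized values would feed the forward pass while gradients flow to the full-precision weights (straight-through estimator); only the gating logic is the point here.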

Novel Contributions

  • Non-record 8xH100 FineWeb baseline submission evaluated through the official train_gpt.py val_bpb path
  • Quantization-aware training with int8 artifact accounting under the 16,000,000-byte cap
  • TTT-enabled evaluation on the official FineWeb validation logic
  • Documentation of a strong compliant baseline configuration and its reported val_bpb