PR #179

open

Record: 11L, int6+zstd, decoupled WD (val_bpb = 1.1472)

by devin-cogView on GitHub
val_bpb
1.1472
Architecture
GPT
Optimizer
Muon
Artifact Size
15,905,331 bytes

Training Techniques

Quantization
int6
bits: 6
scope: MLP and attention weights; embeddings kept in fp16
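A minimal sketch of per-row symmetric int6 quantization as described above (6-bit signed range [-32, 31], one scale per row). The rounding/packing details are assumptions; the PR's actual implementation may differ:

```python
import numpy as np

def quantize_int6_per_row(w):
    """Per-row symmetric int6 quantization: each row gets its own scale."""
    qmax = 31  # 6-bit signed range is [-32, 31]; use 31 for symmetry
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale
```

Per-row scales keep the worst-case rounding error at half a quantization step per row, rather than letting one outlier row inflate the error everywhere.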
Architecture
GQA / KV head count
GPT with grouped-query attention, using fewer key/value heads than query heads
parameters: {"layers":11,"num_heads":8,"num_kv_heads":4}
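The head layout above (8 query heads sharing 4 KV heads) can be sketched as follows; causal masking is omitted for brevity, and the dimensions other than head counts are illustrative:

```python
import numpy as np

# Head layout from the PR: 8 query heads, 4 KV heads -> each KV head
# is shared by a group of 2 query heads, halving KV cache size.
num_heads, num_kv_heads, head_dim, T = 8, 4, 16, 10
group = num_heads // num_kv_heads

rng = np.random.default_rng(0)
q = rng.standard_normal((num_heads, T, head_dim))
k = rng.standard_normal((num_kv_heads, T, head_dim))
v = rng.standard_normal((num_kv_heads, T, head_dim))

# Broadcast each KV head across its group of query heads
k_rep = np.repeat(k, group, axis=0)  # (num_heads, T, head_dim)
v_rep = np.repeat(v, group, axis=0)

scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
w = np.exp(scores - scores.max(-1, keepdims=True))
attn = w / w.sum(-1, keepdims=True)
out = attn @ v_rep  # (num_heads, T, head_dim)
```

Only the K/V projections shrink; the query side and the output shape are unchanged, which is what makes GQA cheap in parameters without reducing the number of attention patterns.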
Optimizer
Muon
weight_decay: 0.038
momentum: 0.99
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.03,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
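A sketch of the two Muon-side pieces above: decoupled weight decay (applied directly to the weights, AdamW-style, independent of the update direction) and the linear momentum warmup from 0.92 to 0.99 over 1500 steps. `update` here stands in for Muon's orthogonalized momentum step, which is not reproduced:

```python
def decoupled_wd_step(param, update, lr, weight_decay=0.038):
    """Decoupled weight decay: shrink weights directly, then apply the
    optimizer update. Hypothetical sketch of the PR's technique."""
    return param * (1.0 - lr * weight_decay) - lr * update

def momentum_at(step, start=0.92, end=0.99, warmup_steps=1500):
    """Linear momentum warmup matching the PR's settings."""
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```

Decoupling keeps the shrinkage toward zero proportional to the weight itself rather than folding it into the (orthogonalized) gradient, which is the property the "reduce quantization gap" contribution relies on.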
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
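A sketch of sliding-window evaluation with stride 64: each window scores only its last `stride` tokens, with the rest of the window serving as context, so every token is predicted with near-full context. The window size is an assumption (taken equal to train_length=2048):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """Return (context_start, score_start, score_end) triples whose scored
    ranges partition [0, n_tokens). Sketch; window=2048 assumed."""
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        ctx_start = max(0, end - window)  # up to window-stride context tokens
        spans.append((ctx_start, pos, end))
        pos = end
    return spans
```

This trades roughly `window / stride` (here 32x) more forward passes for a lower, context-rich loss, which is why it appears as an evaluation technique rather than a training one.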
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmup and warmdown
parameters: {"warmup_steps":1500,"warmdown_steps":3000}
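The schedule above reads as a trapezoid: linear warmup over 1500 steps, a constant plateau, then linear warmdown over the last 3000 steps. A sketch under that assumption (the exact shape and base_lr=0.025, taken from matrix_lr, are not confirmed by the PR):

```python
def lr_at(step, total_steps, base_lr=0.025,
          warmup_steps=1500, warmdown_steps=3000):
    """Trapezoidal LR schedule: warmup, plateau, warmdown to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps      # linear warmup
    if step >= total_steps - warmdown_steps:
        return base_lr * (total_steps - step) / warmdown_steps  # warmdown
    return base_lr                                      # plateau
```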
Regularization
weight decay
parameters: {"weight_decay":0.038}
Other
other
Aside (non-record) submission: trained only on the validation shard (val-only training)
parameters: {"train_files":"fineweb_val_*.bin"}

Novel Contributions

  • Decoupled weight decay in Muon to reduce the quantization gap
  • 11-layer GPT with GQA to fit under 16MB
  • Int6 per-row quantization with fp16 embeddings
  • Sliding window evaluation with stride 64
  • Higher learning rate / tuned Muon settings for improved convergence
  • Val-only training aside demonstrating the approach
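The 16MB budget in the contributions above can be checked directly against the reported artifact size, assuming the limit means 16 MiB:

```python
artifact_bytes = 15_905_331     # reported artifact size from this PR
budget = 16 * 1024 * 1024       # 16 MiB = 16,777,216 bytes (assumed limit)
headroom = budget - artifact_bytes
assert artifact_bytes < budget  # fits with headroom to spare
```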