val_bpb: 1.1884
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,937,608 bytes
Training Techniques

Quantization: mixed int8/fp16
- bits: 8
- scope: all weights except tok_emb.weight, which is kept in fp16; blocks.5. selectively coarsened
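A minimal sketch of what this mixed int8/fp16 packing could look like, assuming symmetric per-tensor int8 quantization with a single scale per tensor (the exact scheme is not specified here); the state-dict names are hypothetical stand-ins:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8: store int8 codes plus a single float scale."""
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return (q.astype(np.float32) * scale).astype(np.float16)

# Hypothetical tiny state dict mirroring the scope rule above.
rng = np.random.default_rng(0)
state = {
    "tok_emb.weight": rng.normal(size=(16, 8)).astype(np.float32),
    "blocks.0.attn.weight": rng.normal(size=(8, 8)).astype(np.float32),
}
packed = {}
for name, w in state.items():
    if name == "tok_emb.weight":
        packed[name] = ("fp16", w.astype(np.float16))   # exception: kept in fp16
    else:
        packed[name] = ("int8",) + quantize_int8(w)     # everything else -> int8
```

With 8-bit codes the worst-case rounding error per weight is half a quantization step (scale / 2), which is why keeping the embedding table in fp16 can matter for the final bpb.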
Architecture: tied embeddings
- Input and output embeddings are tied.
- parameters: {"tie_embeddings": 1}
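Weight tying means the output head reuses the input embedding matrix instead of learning a separate vocab-sized projection. A minimal sketch (shapes are illustrative, not the model's):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 32, 8
tok_emb = rng.normal(size=(vocab, d_model)).astype(np.float32)

def embed(token_ids):
    return tok_emb[token_ids]        # input embedding: row lookup

def unembed(hidden):
    return hidden @ tok_emb.T        # output head reuses the same matrix

h = embed(np.array([3, 7]))
scores = unembed(h)                  # shape (2, vocab), no separate lm_head weights
```

Besides the quality effect, tying halves the embedding-related parameter count, which directly shrinks the artifact size reported above.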
KV head count
- Uses grouped-query style attention with fewer KV heads than query heads.
- parameters: {"num_heads": 8, "num_kv_heads": 4}
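With num_heads=8 and num_kv_heads=4, each KV head is shared by a group of two query heads. A single-token-batch sketch of the sharing (dimensions here are illustrative):

```python
import numpy as np

num_heads, num_kv_heads, head_dim, seq = 8, 4, 16, 4
group = num_heads // num_kv_heads    # 2 query heads per KV head

rng = np.random.default_rng(0)
q = rng.normal(size=(num_heads, seq, head_dim))
k = rng.normal(size=(num_kv_heads, seq, head_dim))
v = rng.normal(size=(num_kv_heads, seq, head_dim))

# Broadcast each KV head to its group of query heads, then attend as usual.
k_exp = np.repeat(k, group, axis=0)
v_exp = np.repeat(v, group, axis=0)

scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_exp                # shape (num_heads, seq, head_dim)
```

Halving the KV heads halves the KV cache and the K/V projection parameters, a useful trade at a 4096-token training length.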
Optimizer: Muon
- weight_decay: null
- momentum: 0.99
- other_params: {"warmup_start": 0.92, "warmup_steps": 1500}
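One plausible reading of these parameters, which would need confirming against the run's train_gpt.py, is a momentum warmup: ramp from warmup_start (0.92) to the final momentum (0.99) over warmup_steps. A sketch under that assumption:

```python
def momentum_at(step, start=0.92, end=0.99, warmup_steps=1500):
    """Hypothetical linear momentum warmup; the exact schedule shape is an
    assumption, not taken from the submission."""
    if step >= warmup_steps:
        return end
    return start + (step / warmup_steps) * (end - start)
```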
LR Schedule: warmdown
- parameters: {"warmdown_steps": 3000}
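A warmdown schedule typically holds the learning rate constant and then decays it linearly to zero over the final warmdown_steps. A sketch under that common interpretation (base_lr and total_steps are placeholders, not values from the run):

```python
def lr_at(step, total_steps, base_lr=1.0, warmdown_steps=3000):
    """Hold base_lr, then decay linearly to 0 over the last warmdown_steps."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```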
Sequence Length
- train_length: 4096
- eval_length: 4096
Compression: zlib
- level: null
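With level null, zlib presumably runs at its default compression level. Low-entropy int8 weight codes compress well, which is how the reported artifact size is reached; a sketch on stand-in bytes:

```python
import zlib
import numpy as np

# Stand-in for the packed int8 weight bytes (not the real checkpoint).
rng = np.random.default_rng(0)
payload = rng.integers(-8, 8, size=65536, dtype=np.int8).tobytes()

compressed = zlib.compress(payload)   # no level argument -> zlib default
restored = zlib.decompress(compressed)

assert restored == payload            # lossless round trip
assert len(compressed) < len(payload)
```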
Novel Contributions
- Adds a new 10-minute 8xH100 record for the long-context lane
- Uses seq_len=4096 training with TRAIN_BATCH_TOKENS=393216
- Keeps tok_emb.weight in fp16 while coarsening only blocks.5. to recover bytes
- Tunes the Muon schedule for the submission
- Includes canonical run plus reproducibility reruns and exact train_gpt.py snapshot
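Assuming TRAIN_BATCH_TOKENS counts tokens per optimizer step, the figures in the list above pin down the per-step sequence count; a quick arithmetic check:

```python
TRAIN_BATCH_TOKENS = 393216
SEQ_LEN = 4096

# 393216 tokens per step / 4096 tokens per sequence = 96 sequences per step.
assert TRAIN_BATCH_TOKENS % SEQ_LEN == 0
sequences_per_step = TRAIN_BATCH_TOKENS // SEQ_LEN
```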