PR #74

open

Add seq4096 fp16 tok coarsen record

by takhir-iota
val_bpb: 1.1884
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,937,608 bytes

Training Techniques

Quantization
mixed int8/fp16
bits: 8
scope: weights quantized to int8, except tok_emb.weight, which is kept in fp16; blocks.5. selectively coarsened
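A minimal sketch of what this mixed int8/fp16 coarsening could look like, assuming a state dict of numpy arrays; the function name, the per-tensor scaling scheme, and the `only_prefix` filter are illustrative assumptions, not the PR's actual code:

```python
import numpy as np

def coarsen_state_dict(state, keep_fp16=("tok_emb.weight",), only_prefix=None):
    """Quantize each tensor to int8 with a per-tensor scale; tensors named
    in keep_fp16 (or outside only_prefix, when given) stay in fp16."""
    out = {}
    for name, w in state.items():
        if name in keep_fp16 or (only_prefix and not name.startswith(only_prefix)):
            out[name] = w.astype(np.float16)  # e.g. tok_emb.weight stays fp16
        else:
            scale = float(np.abs(w).max()) / 127.0
            scale = scale if scale > 0 else 1.0
            q = np.round(w / scale).astype(np.int8)
            out[name] = (q, np.float16(scale))  # int8 payload + fp16 scale
    return out
```

Passing `only_prefix="blocks.5."` would match the selective coarsening described above, leaving all other blocks untouched.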
Architecture
tied embeddings
Input and output embeddings are tied.
parameters: {"tie_embeddings":1}
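A minimal sketch of what `tie_embeddings=1` means in practice: the output projection reuses the input embedding matrix, so the artifact carries only one copy of those weights. The class and attribute names here are illustrative, not taken from `train_gpt.py`:

```python
import numpy as np

class TinyLM:
    """Illustrative weight tying: lm_head reuses tok_emb, so there is no
    separate output-projection matrix to store or train."""
    def __init__(self, vocab_size, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.tok_emb = rng.standard_normal((vocab_size, dim)).astype(np.float32)
        self.lm_head = self.tok_emb  # tied: the same array object

    def logits(self, hidden):
        # (T, dim) @ (dim, vocab) -> (T, vocab), via the shared matrix
        return hidden @ self.lm_head.T
```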
KV head count
Uses grouped-query style attention with fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
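With `num_heads=8` and `num_kv_heads=4`, each K/V head is shared by two query heads. One common way to realize this (an assumption; the actual kernel may broadcast instead of materializing copies) is to repeat each K/V head across its group:

```python
import numpy as np

def expand_kv(kv, num_heads=8, num_kv_heads=4):
    """Grouped-query attention sketch: each of num_kv_heads K/V heads
    serves num_heads // num_kv_heads query heads.
    kv shape: (num_kv_heads, T, head_dim)."""
    group = num_heads // num_kv_heads  # 8 // 4 = 2 query heads per K/V head
    return np.repeat(kv, group, axis=0)  # -> (num_heads, T, head_dim)
```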
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"warmup_start":0.92,"warmup_steps":1500}
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
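Combining the parameters above (`warmup_start=0.92`, `warmup_steps=1500`, `warmdown_steps=3000`), the schedule could be sketched as a trapezoid: a short warmup to the peak rate, a flat middle, and a linear warmdown to zero. The exact shape is an assumption; only the parameter names and values come from the PR:

```python
def lr_scale(step, total_steps, warmup_start=0.92,
             warmup_steps=1500, warmdown_steps=3000):
    """Hypothetical trapezoidal LR multiplier matching the PR's config names."""
    if step < warmup_steps:
        # linear warmup from warmup_start up to 1.0
        return warmup_start + (1.0 - warmup_start) * step / warmup_steps
    if step > total_steps - warmdown_steps:
        # linear warmdown to 0 over the final warmdown_steps
        return max(0.0, (total_steps - step) / warmdown_steps)
    return 1.0
```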
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
Compression
zlib
level: null
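The `level: null` above suggests zlib's default compression level is used when packaging the artifact. A sketch of that step, assuming the serialized state dict arrives as raw bytes (the function name is illustrative; `level=-1` is zlib's documented default):

```python
import zlib

def compress_artifact(raw: bytes, level: int = -1):
    """Compress serialized weights with zlib; level=-1 is the library default,
    consistent with the unspecified (null) level in the record metadata."""
    packed = zlib.compress(raw, level)
    return packed, len(packed)
```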

Novel Contributions

  • Adds a new 10-minute 8xH100 record for the long-context lane
  • Uses seq_len=4096 training with TRAIN_BATCH_TOKENS=393216
  • Keeps tok_emb.weight in fp16 while coarsening only blocks.5. to recover bytes
  • Tunes the Muon schedule for the submission (momentum 0.99, warmup_start 0.92 over 1500 steps, 3000-step warmdown)
  • Includes canonical run plus reproducibility reruns and exact train_gpt.py snapshot
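The batch geometry implied by the second bullet works out as follows, assuming tokens are split evenly across the 8 H100s (the per-GPU split is an inference, not stated in the PR):

```python
# Batch arithmetic for the seq_len=4096 long-context lane.
TRAIN_BATCH_TOKENS = 393216
SEQ_LEN = 4096
GPUS = 8  # 8xH100, per the record lane

seqs_per_batch = TRAIN_BATCH_TOKENS // SEQ_LEN  # 393216 / 4096 = 96 sequences
seqs_per_gpu = seqs_per_batch // GPUS           # 96 / 8 = 12 per GPU (assumed even split)
```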