PR #649

Status: open

Record: 1.2073 bpb • 11L • gold6 • 8xH100

by pall23-mech
val_bpb: 1.2073
Architecture: Transformer
Optimizer: Muon
Artifact Size: under 16 MB

Training Techniques

Architecture
  • tied embeddings: embedding weights are tied to the output weights to reduce parameters (parameters: null)
  • BigramHash: bigram hash embedding used to improve embedding efficiency (parameters: null)
  • RoPE: rotary positional embeddings with rope_dims=16 (parameters: {"dimensions":16})
  • XSA: cross self-attention enabled on the last 4 layers (parameters: {"layers":4})
  • KV head count: 8 attention heads with 4 key-value heads (GQA) (parameters: {"attention_heads":8,"kv_heads":4})
  • layerwise residual mixing: applied across layers (parameters: null)
  • LN scaling: LayerNorm scaling enabled (parameters: null)
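The GQA layout above (8 query heads sharing 4 KV heads) can be sketched as follows; the head dimension and sequence length here are illustrative choices, not values from the PR:

```python
import numpy as np

n_head, n_kv_head, d_head, T = 8, 4, 16, 4   # d_head and T are assumed
group = n_head // n_kv_head                   # each KV head serves 2 query heads

rng = np.random.default_rng(0)
q = rng.standard_normal((n_head, T, d_head))
k = rng.standard_normal((n_kv_head, T, d_head))
v = rng.standard_normal((n_kv_head, T, d_head))

# Grouped-query attention: expand the KV heads so each query head
# attends to its group's shared K/V.
k_full = np.repeat(k, group, axis=0)          # (8, T, d_head)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d_head)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)     # softmax over keys
out = weights @ v_full                        # (8, T, d_head)
```

Halving the KV heads shrinks the KV cache and the K/V projection weights while keeping 8-way query diversity.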
Optimizer
  • Muon (weight_decay: null, momentum: null)
    other_params: {"momentum_warmup_steps":20,"Adam/AdamW":"used for embeddings, scalar params, head params"}

Weight Averaging
  • EMA (parameters: null)
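The EMA weight averaging reported above can be sketched as below; the decay constant is an assumption, since the PR does not state one:

```python
# Exponential moving average of model weights, kept alongside the
# trained weights and swapped in for evaluation.
def ema_update(ema, params, decay=0.99):  # decay=0.99 is an assumed value
    return {k: decay * ema[k] + (1 - decay) * params[k] for k in ema}

ema = {"w": 0.0}
for step in range(1, 4):
    params = {"w": float(step)}  # stand-in for the current trained weight
    ema = ema_update(ema, params)
```

In practice the EMA copy lags the raw weights, smoothing out late-training noise before the final validation pass.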
Quantization
  • mixed int6 (bits: 6, scope: all)

Compression
  • zstd (level: 22)
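A minimal sketch of symmetric per-row int6 quantization, assuming the symmetric range [-31, 31]; the bit-packing and zstd-22 compression of the quantized artifact are omitted, and any detail beyond "mixed int6, per-row scales" is an assumption:

```python
import numpy as np

def quantize_int6_rows(w):
    # One scale per row: map the row's max magnitude onto the int6 range.
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale[scale == 0] = 1.0                    # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64)).astype(np.float32)
q, s = quantize_int6_rows(w)
w_hat = dequantize(q, s)
```

Per-row scales bound the round-trip error at half a quantization step per row, which keeps large-magnitude rows from blowing up the error of small-magnitude ones.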
Sequence Length
  • train_length: 2048, eval_length: null

LR Schedule
  • warmup (parameters: {"warmup_steps":20})
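The warmup schedule above (warmup_steps=20) can be sketched as a linear ramp; the base learning rate and the flat post-warmup behavior are assumptions, as the PR only specifies the warmup length:

```python
def lr_at(step, base_lr=0.02, warmup_steps=20):  # base_lr is an assumed value
    # Linear warmup from base_lr/warmup_steps up to base_lr, then hold.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```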

Novel Contributions

  • Use of mixed int6 quantization with per-row scales, combined with zstd-22 compression, to fit under the 16 MB artifact-size limit
  • Tuned 11-layer GPT model with 8 attention heads and 4 KV heads (GQA) trained on 8x H100 GPUs under a strict 600-second wallclock limit
  • Empirical finding that smaller global batch size (TRAIN_BATCH_TOKENS=262144) yields better validation bpb on degraded multi-GPU H100 infrastructure compared to larger batch sizes
  • Use of Muon optimizer with tuned momentum warmup for matrix parameters and Adam/AdamW for embeddings and scalar parameters
  • Application of EMA to final weights for improved validation performance
  • Inclusion of bigram hash embedding and layerwise residual mixing with LN scaling
  • Use of RoPE with rope_dims=16 and cross self-attention (XSA) enabled on the last 4 layers
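The partial RoPE mentioned above (rope_dims=16) can be sketched as rotating only the first 16 channels of each head and passing the rest through unchanged; that split is a common reading of "rope_dims", and the exact convention in this PR is an assumption:

```python
import numpy as np

def rope_partial(x, rope_dims=16, base=10000.0):
    # x: (T, d_head)-like activations; only x[:, :rope_dims] gets rotated.
    T, d = x.shape
    half = rope_dims // 2
    pos = np.arange(T)[:, None]
    freqs = base ** (-np.arange(half) / half)  # per-pair rotation frequencies
    ang = pos * freqs                          # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rot, x[:, rope_dims:]], axis=1)

x = np.random.default_rng(0).standard_normal((8, 64))
y = rope_partial(x)
```

Since the transform is a pure rotation of channel pairs, it preserves norms in the rotated slice and leaves position 0 unchanged.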