val_bpb: 1.2073
Architecture: Transformer
Optimizer: Muon
Artifact Size: under 16 MB
Training Techniques

Architecture
- tied embeddings: embedding weights are tied to the output weights to reduce parameter count (parameters: null)
- BigramHash: bigram hash embedding used to improve embedding efficiency (parameters: null)
- RoPE: rotary positional embeddings with rope_dims=16 (parameters: {"dimensions": 16})
- XSA: cross self-attention enabled on the last 4 layers (parameters: {"layers": 4})
- KV head count: 8 attention heads with 4 key-value heads (GQA) (parameters: {"attention_heads": 8, "kv_heads": 4})
- layerwise residual mixing: applied (parameters: null)
- LN scaling: LayerNorm scaling enabled (parameters: null)
Optimizer: Muon
- weight_decay: null
- momentum: null
- other_params: {"momentum_warmup_steps": 20, "Adam/AdamW": "used for embeddings, scalar params, head params"}
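Muon applies momentum to matrix-shaped parameters and then orthogonalizes the resulting update with a quintic Newton-Schulz iteration before the descent step. A minimal sketch, with the iteration coefficients taken from the public Muon reference implementation; the learning rate and momentum values below are placeholders (the summary leaves them null), and the 20-step momentum warmup is omitted for brevity.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a 2-D update via the quintic
    Newton-Schulz iteration used by Muon (coefficients from the
    public reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)       # Frobenius normalization
    transposed = x.shape[0] > x.shape[1]
    if transposed:                           # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x

def muon_step(w, grad, buf, lr=0.02, momentum=0.95):
    """One Muon step for a matrix parameter: momentum accumulation,
    then orthogonalized descent. lr/momentum are placeholders (the
    summary leaves them null); the 20-step momentum warmup is omitted."""
    buf = momentum * buf + grad
    return w - lr * newton_schulz_orthogonalize(buf), buf
```

Non-matrix parameters (embeddings, scalars, head) fall back to Adam/AdamW, as the summary notes, because orthogonalization is only meaningful for 2-D weights.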
Weight Averaging: EMA (parameters: null)
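EMA weight averaging keeps a shadow copy of the weights, updated each step as a decayed average; the shadow copy is what gets evaluated and shipped. The summary leaves the parameters null, so the decay below is a placeholder.

```python
def ema_update(shadow, params, decay=0.999):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * params.
    decay=0.999 is a placeholder; the summary does not state a value."""
    return {k: decay * shadow[k] + (1.0 - decay) * params[k] for k in shadow}
```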
Quantization: mixed int6 (bits: 6, scope: all)
Compression: zstd (level: 22)
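A sketch of the int6 per-row-scale scheme named in the contributions: each row gets one floating-point scale so its values land in the 6-bit signed range [-31, 31]. The submission's exact packing and storage format are not specified; this is a minimal illustration.

```python
import numpy as np

def quantize_int6_per_row(w):
    """Symmetric per-row int6 quantization: one fp32 scale per row,
    integers clipped to the 6-bit signed range [-31, 31]. A sketch
    of the named scheme; the exact artifact format is not specified."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)   # guard all-zero rows
    q = np.clip(np.rint(w / scale), -31, 31).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

The values are stored in int8 here for simplicity; a real sub-16 MB artifact would bit-pack the 6-bit codes and then compress the buffer with zstd at level 22, as the summary states.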
Sequence Length
- train_length: 2048
- eval_length: null
LR Schedule: warmup (parameters: {"warmup_steps": 20})
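With warmup_steps=20, a linear warmup schedule might look like the following; holding the learning rate constant after warmup is an assumption, since the summary specifies only the warmup.

```python
def lr_at(step, base_lr, warmup_steps=20):
    """Linear LR warmup over the first warmup_steps steps
    (warmup_steps=20 per the summary); constant base_lr afterwards
    is an assumption -- only the warmup is specified."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```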
Novel Contributions
- Mixed int6 quantization with per-row scales, combined with zstd level-22 compression, to fit the artifact under the 16 MB size limit
- Tuned 11-layer GPT model with 8 attention heads and 4 KV heads (GQA), trained on 8x H100 GPUs under a strict 600-second wallclock limit
- Empirical finding that a smaller global batch size (TRAIN_BATCH_TOKENS=262144) yields better validation bpb than larger batch sizes on degraded multi-GPU H100 infrastructure
- Muon optimizer with tuned momentum warmup for matrix parameters, paired with Adam/AdamW for embeddings and scalar parameters
- EMA applied to the final weights for improved validation performance
- Bigram hash embedding and layerwise residual mixing with LN scaling
- RoPE with rope_dims=16 and cross self-attention (XSA) enabled on the last 4 layers
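The partial RoPE mentioned above (rope_dims=16) can be sketched by rotating only the first 16 dimensions of each attention head and passing the rest through unchanged. The half-split pairing convention and base 10000 below are assumptions; the summary fixes only the rotated dimension count.

```python
import numpy as np

def apply_rope(x, rope_dims=16, base=10000.0):
    """Partial RoPE: rotate only the first rope_dims of each head
    dimension (rope_dims=16 per the summary), pass the rest through.
    The half-split pairing and base=10000 are assumptions.
    x: (seq, n_heads, head_dim)."""
    seq = x.shape[0]
    half = rope_dims // 2
    freqs = base ** (-np.arange(half) / half)           # per-pair frequencies
    angles = np.arange(seq)[:, None] * freqs[None, :]   # (seq, half)
    cos = np.cos(angles)[:, None, :]                    # broadcast over heads
    sin = np.sin(angles)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:rope_dims]
    return np.concatenate(
        [x1 * cos - x2 * sin,        # 2-D rotation of each (x1, x2) pair
         x1 * sin + x2 * cos,
         x[..., rope_dims:]],        # non-rotary dims untouched
        axis=-1)
```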