val_bpb: 1.2073
Architecture: Transformer
Optimizer: Muon
Artifact Size: under 16 MB
Training Techniques

Architecture
- tied embeddings: embedding weights are tied to the output weights to reduce parameter count (parameters: null)
- BigramHash: bigram hash embedding used to improve embedding efficiency (parameters: null)
- RoPE: rotary positional embeddings with rope_dims=16 (parameters: {"dimensions": 16})
- XSA: cross self-attention enabled on the last 4 layers (parameters: {"layers": 4})
- KV head count: 8 attention heads with 4 key-value heads (GQA) (parameters: {"attention_heads": 8, "kv_heads": 4})
- layerwise residual mixing: applied (parameters: null)
- LN scaling: LayerNorm scaling enabled (parameters: null)
Optimizer: Muon
- weight_decay: null
- momentum: null
- other_params: {"momentum_warmup_steps": 20, "Adam/AdamW": "used for embeddings, scalar params, head params"}
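Muon applies momentum to matrix-shaped parameters and then orthogonalizes the resulting update with a quintic Newton-Schulz iteration before the descent step. A minimal sketch, with the iteration coefficients taken from the public Muon reference implementation; the learning rate and momentum values below are placeholders (the summary leaves them null), and the 20-step momentum warmup is omitted for brevity.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a 2-D update via the quintic
    Newton-Schulz iteration used by Muon (coefficients from the
    public reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)       # Frobenius normalization
    transposed = x.shape[0] > x.shape[1]
    if transposed:                           # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x

def muon_step(w, grad, buf, lr=0.02, momentum=0.95):
    """One Muon step for a matrix parameter: momentum accumulation,
    then orthogonalized descent. lr/momentum are placeholders (the
    summary leaves them null); the 20-step momentum warmup is omitted."""
    buf = momentum * buf + grad
    return w - lr * newton_schulz_orthogonalize(buf), buf
```

Non-matrix parameters (embeddings, scalars, head) fall back to Adam/AdamW, as the summary notes, because orthogonalization is only meaningful for 2-D weights.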
Weight Averaging: EMA (parameters: null)
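EMA weight averaging keeps a shadow copy of the weights, updated each step as a decayed average; the shadow copy is what gets evaluated and shipped. The summary leaves the parameters null, so the decay below is a placeholder.

```python
def ema_update(shadow, params, decay=0.999):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * params.
    decay=0.999 is a placeholder; the summary does not state a value."""
    return {k: decay * shadow[k] + (1.0 - decay) * params[k] for k in shadow}
```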
Quantization: mixed int6 (bits: 6, scope: all)
Compression: zstd (level: 22)
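A sketch of the int6 per-row-scale scheme named in the contributions: each row gets one floating-point scale so its values land in the 6-bit signed range [-31, 31]. The submission's exact packing and storage format are not specified; this is a minimal illustration.

```python
import numpy as np

def quantize_int6_per_row(w):
    """Symmetric per-row int6 quantization: one fp32 scale per row,
    integers clipped to the 6-bit signed range [-31, 31]. A sketch
    of the named scheme; the exact artifact format is not specified."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)   # guard all-zero rows
    q = np.clip(np.rint(w / scale), -31, 31).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

The values are stored in int8 here for simplicity; a real sub-16 MB artifact would bit-pack the 6-bit codes and then compress the buffer with zstd at level 22, as the summary states.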
Sequence Length
- train_length: 2048
- eval_length: null
LR Schedule: warmup (parameters: {"warmup_steps": 20})
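With warmup_steps=20, a linear warmup schedule might look like the following; holding the learning rate constant after warmup is an assumption, since the summary specifies only the warmup.

```python
def lr_at(step, base_lr, warmup_steps=20):
    """Linear LR warmup over the first warmup_steps steps
    (warmup_steps=20 per the summary); constant base_lr afterwards
    is an assumption -- only the warmup is specified."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```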
Novel Contributions
- Mixed int6 quantization with per-row scales, combined with zstd level-22 compression, to fit the artifact under the 16 MB size limit
- Tuned 11-layer GPT model with 8 attention heads and 4 KV heads (GQA), trained on 8x H100 GPUs under a strict 600-second wallclock limit
- Empirical finding that a smaller global batch size (TRAIN_BATCH_TOKENS=262144) yields better validation bpb than larger batch sizes on degraded multi-GPU H100 infrastructure
- Muon optimizer with tuned momentum warmup for matrix parameters, paired with Adam/AdamW for embeddings and scalar parameters
- EMA applied to the final weights for improved validation performance
- Bigram hash embedding and layerwise residual mixing with LN scaling
- RoPE with rope_dims=16 and cross self-attention (XSA) enabled on the last 4 layers
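The partial RoPE mentioned above (rope_dims=16) can be sketched by rotating only the first 16 dimensions of each attention head and passing the rest through unchanged. The half-split pairing convention and base 10000 below are assumptions; the summary fixes only the rotated dimension count.

```python
import numpy as np

def apply_rope(x, rope_dims=16, base=10000.0):
    """Partial RoPE: rotate only the first rope_dims of each head
    dimension (rope_dims=16 per the summary), pass the rest through.
    The half-split pairing and base=10000 are assumptions.
    x: (seq, n_heads, head_dim)."""
    seq = x.shape[0]
    half = rope_dims // 2
    freqs = base ** (-np.arange(half) / half)           # per-pair frequencies
    angles = np.arange(seq)[:, None] * freqs[None, :]   # (seq, half)
    cos = np.cos(angles)[:, None, :]                    # broadcast over heads
    sin = np.sin(angles)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:rope_dims]
    return np.concatenate(
        [x1 * cos - x2 * sin,        # 2-D rotation of each (x1, x2) pair
         x1 * sin + x2 * cos,
         x[..., rope_dims:]],        # non-rotary dims untouched
        axis=-1)
```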