PR #222 (open)
Non-record: WiderMLP + FP16 Embed + Stride-32 (val_bpb=1.1601)
by ansh-deriv
val_bpb: 1.1601
Architecture: GPT
Optimizer: Muon
Artifact Size: 18.97 MB
Training Techniques
Architecture
MLP3x
Wider feedforward network that increases model capacity: mlp_mult=3 expands model_dim 512 to hidden_size 1536.
parameters: {"mlp_mult":3,"hidden_size":1536,"num_layers":10,"model_dim":512,"num_heads":8,"num_kv_heads":4}
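A minimal numpy sketch of the widened feedforward block implied by the parameters above. The expansion factor and dimensions come from the JSON; the choice of activation (ReLU here) and initialization scale are assumptions, since the PR does not state them.

```python
import numpy as np

MODEL_DIM = 512                      # model_dim from the parameters above
MLP_MULT = 3                         # mlp_mult
HIDDEN = MODEL_DIM * MLP_MULT        # 1536, matching hidden_size

rng = np.random.default_rng(0)
w_in = (rng.standard_normal((MODEL_DIM, HIDDEN)) * 0.02).astype(np.float32)
w_out = (rng.standard_normal((HIDDEN, MODEL_DIM)) * 0.02).astype(np.float32)

def mlp(x):
    """Position-wise feedforward: expand to 3x width, apply a nonlinearity
    (ReLU as a stand-in), project back to model_dim."""
    h = np.maximum(x @ w_in, 0.0)
    return h @ w_out

x = rng.standard_normal((4, MODEL_DIM)).astype(np.float32)
y = mlp(x)
```

The extra capacity is concentrated in the MLP: each block's feedforward parameter count scales with mlp_mult, while attention dimensions stay fixed.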
tied embeddings
Token embedding weights tied to the output head, with fp16 passthrough serialization for the embedding matrix (it bypasses the integer quantizer).
parameters: {"fp16_passthrough":true}
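A sketch of what fp16 passthrough for a tied embedding could look like: the output head aliases the embedding array, and serialization stores that one matrix in half precision instead of routing it through the int6/int8 quantizer. The vocab size here is hypothetical; only the tying and the fp16 round-trip are illustrated.

```python
import numpy as np

VOCAB, DIM = 1000, 512               # VOCAB is a made-up stand-in; DIM is model_dim

rng = np.random.default_rng(1)
embed = (rng.standard_normal((VOCAB, DIM)) * 0.02).astype(np.float32)

# Weight tying: the LM head is the same array as the input embedding
# (logits = hidden @ lm_head.T), so only one matrix is stored.
lm_head = embed

# fp16 passthrough: serialize the embedding verbatim in half precision,
# bypassing integer quantization entirely.
serialized = embed.astype(np.float16).tobytes()
restored = np.frombuffer(serialized, dtype=np.float16).reshape(VOCAB, DIM).astype(np.float32)

# The only loss is fp32 -> fp16 rounding; there is no quantization-grid error.
max_err = np.max(np.abs(restored - embed))
```

This trades a larger artifact (16 bits per embedding weight vs. 6-8 elsewhere) for avoiding quantization loss on the matrix that both reads tokens in and produces logits.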
Quantization
mixed int6/int8
bits: 6
scope: layers 2-8 int6; layers 0/1/9 int8 per-row; embeddings fp16
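A small numpy sketch of the mixed scheme described by the scope line: symmetric per-row quantization, with middle layers at 6 bits and edge layers at 8. The per-row symmetric formulation is an assumption; the PR only specifies the bit widths and which layers get which.

```python
import numpy as np

def quantize_per_row(w, bits):
    """Symmetric per-row quantization to the given bit width."""
    qmax = 2 ** (bits - 1) - 1                 # 31 for int6, 127 for int8
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                    # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale.astype(np.float32)         # int6 values still fit in an int8 container

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def bits_for_layer(i):
    # Scope from above: layers 2-8 use int6; layers 0, 1, 9 use int8.
    return 6 if 2 <= i <= 8 else 8

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)
q, s = quantize_per_row(w, bits_for_layer(4))  # layer 4 is in the int6 band
err = np.abs(dequantize(q, s) - w).max()
```

Keeping the first, second, and last layers at int8 is a common hedge: quantization error in the layers nearest the embedding and the logits tends to hurt bpb the most.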
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"warmup_start_momentum":0.92,"warmup_steps":1500,"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03}
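The warmup_start_momentum/warmup_steps pair suggests the momentum is ramped from 0.92 to its final 0.99 over the first 1500 steps. A minimal sketch assuming linear interpolation (the exact curve is not stated in the PR):

```python
def momentum_at(step, warmup_steps=1500, start=0.92, end=0.99):
    """Ramp momentum linearly from `start` to `end` over `warmup_steps`,
    then hold it at `end`. Linear interpolation is an assumption."""
    if step >= warmup_steps:
        return end
    return start + (step / warmup_steps) * (end - start)
```

Starting momentum lower keeps early Muon updates less dominated by stale gradient history while the loss landscape is changing fastest; the separate matrix_lr/scalar_lr/tied_embed_lr entries likewise give the tied embedding its own, higher learning rate.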
Evaluation
stride-based sliding window eval
parameters: {"stride":32,"context_length":4096}
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":1500}
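Given warmup_steps=1500 and warmdown_iters=3000, the schedule is plausibly the common trapezoid: linear warmup, constant plateau, linear warmdown to zero over the final 3000 iterations. The plateau-then-linear-decay shape is an assumption; only the two durations come from the PR.

```python
def lr_scale(step, total_steps, warmup_steps=1500, warmdown_iters=3000):
    """Trapezoidal multiplier on the base learning rate: linear warmup,
    flat plateau, linear warmdown to zero over the last warmdown_iters."""
    if step < warmup_steps:
        return step / warmup_steps
    if step > total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0
```

The same multiplier would apply to each of the matrix/scalar/tied-embedding learning rates listed under the optimizer.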
Compression
zlib
level: null
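The artifact is zlib-compressed after quantized serialization; "level: null" presumably means the default compression level. A minimal round-trip sketch using Python's standard zlib module (the payload here is a stand-in for the serialized checkpoint bytes):

```python
import zlib

payload = bytes(range(256)) * 64     # stand-in for the serialized checkpoint bytes

# level=-1 is zlib's Z_DEFAULT_COMPRESSION; "level: null" above likely means this default.
compressed = zlib.compress(payload, level=-1)
restored = zlib.decompress(compressed)
```

Quantization helps compression twice: int6/int8 rows use fewer bytes outright, and their restricted value range compresses better than raw fp32.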
Novel Contributions
- Wider MLP (MLP_MULT=3) to increase capacity and improve validation bpb.
- fp16 tied-embedding export to avoid quantization loss on the embedding matrix.
- Mixed int6/int8 quantization: int6 on middle layers (2-8), per-row int8 on edge layers (0, 1, 9).
- Stride-32 sliding-window evaluation, so each scored token sees long preceding context for better bpb.
- Tuned Muon optimizer settings, including momentum warmup and separate learning rates for matrix, scalar, and tied-embedding parameters.