PR #1065
open
[Non-Record] Competitive Baseline: 10L GQA + Mixed Int6/Int8 + SWA + Seq4096 (val_bpb=1.1536)
by rithunkp
val_bpb
1.1536
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.74MB
Training Techniques
Architecture
GQA
10-layer Transformer using grouped query attention with 8 query heads and 4 KV heads.
parameters: {"layers":10,"num_heads":8,"num_kv_heads":4,"model_dim":512,"mlp_hidden":1536}
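A minimal sketch of the grouped query attention used here, in numpy rather than the entry's actual training code: each of the 4 KV heads is shared by a group of 2 query heads (8 query heads, model_dim 512, so head_dim 64). The causal mask and shapes are assumptions; the entry does not publish its attention implementation.

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped query attention: q has more heads than k/v; each KV head
    serves a contiguous group of query heads."""
    B, Hq, T, D = q.shape
    Hkv = k.shape[1]
    assert Hq % Hkv == 0
    group = Hq // Hkv
    # Expand each KV head to cover its group of query heads.
    k = np.repeat(k, group, axis=1)                  # (B, Hq, T, D)
    v = np.repeat(v, group, axis=1)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(D)
    # Causal mask: position i attends only to positions j <= i.
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

# Shapes from this entry: 8 query heads, 4 KV heads, head_dim = 512 / 8 = 64.
rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8, 16, 64))
kv = rng.normal(size=(1, 4, 16, 64))
out = gqa_attention(q, kv, kv)
```

Halving the KV heads shrinks the K/V projection weights, which matters under the 16MB artifact cap.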
weight tying
Input and output embeddings share weights.
parameters: null
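Weight tying can be sketched as a single matrix serving both roles (the names below are illustrative, not from the entry):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 1000, 512
W = rng.normal(scale=0.02, size=(vocab, dim))  # one shared matrix

tokens = np.array([3, 17, 42])
x = W[tokens]        # input: embedding lookup
logits = x @ W.T     # output head: reuse the same weights transposed
```

The shared matrix removes a full vocab-by-dim output head from the parameter count.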
Quantization
mixed int6/int8
bits: 6
scope: block weights and embeddings
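The entry does not specify its exact quantization scheme (per-tensor vs. per-channel scales, or which tensors get 6 vs. 8 bits), so the following is a hypothetical symmetric per-tensor sketch showing how the two bit widths trade accuracy:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to a signed `bits`-wide integer grid."""
    qmax = 2 ** (bits - 1) - 1                   # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(512, 512)).astype(np.float32)
q6, s6 = quantize_symmetric(w, bits=6)  # e.g. block weights at 6 bits
q8, s8 = quantize_symmetric(w, bits=8)  # e.g. other tensors kept at 8 bits
err6 = np.abs(dequantize(q6, s6) - w).max()
err8 = np.abs(dequantize(q8, s8) - w).max()
```

Int6 quarters the codebook relative to int8, so its round-off error is roughly 4x larger but each value needs two fewer bits.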
Compression
zstd
level: null
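Quantized integer weights have low entropy, so a general-purpose compressor shrinks the serialized artifact further. A sketch using stdlib `zlib` as a stand-in (the entry uses zstd, e.g. via the third-party `zstandard` package, with an analogous compress call):

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for an int6-quantized weight tensor stored in int8 containers.
q = np.clip(np.round(rng.normal(scale=8, size=512 * 512)), -32, 31).astype(np.int8)

raw = q.tobytes()
compressed = zlib.compress(raw, level=9)  # zstd plays this role in the entry
ratio = len(compressed) / len(raw)
```

Because the quantized values cluster near zero, the compressed stream is well under the raw byte size, which is how a 10-layer model fits in 15.74MB.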
Weight Averaging
SWA
parameters: {"decay":0.4}
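Classic SWA averages checkpoints uniformly; the `decay` parameter here suggests an exponential-moving-average variant. A hypothetical reading, sketched in numpy:

```python
import numpy as np

def ema_update(avg, w, decay=0.4):
    """EMA of weights: decay=0.4 keeps 40% of the running average and
    mixes in 60% of the latest weights (hypothetical reading of the entry)."""
    return decay * avg + (1.0 - decay) * w

rng = np.random.default_rng(0)
avg = rng.normal(size=(4, 4))
for _ in range(10):
    w = rng.normal(size=(4, 4))
    avg = ema_update(avg, w, decay=0.4)

# Sanity check: a constant weight stream converges to that constant.
avg2 = np.zeros((2, 2))
for _ in range(20):
    avg2 = ema_update(avg2, np.ones((2, 2)), decay=0.4)
```

Averaged weights tend to sit in flatter regions of the loss surface, which plausibly makes them more robust to the rounding introduced by int6/int8 quantization.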
Sequence Length
sequence_length
train_length: 4096
eval_length: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.95
other_params: {"matrix_lr":0.04,"scalar_lr":0.04,"embed_lr":0.6,"head_lr":0.008}
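Muon applies momentum updates that are approximately orthogonalized before being applied to 2D matrix parameters (scalars and embeddings fall back to Adam-style updates with the separate learning rates listed above). The core orthogonalization step is a quintic Newton-Schulz iteration; the coefficients below are the widely used Muon reference values, not taken from this entry:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a gradient matrix (the core of Muon)
    via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize so iteration converges
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x

rng = np.random.default_rng(0)
g = rng.normal(size=(64, 64))
o = newton_schulz_orthogonalize(g)
```

After a few iterations the singular values of the update are pushed toward 1, so every direction of the matrix update gets a similar step size regardless of gradient scale.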
Regularization
weight decay
parameters: {"value":0.04}
logit softcap
parameters: {"value":30}
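Logit softcapping with value 30 bounds the logits smoothly via a scaled tanh, which is near-identity for small logits but prevents unbounded growth. A minimal sketch:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Smoothly bound logits to (-cap, cap): cap * tanh(logits / cap)."""
    return cap * np.tanh(logits / cap)

x = np.array([-100.0, -10.0, 0.0, 10.0, 100.0])
y = softcap(x)
```

Small logits pass through almost unchanged (10 maps to about 9.6), while extreme ones saturate below 30 in magnitude, keeping the softmax well-conditioned.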
Novel Contributions
- 10-layer Transformer with GQA under the 16MB constraint
- Mixed int6/int8 quantization with zstandard compression
- Stochastic Weight Averaging for quantization-friendly weights
- Extended training sequence length of 4096
- Muon optimizer for matrix parameters with AdamW for scalars/embeddings
- Weight tying to reduce parameter count