PR #1065
open
[Non-Record] Competitive Baseline: 10L GQA + Mixed Int6/Int8 + SWA + Seq4096 (val_bpb=1.1536)
by rithunkp
val_bpb
1.1536
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.74MB
Training Techniques
Architecture
GQA
10-layer Transformer using grouped query attention with 8 query heads and 4 KV heads.
parameters: {"layers":10,"num_heads":8,"num_kv_heads":4,"model_dim":512,"mlp_hidden":1536}
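A minimal sketch of the grouped query attention used here, in numpy rather than the entry's actual training code: each of the 4 KV heads is shared by a group of 2 query heads (8 query heads, model_dim 512, so head_dim 64). The causal mask and shapes are assumptions; the entry does not publish its attention implementation.

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped query attention: q has more heads than k/v; each KV head
    serves a contiguous group of query heads."""
    B, Hq, T, D = q.shape
    Hkv = k.shape[1]
    assert Hq % Hkv == 0
    group = Hq // Hkv
    # Expand each KV head to cover its group of query heads.
    k = np.repeat(k, group, axis=1)                  # (B, Hq, T, D)
    v = np.repeat(v, group, axis=1)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(D)
    # Causal mask: position i attends only to positions j <= i.
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

# Shapes from this entry: 8 query heads, 4 KV heads, head_dim = 512 / 8 = 64.
rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8, 16, 64))
kv = rng.normal(size=(1, 4, 16, 64))
out = gqa_attention(q, kv, kv)
```

Halving the KV heads shrinks the K/V projection weights, which matters under the 16MB artifact cap.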
weight tying
Input and output embeddings share weights.
parameters: null
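Weight tying can be sketched as a single matrix serving both roles (the names below are illustrative, not from the entry):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 1000, 512
W = rng.normal(scale=0.02, size=(vocab, dim))  # one shared matrix

tokens = np.array([3, 17, 42])
x = W[tokens]        # input: embedding lookup
logits = x @ W.T     # output head: reuse the same weights transposed
```

The shared matrix removes a full vocab-by-dim output head from the parameter count.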
Quantization
mixed int6/int8
bits: 6
scope: block weights and embeddings
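The entry does not specify its exact quantization scheme (per-tensor vs. per-channel scales, or which tensors get 6 vs. 8 bits), so the following is a hypothetical symmetric per-tensor sketch showing how the two bit widths trade accuracy:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to a signed `bits`-wide integer grid."""
    qmax = 2 ** (bits - 1) - 1                   # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(512, 512)).astype(np.float32)
q6, s6 = quantize_symmetric(w, bits=6)  # e.g. block weights at 6 bits
q8, s8 = quantize_symmetric(w, bits=8)  # e.g. other tensors kept at 8 bits
err6 = np.abs(dequantize(q6, s6) - w).max()
err8 = np.abs(dequantize(q8, s8) - w).max()
```

Int6 quarters the codebook relative to int8, so its round-off error is roughly 4x larger but each value needs two fewer bits.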
Compression
zstd
level: null
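Quantized integer weights have low entropy, so a general-purpose compressor shrinks the serialized artifact further. A sketch using stdlib `zlib` as a stand-in (the entry uses zstd, e.g. via the third-party `zstandard` package, with an analogous compress call):

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for an int6-quantized weight tensor stored in int8 containers.
q = np.clip(np.round(rng.normal(scale=8, size=512 * 512)), -32, 31).astype(np.int8)

raw = q.tobytes()
compressed = zlib.compress(raw, level=9)  # zstd plays this role in the entry
ratio = len(compressed) / len(raw)
```

Because the quantized values cluster near zero, the compressed stream is well under the raw byte size, which is how a 10-layer model fits in 15.74MB.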
Weight Averaging
SWA
parameters: {"decay":0.4}
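Classic SWA averages checkpoints uniformly; the `decay` parameter here suggests an exponential-moving-average variant. A hypothetical reading, sketched in numpy:

```python
import numpy as np

def ema_update(avg, w, decay=0.4):
    """EMA of weights: decay=0.4 keeps 40% of the running average and
    mixes in 60% of the latest weights (hypothetical reading of the entry)."""
    return decay * avg + (1.0 - decay) * w

rng = np.random.default_rng(0)
avg = rng.normal(size=(4, 4))
for _ in range(10):
    w = rng.normal(size=(4, 4))
    avg = ema_update(avg, w, decay=0.4)

# Sanity check: a constant weight stream converges to that constant.
avg2 = np.zeros((2, 2))
for _ in range(20):
    avg2 = ema_update(avg2, np.ones((2, 2)), decay=0.4)
```

Averaged weights tend to sit in flatter regions of the loss surface, which plausibly makes them more robust to the rounding introduced by int6/int8 quantization.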
Sequence Length
sequence_length
train_length: 4096
eval_length: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.95
other_params: {"matrix_lr":0.04,"scalar_lr":0.04,"embed_lr":0.6,"head_lr":0.008}
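Muon applies momentum updates that are approximately orthogonalized before being applied to 2D matrix parameters (scalars and embeddings fall back to Adam-style updates with the separate learning rates listed above). The core orthogonalization step is a quintic Newton-Schulz iteration; the coefficients below are the widely used Muon reference values, not taken from this entry:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a gradient matrix (the core of Muon)
    via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize so iteration converges
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x

rng = np.random.default_rng(0)
g = rng.normal(size=(64, 64))
o = newton_schulz_orthogonalize(g)
```

After a few iterations the singular values of the update are pushed toward 1, so every direction of the matrix update gets a similar step size regardless of gradient scale.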
Regularization
weight decay
parameters: {"value":0.04}
logit softcap
parameters: {"value":30}
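Logit softcapping with value 30 bounds the logits smoothly via a scaled tanh, which is near-identity for small logits but prevents unbounded growth. A minimal sketch:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Smoothly bound logits to (-cap, cap): cap * tanh(logits / cap)."""
    return cap * np.tanh(logits / cap)

x = np.array([-100.0, -10.0, 0.0, 10.0, 100.0])
y = softcap(x)
```

Small logits pass through almost unchanged (10 maps to about 9.6), while extreme ones saturate below 30 in magnitude, keeping the softmax well-conditioned.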
Novel Contributions
- 10-layer Transformer with GQA under the 16MB constraint
- Mixed int6/int8 quantization with zstandard compression
- Stochastic Weight Averaging for quantization-friendly weights
- Extended training sequence length of 4096
- Muon optimizer for matrix parameters with AdamW for scalars/embeddings
- Weight tying to reduce parameter count