PR #1436 (open)

[Non-Record] 26.5M Int6 QAT + EMA (Pending Compute)

by DevWizard-Vandan
val_bpb: 1.5546
Architecture: Transformer
Optimizer: Muon
Artifact Size: 10,030,236 bytes (~9.6 MiB)

Training Techniques

Quantization: QAT (bits: 6, scope: all)
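
The PR records QAT at 6 bits across all weights but no further details. A minimal sketch of what int6 quantization-aware training can look like: symmetric per-tensor fake quantization with a straight-through estimator. The scaling scheme, clamp range, and names are assumptions, not the PR's actual code.

```python
import torch

def fake_quant_int6(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor fake quantization to 6 bits (levels in [-31, 31])."""
    qmax = 2 ** (6 - 1) - 1                       # 31
    scale = w.abs().max().clamp(min=1e-8) / qmax  # per-tensor scale
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    w_q = q * scale
    # Straight-through estimator: the forward pass sees quantized weights,
    # gradients flow to the full-precision master copy unchanged.
    return w + (w_q - w).detach()

class QATLinear(torch.nn.Linear):
    """Linear layer whose weight is fake-quantized on every forward pass."""
    def forward(self, x):
        return torch.nn.functional.linear(x, fake_quant_int6(self.weight), self.bias)
```

During training the loss is computed against the quantized weights while the optimizer updates the full-precision masters, which is what makes the int8-style export below nearly lossless.
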
Weight Averaging: EMA (parameters: not specified)
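
Since the PR lists EMA with no parameters, here is a minimal sketch of exponential moving averaging of model weights, with an assumed decay of 0.999:

```python
import copy
import torch

class EMA:
    """Exponential moving average of model weights; the decay value is assumed."""
    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # shadow = decay * shadow + (1 - decay) * current
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p, 1.0 - self.decay)
```

Evaluation and artifact serialization would then read from `ema.shadow` rather than the live model, matching the "EMA evaluation and serialization" contribution listed below.
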
Evaluation: sliding-window eval (parameters: not specified)
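
A hedged sketch of sliding-window evaluation: overlapping 2048-token windows where only targets not already scored contribute, so most tokens are predicted with long context. The window matches the listed eval_length; the stride value and function are assumptions. This returns nats per token; the reported val_bpb additionally requires dividing by ln 2 and by bytes per token.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll(model, tokens, window=2048, stride=1024):
    """Mean NLL (nats/token) over a long token stream via overlapping windows."""
    model.eval()
    total, count, scored_upto = 0.0, 0, 0
    for start in range(0, tokens.size(0) - 1, stride):
        ids = tokens[start : start + window + 1]
        if ids.size(0) < 2:
            break
        logits = model(ids[:-1].unsqueeze(0)).squeeze(0)   # (T, vocab)
        nll = F.cross_entropy(logits, ids[1:], reduction="none")
        first_target = start + 1                           # absolute index
        keep = nll[max(scored_upto - first_target, 0):]    # skip already-scored targets
        total += keep.sum().item()
        count += keep.numel()
        scored_upto = start + ids.size(0)                  # next unscored target
    return total / count  # divide by ln 2 for bits per token
```
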
Optimizer: Muon (weight_decay, momentum, other params: not specified)
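
The PR records no Muon hyperparameters, but the optimizer itself is public: momentum on the raw gradient followed by Newton-Schulz orthogonalization of the update for 2-D hidden weight matrices (embeddings and the output head are typically handled by AdamW). A condensed sketch, with the learning rate and momentum values assumed:

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Quintic Newton-Schulz iteration that approximately orthogonalizes G
    (coefficients from the public Muon reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)
    if tall := X.size(0) > X.size(1):
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return (X.T if tall else X).to(G.dtype)

@torch.no_grad()
def muon_step(p: torch.Tensor, buf: torch.Tensor, lr: float = 0.02, beta: float = 0.95):
    """One Muon update for a 2-D hidden weight (lr/beta values assumed)."""
    buf.mul_(beta).add_(p.grad)             # momentum accumulation
    p.add_(newton_schulz(buf), alpha=-lr)   # orthogonalized descent step
```
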
Architecture
  • weight tying: tied input embeddings and output head weights
  • GQA: grouped query attention with fewer KV heads than query heads (num_heads: 8, num_kv_heads: 4)
  • MLP3x: MLP hidden width expanded to 3x (a combined sketch of these three entries follows)
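
A combined sketch of the three architecture entries: GQA with the listed 8 query / 4 KV heads, a 3x-wide MLP, and tied embedding/head weights. The model width and vocabulary size are illustrative; the PR does not state them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_MODEL, N_HEADS, N_KV_HEADS, VOCAB = 512, 8, 4, 50304  # D_MODEL, VOCAB assumed

class GQAttention(nn.Module):
    """Grouped-query attention: 8 query heads share 4 KV heads (PR parameters)."""
    def __init__(self, d=D_MODEL, h=N_HEADS, kv=N_KV_HEADS):
        super().__init__()
        self.h, self.kv, self.hd = h, kv, d // h
        self.wq = nn.Linear(d, h * self.hd, bias=False)
        self.wkv = nn.Linear(d, 2 * kv * self.hd, bias=False)
        self.wo = nn.Linear(h * self.hd, d, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.h, self.hd).transpose(1, 2)
        k, v = self.wkv(x).view(B, T, 2, self.kv, self.hd).unbind(2)
        # Repeat each KV head across h // kv query heads.
        k = k.transpose(1, 2).repeat_interleave(self.h // self.kv, dim=1)
        v = v.transpose(1, 2).repeat_interleave(self.h // self.kv, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(y.transpose(1, 2).reshape(B, T, -1))

class MLP3x(nn.Module):
    """Feed-forward block with hidden width 3 * d_model (vs. the usual 4x)."""
    def __init__(self, d=D_MODEL):
        super().__init__()
        self.up = nn.Linear(d, 3 * d, bias=False)
        self.down = nn.Linear(3 * d, d, bias=False)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

# Weight tying: the output head reuses the input embedding matrix.
emb = nn.Embedding(VOCAB, D_MODEL)
head = nn.Linear(D_MODEL, VOCAB, bias=False)
head.weight = emb.weight  # one shared parameter tensor
```
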
Sequence Length: train_length 2048, eval_length 2048
Compression: zstd (level: not specified; see the export sketch below)
Other: EMA evaluation/serialization with int8-style artifact export and roundtrip validation
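
A sketch of what the zstd-compressed int8-style export with roundtrip validation might look like: per-tensor symmetric int8 quantization, zstd compression of the serialized artifact, and a load-and-compare check. The compression level, tolerance, and all function names are assumptions (the PR leaves the zstd level unspecified).

```python
import io
import torch
import zstandard as zstd  # third-party: pip install zstandard

def export_artifact(state_dict, path, level=19):
    """Quantize each tensor to int8 with a per-tensor scale, then zstd-compress."""
    packed = {}
    for name, t in state_dict.items():
        w = t.detach().float().cpu()
        scale = float(w.abs().max().clamp(min=1e-8)) / 127.0
        packed[name] = (torch.round(w / scale).to(torch.int8), scale)
    buf = io.BytesIO()
    torch.save(packed, buf)
    with open(path, "wb") as f:
        f.write(zstd.ZstdCompressor(level=level).compress(buf.getvalue()))

def load_artifact(path):
    """Decompress and dequantize back to float tensors."""
    with open(path, "rb") as f:
        raw = zstd.ZstdDecompressor().decompress(f.read())
    packed = torch.load(io.BytesIO(raw))
    return {name: q.float() * scale for name, (q, scale) in packed.items()}

def roundtrip_check(state_dict, path, tol=1e-2):
    """Validate that export -> load reconstructs weights within quantization error."""
    restored = load_artifact(path)
    for name, t in state_dict.items():
        err = (restored[name] - t.detach().float().cpu()).abs().max().item()
        assert err <= tol, f"{name}: max abs error {err:.3g} exceeds tol {tol}"
```
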

Novel Contributions

  • 26.5M-parameter GPT variant
  • int6 QAT
  • EMA evaluation and serialization
  • sliding-window validation
  • Muon optimizer with weight decay tuning
  • 2048-token context
  • 12-layer, 3x MLP architecture
  • zstd-compressed int8-style artifact export