val_bpb: 1.5546
Architecture: Transformer
Optimizer: Muon
Artifact Size: 10030236 bytes (~10.0 MB)
Training Techniques
- Quantization: QAT (bits: 6, scope: all)
- Weight Averaging: EMA (parameters not recorded)
- Evaluation: sliding-window eval (parameters not recorded)
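The int6 QAT and EMA techniques above can be sketched as follows. This is a minimal illustration, assuming symmetric per-tensor quantization and a placeholder EMA decay; the run's actual scheme and hyperparameters are not recorded:

```python
import numpy as np

def fake_quant(w, bits=6):
    # Fake quantization for QAT: round weights to a symmetric int6 grid,
    # then dequantize, so the forward pass sees quantized precision while
    # the master weights stay in float. Symmetric per-tensor scaling is an
    # assumption here, not a recorded detail of the run.
    qmax = 2 ** (bits - 1) - 1                      # 31 levels per side for 6 bits
    scale = max(float(np.abs(w).max()) / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

def ema_update(ema_w, w, decay=0.999):
    # EMA weight averaging: shadow weights track a moving average of the
    # training weights; the decay value is a placeholder.
    return decay * ema_w + (1.0 - decay) * w
```

During training, gradients would typically flow through `fake_quant` via a straight-through estimator, and evaluation would use the EMA shadow weights rather than the raw training weights.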
Optimizer
- Muon (weight_decay, momentum, and other parameters not recorded)
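Muon's distinctive step is orthogonalizing the momentum buffer of each 2-D weight matrix with a Newton-Schulz iteration. A minimal sketch, assuming the quintic coefficients from the public Muon reference implementation; the learning rate and momentum here are placeholders, since the run's values are not recorded:

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Approximately orthogonalize G (push its singular values toward 1)
    # using the quintic Newton-Schulz iteration used by Muon.
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the reference impl
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so the iteration converges
    transpose = G.shape[0] > G.shape[1]
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def muon_step(W, grad, buf, lr=0.02, momentum=0.95):
    # Momentum SGD whose update direction is orthogonalized per matrix.
    buf = momentum * buf + grad
    W = W - lr * newton_schulz(buf)
    return W, buf
```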
Architecture
- Weight tying: tied input embedding and output head weights
- GQA: grouped-query attention with fewer KV heads than query heads (num_heads: 8, num_kv_heads: 4)
- MLP3x: MLP width expanded to 3x
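With num_heads = 8 and num_kv_heads = 4, each K/V head is shared by two query heads. A minimal non-causal sketch of grouped-query attention (masking and batching omitted for brevity):

```python
import numpy as np

def gqa_attention(q, k, v, num_heads=8, num_kv_heads=4):
    # q: (T, num_heads, d); k, v: (T, num_kv_heads, d).
    # Each KV head is broadcast to num_heads // num_kv_heads query heads.
    group = num_heads // num_kv_heads
    k = np.repeat(k, group, axis=1)       # (T, num_heads, d)
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)  # softmax over key positions
    return np.einsum('hts,shd->thd', w, v)
```

The payoff is a smaller KV cache: halving the KV heads halves the cached K/V tensors at inference time while keeping all 8 query heads.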
Sequence Length
- train_length: 2048
- eval_length: 2048
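Since both train and eval context are 2048 tokens, text longer than the context is scored with the sliding-window evaluation listed above: the window advances by a stride, and only the tokens past the overlap are counted, so each token is scored once with up to a full window of left context. A sketch, assuming a hypothetical `token_nll` scorer and a placeholder stride:

```python
def sliding_window_nll(token_nll, tokens, window=2048, stride=512):
    # token_nll(chunk) -> per-token NLLs for that chunk (hypothetical scorer).
    # The first window scores all its tokens; later windows score only the
    # tokens past the overlap, so every token is counted exactly once.
    total, count, start = 0.0, 0, 0
    while True:
        end = min(start + window, len(tokens))
        nlls = token_nll(tokens[start:end])
        new_from = 0 if start == 0 else window - stride
        total += sum(nlls[new_from:])
        count += len(nlls) - new_from
        if end == len(tokens):
            return total / count
        start += stride
```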
Compression
- zstd (level not recorded)
Other
- EMA evaluation/serialization with int8-style artifact export and roundtrip validation
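The int8-style export with roundtrip validation can be sketched as a per-tensor scale plus a raw int8 payload, with the roundtrip check bounding reconstruction error by half a quantization step. The format details here are assumptions for illustration, not the run's actual serializer; in the real artifact the payload would additionally be zstd-compressed:

```python
import numpy as np

def export_int8(w):
    # Hypothetical int8-style export: per-tensor symmetric scale + int8 bytes.
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return scale, q.tobytes(), w.shape

def load_int8(scale, payload, shape):
    # Inverse of export_int8: decode the bytes and rescale to float.
    return np.frombuffer(payload, dtype=np.int8).reshape(shape).astype(np.float32) * scale

def roundtrip_ok(w):
    # Roundtrip validation: error should be at most half a quant step.
    scale, payload, shape = export_int8(w)
    w2 = load_int8(scale, payload, shape)
    return float(np.abs(w - w2).max()) <= 0.5 * scale + 1e-9
```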
Novel Contributions
- 26.5M-parameter GPT variant
- int6 QAT
- EMA evaluation and serialization
- sliding-window validation
- Muon optimizer with weight decay tuning
- 2048-token context
- 12-layer, 3x MLP architecture
- zstd-compressed int8-style artifact export