PR #1085
Status: open
Non-record: 11L GQA + MLP 3x + Partial RoPE + Int6 Attn/MLP + QAT40
by adityasasidhar
val_bpb: 1.2831
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.04 MiB
Training Techniques
Architecture
GQA
Uses grouped query attention with fewer KV heads than query heads.
parameters: {"query_heads":8,"kv_heads":4}
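The GQA entry above can be sketched in NumPy with the reported 8 query heads sharing 4 KV heads. This is a minimal illustration, not the PR's code: the weight shapes, causal mask, and single-sequence layout are assumptions.

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_q_heads=8, n_kv_heads=4):
    """Grouped-query attention: each KV head serves
    n_q_heads // n_kv_heads query heads (here 8 // 4 = 2)."""
    seq, d_model = x.shape
    head_dim = d_model // n_q_heads
    group = n_q_heads // n_kv_heads

    q = (x @ wq).reshape(seq, n_q_heads, head_dim)
    k = (x @ wk).reshape(seq, n_kv_heads, head_dim)
    v = (x @ wv).reshape(seq, n_kv_heads, head_dim)

    causal = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group  # query head h reads from shared KV head kv
        scores = q[:, h] @ k[:, kv].T / np.sqrt(head_dim)
        scores[causal] = -np.inf  # mask future positions
        probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        out[:, h] = probs @ v[:, kv]
    return out.reshape(seq, d_model)
```

Halving the KV heads halves the K/V projection weights and the KV cache, which matters under a tight artifact-size budget.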
Partial RoPE
Applies rotary positional embeddings only to part of the head dimension.
parameters: {"dimensions":16}
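A sketch of partial RoPE with the reported 16 rotated dimensions per head. The half-split pairing convention and frequency base are assumptions; implementations also commonly interleave pairs instead.

```python
import numpy as np

def partial_rope(q, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to only the first `rot_dims`
    of each head dimension; the remaining dims pass through unrotated."""
    seq, head_dim = q.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.arange(seq)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)

    x1, x2 = q[:, :half], q[:, half:rot_dims]  # paired halves to rotate
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, q[:, rot_dims:]], axis=-1)
```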
XSA
Extended attention applied in the final 4 layers.
parameters: {"layers":4}
MLP3x
Expands the MLP hidden dimension by a factor of 3.
parameters: {"expansion_factor":3}
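The 3x MLP block amounts to a d_model → 3·d_model → d_model projection pair. A minimal sketch; the tanh-approximated GELU activation is an assumption, since the PR only specifies the expansion factor.

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    """MLP block with 3x expansion: d_model -> 3*d_model -> d_model."""
    h = x @ w_in  # (seq, 3*d_model)
    # tanh-approximated GELU (assumed activation)
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h ** 3)))
    return h @ w_out  # (seq, d_model)
```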
depth
Uses an 11-layer model.
parameters: {"layers":11}
Quantization
mixed int6/int8
bits: 6
scope: attention and MLP projection weights in int6; smaller tensors kept in int8/fp16
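One way to realize int6 weight quantization is symmetric per-tensor rounding to the 64-level [-32, 31] grid, stored in int8 containers. The PR does not state its exact scheme (per-tensor vs. per-channel, symmetric vs. asymmetric), so this is an assumed sketch:

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization (assumed scheme).
    int6 covers [-32, 31]; values are stored in int8 containers
    but constrained to the 6-bit grid."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int6 codes back to floats."""
    return q.astype(np.float32) * scale
```

Round-to-nearest bounds the per-weight error by half a quantization step.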
late QAT
bits: null
scope: final 40% of training
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
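Muon's hyperparameters are not reported here, but its core step — orthogonalizing each 2-D gradient via a quintic Newton-Schulz iteration — can be sketched as follows. The coefficients come from the public Muon reference implementation; momentum and learning-rate handling are omitted.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a 2-D gradient matrix with the
    quintic Newton-Schulz iteration used by Muon. Pushes all
    singular values toward 1 without an explicit SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315  # reference-implementation coefficients
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius normalization
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * (A @ A)
        x = a * x + B @ x  # acts as a*s + b*s^3 + c*s^5 on singular values
    return x.T if transposed else x
```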
LR Schedule
warmdown
parameters: null
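"Warmdown" usually denotes holding the learning rate constant and then decaying it linearly to zero near the end of training. The PR reports no schedule parameters, so the base LR and the 30% warmdown fraction below are assumptions:

```python
def warmdown_lr(step, total_steps, base_lr=1.0, warmdown_frac=0.3):
    """Hold base_lr, then decay linearly to 0 over the final
    warmdown_frac of training (shape and fraction assumed)."""
    start = round((1.0 - warmdown_frac) * total_steps)
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - start)
```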
Evaluation
sliding window eval
parameters: {"stride":96}
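With a stride-96 sliding window, the context window advances 96 tokens at a time and only the newest tokens in each window are scored, so every token is evaluated exactly once with substantial left context. A sketch of the span bookkeeping; the 256-token context length is an assumption (only the stride is reported):

```python
def sliding_window_spans(n_tokens, context=256, stride=96):
    """Return (context_start, end, n_scored) spans: each step scores
    the `stride` newest tokens, conditioned on up to `context` tokens."""
    spans, prev_end = [], 0
    while prev_end < n_tokens:
        end = min(prev_end + stride, n_tokens)
        begin = max(0, end - context)
        spans.append((begin, end, end - prev_end))
        prev_end = end
    return spans
```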
Compression
zlib
level: null
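The compression step is plain zlib over the serialized artifact, checked against the size cap. The PR does not report the compression level, so level 9 (maximum) is an assumption:

```python
import zlib

def compress_artifact(blob: bytes, cap_mib: float = 16.0):
    """Compress a serialized-weights blob with zlib and report whether
    it fits under the submission cap (level 9 assumed)."""
    packed = zlib.compress(blob, level=9)
    fits = len(packed) <= cap_mib * 1024 * 1024
    return packed, fits
```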
Novel Contributions
- 11-layer Transformer with GQA
- Partial RoPE
- MLP 3x expansion
- Mixed int6/int8 quantization of attention and MLP weights
- Late QAT for the final 40% of training
- Muon optimizer with warmdown schedule
- Sliding-window evaluation with stride 96
- zlib-compressed submission under the 16 MB cap