PR #1085
Status: open
Non-record: 11L GQA + MLP 3x + Partial RoPE + Int6 Attn/MLP + QAT40
by adityasasidhar
val_bpb: 1.2831
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.04 MiB
Training Techniques
Architecture
GQA
Uses grouped query attention with fewer KV heads than query heads.
parameters: {"query_heads":8,"kv_heads":4}
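The GQA entry above can be sketched in NumPy with the reported 8 query heads sharing 4 KV heads. This is a minimal illustration, not the PR's code: the weight shapes, causal mask, and single-sequence layout are assumptions.

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_q_heads=8, n_kv_heads=4):
    """Grouped-query attention: each KV head serves
    n_q_heads // n_kv_heads query heads (here 8 // 4 = 2)."""
    seq, d_model = x.shape
    head_dim = d_model // n_q_heads
    group = n_q_heads // n_kv_heads

    q = (x @ wq).reshape(seq, n_q_heads, head_dim)
    k = (x @ wk).reshape(seq, n_kv_heads, head_dim)
    v = (x @ wv).reshape(seq, n_kv_heads, head_dim)

    causal = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group  # query head h reads from shared KV head kv
        scores = q[:, h] @ k[:, kv].T / np.sqrt(head_dim)
        scores[causal] = -np.inf  # mask future positions
        probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        out[:, h] = probs @ v[:, kv]
    return out.reshape(seq, d_model)
```

Halving the KV heads halves the K/V projection weights and the KV cache, which matters under a tight artifact-size budget.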
Partial RoPE
Applies rotary positional embeddings only to part of the head dimension.
parameters: {"dimensions":16}
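A sketch of partial RoPE with the reported 16 rotated dimensions per head. The half-split pairing convention and frequency base are assumptions; implementations also commonly interleave pairs instead.

```python
import numpy as np

def partial_rope(q, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to only the first `rot_dims`
    of each head dimension; the remaining dims pass through unrotated."""
    seq, head_dim = q.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.arange(seq)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)

    x1, x2 = q[:, :half], q[:, half:rot_dims]  # paired halves to rotate
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, q[:, rot_dims:]], axis=-1)
```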
XSA
Extended attention applied in the final 4 layers.
parameters: {"layers":4}
MLP3x
Expands the MLP hidden dimension by a factor of 3.
parameters: {"expansion_factor":3}
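The 3x MLP block amounts to a d_model → 3·d_model → d_model projection pair. A minimal sketch; the tanh-approximated GELU activation is an assumption, since the PR only specifies the expansion factor.

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    """MLP block with 3x expansion: d_model -> 3*d_model -> d_model."""
    h = x @ w_in  # (seq, 3*d_model)
    # tanh-approximated GELU (assumed activation)
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h ** 3)))
    return h @ w_out  # (seq, d_model)
```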
depth
Uses an 11-layer model.
parameters: {"layers":11}
Quantization
mixed int6/int8
bits: 6
scope: attention and MLP projection weights in int6; smaller tensors kept in int8/fp16
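One way to realize int6 weight quantization is symmetric per-tensor rounding to the 64-level [-32, 31] grid, stored in int8 containers. The PR does not state its exact scheme (per-tensor vs. per-channel, symmetric vs. asymmetric), so this is an assumed sketch:

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization (assumed scheme).
    int6 covers [-32, 31]; values are stored in int8 containers
    but constrained to the 6-bit grid."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int6 codes back to floats."""
    return q.astype(np.float32) * scale
```

Round-to-nearest bounds the per-weight error by half a quantization step.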
late QAT
bits: null
scope: final 40% of training
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
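Muon's hyperparameters are not reported here, but its core step — orthogonalizing each 2-D gradient via a quintic Newton-Schulz iteration — can be sketched as follows. The coefficients come from the public Muon reference implementation; momentum and learning-rate handling are omitted.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a 2-D gradient matrix with the
    quintic Newton-Schulz iteration used by Muon. Pushes all
    singular values toward 1 without an explicit SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315  # reference-implementation coefficients
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius normalization
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * (A @ A)
        x = a * x + B @ x  # acts as a*s + b*s^3 + c*s^5 on singular values
    return x.T if transposed else x
```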
LR Schedule
warmdown
parameters: null
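"Warmdown" usually denotes holding the learning rate constant and then decaying it linearly to zero near the end of training. The PR reports no schedule parameters, so the base LR and the 30% warmdown fraction below are assumptions:

```python
def warmdown_lr(step, total_steps, base_lr=1.0, warmdown_frac=0.3):
    """Hold base_lr, then decay linearly to 0 over the final
    warmdown_frac of training (shape and fraction assumed)."""
    start = round((1.0 - warmdown_frac) * total_steps)
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - start)
```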
Evaluation
sliding window eval
parameters: {"stride":96}
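With a stride-96 sliding window, the context window advances 96 tokens at a time and only the newest tokens in each window are scored, so every token is evaluated exactly once with substantial left context. A sketch of the span bookkeeping; the 256-token context length is an assumption (only the stride is reported):

```python
def sliding_window_spans(n_tokens, context=256, stride=96):
    """Return (context_start, end, n_scored) spans: each step scores
    the `stride` newest tokens, conditioned on up to `context` tokens."""
    spans, prev_end = [], 0
    while prev_end < n_tokens:
        end = min(prev_end + stride, n_tokens)
        begin = max(0, end - context)
        spans.append((begin, end, end - prev_end))
        prev_end = end
    return spans
```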
Compression
zlib
level: null
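The compression step is plain zlib over the serialized artifact, checked against the size cap. The PR does not report the compression level, so level 9 (maximum) is an assumption:

```python
import zlib

def compress_artifact(blob: bytes, cap_mib: float = 16.0):
    """Compress a serialized-weights blob with zlib and report whether
    it fits under the submission cap (level 9 assumed)."""
    packed = zlib.compress(blob, level=9)
    fits = len(packed) <= cap_mib * 1024 * 1024
    return packed, fits
```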
Novel Contributions
- 11-layer Transformer with GQA
- Partial RoPE
- MLP 3x expansion
- Mixed int6/int8 quantization of attention and MLP weights
- Late QAT for the final 40% of training
- Muon optimizer with warmdown schedule
- Sliding-window evaluation with stride 96
- zlib-compressed submission under the 16 MB cap