val_bpb: 1.0909
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.99 MB
Training Techniques
Quantization
GPTQ
bits: 5
scope: matrix weights
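GPTQ compensates quantization error column by column using second-order (Hessian) information; that machinery is out of scope here, but the 5-bit grid it quantizes onto can be sketched. A minimal round-to-nearest sketch with per-row scales (the per-row scaling granularity is an assumption, not from the log):

```python
import numpy as np

# A 5-bit round-to-nearest grid with per-row scales. GPTQ itself adds
# Hessian-based error compensation on top of a grid like this; only the
# grid and the dequantization step are shown here.
def quantize_rtn(W, bits=5):
    qmax = 2 ** (bits - 1) - 1                      # 15 for int5
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q * scale

W = np.random.default_rng(0).standard_normal((4, 8))
q, scale = quantize_rtn(W)
recon_err = np.abs(dequantize(q, scale) - W).max()
```

Each weight is stored in 5 bits plus a shared scale, and the worst-case reconstruction error is half a quantization step.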
Architecture
MLP
Widened the MLP from 4.0x to 4.8x of model width in one experiment, trading the storage saved by int5 quantization for extra capacity.
parameters: {"multiplier":4.8}
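The 4.8x figure is not arbitrary: MLP weight count scales linearly with the multiplier, and 4.8/4.0 = 6/5 exactly cancels the 5/6 storage ratio of int5 versus int6, so the MLP's contribution to artifact size is held fixed. A back-of-envelope check (the width `d_model` is hypothetical, not from the log):

```python
# Budget arithmetic: widening the MLP 4.0x -> 4.8x adds 20% more MLP
# weights, while int5 stores each weight in 5/6 the bits of int6; the
# two effects cancel, keeping the quantized MLP size unchanged.
d_model = 1024                        # hypothetical width, not from the log

def mlp_weights(mult, d=d_model):
    # standard two-matrix MLP: d -> mult*d -> d
    return 2 * d * mult * d

growth = mlp_weights(4.8) / mlp_weights(4.0)     # 1.2x more weights
storage_ratio = 5 / 6                            # int5 bytes/weight vs int6
net = growth * storage_ratio                     # ~1.0: same quantized size
```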
depth recurrence
Uses depth recurrence as part of the base SOTA stack.
parameters: {"layers":11,"loops":2,"virtual_layers":17}
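The parameters imply a recurring block: 17 virtual layers from 11 physical layers with 2 loops means a block of 17 - 11 = 6 layers is applied twice. Which 6 layers recur is not in the log; looping layers 3..8 as a unit below is an assumption for illustration:

```python
# Depth recurrence: unroll 11 physical layers into 17 virtual layers by
# running one contiguous 6-layer block twice. The position of the looped
# block (layers 3..8) is hypothetical.
def layer_schedule(n_layers=11, loop_start=3, loop_len=6, loops=2):
    pre = list(range(loop_start))
    block = list(range(loop_start, loop_start + loop_len))
    post = list(range(loop_start + loop_len, n_layers))
    return pre + block * loops + post

virtual = layer_schedule()   # sequence of physical layers the forward pass runs
```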
XSA
Uses the XSA attention component in the base architecture.
parameters: null
Partial RoPE
Applies RoPE to only 16 of the 64 head dimensions; the rest are left unrotated.
parameters: {"ratio":"16/64"}
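A sketch of the 16/64 split, assuming the common half-split pairing convention (the log does not specify how rotated dims are paired):

```python
import numpy as np

# Partial RoPE: rotate only the first 16 of 64 head dims; the remaining
# 48 pass through untouched. Half-split pairing is an assumption.
def partial_rope(x, pos, rot_dims=16, base=10000.0):
    rot, rest = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)     # one frequency per pair
    ang = pos[:, None] * freqs[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, rest], axis=-1)

x = np.random.default_rng(0).standard_normal((5, 64))
y = partial_rope(x, np.arange(5))
```

The rotation is norm-preserving on the rotated dims, and position 0 is a no-op.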
LeakyReLU
Uses a squared LeakyReLU activation with negative slope 0.5.
parameters: {"squared":true,"negative_slope":0.5}
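A minimal sketch of the activation. Whether the sign is preserved on the negative branch is not stated in the log; this version squares the output directly:

```python
# Squared LeakyReLU with negative_slope 0.5: apply LeakyReLU, then
# square. Squaring directly maps negative inputs to small positive
# values (a sign-preserving variant is also plausible but unconfirmed).
def leaky_relu_squared(x, negative_slope=0.5):
    leaky = x if x >= 0 else negative_slope * x
    return leaky * leaky
```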
KV head count
Uses fewer KV heads (4) than query heads (8), i.e. grouped-query attention.
parameters: {"heads":8,"kv_heads":4}
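With 8 query heads over 4 KV heads, each KV head is shared by a group of 2 query heads, halving the KV projection weights and KV cache. The grouping convention (consecutive query heads per KV head) is the standard one but is an assumption here:

```python
# Grouped-query attention mapping: 8 query heads share 4 KV heads, so
# each KV head serves 2 consecutive query heads.
def kv_head_for(q_head, n_heads=8, n_kv_heads=4):
    return q_head // (n_heads // n_kv_heads)

groups = [kv_head_for(h) for h in range(8)]
```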
Weight Averaging
EMA
parameters: {"decay":0.9965}
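The EMA keeps a shadow copy of the weights updated once per step; a decay of 0.9965 averages over roughly 1/(1 - 0.9965) ≈ 286 recent steps. A minimal sketch over a dict of parameters:

```python
# One EMA step over the weights: avg <- decay*avg + (1-decay)*current.
# At decay 0.9965 the effective averaging horizon is ~286 steps.
def ema_update(avg, current, decay=0.9965):
    return {k: decay * avg[k] + (1.0 - decay) * current[k] for k in avg}

avg = ema_update({"w": 0.0}, {"w": 1.0})
```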
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R"}
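Muon's core step orthogonalizes the momentum-averaged gradient of each matrix parameter via Newton-Schulz iteration before applying it; what the MuonEq-R variant changes is not in the log. Production Muon uses a tuned quintic polynomial, but the simple cubic iteration below converges to the same orthogonal factor, just more slowly:

```python
import numpy as np

# Simplified Muon orthogonalization: Newton-Schulz iteration driving all
# singular values of the update matrix toward 1. Normalizing by the
# Frobenius norm bounds the spectral norm below 1, which guarantees
# convergence of the cubic X <- 1.5*X - 0.5*X X^T X.
def orthogonalize(G, steps=30):
    X = G / np.linalg.norm(G)           # spectral norm <= Frobenius norm
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

G = np.random.default_rng(0).standard_normal((4, 6))
O = orthogonalize(G)
```

The result satisfies O @ O.T ≈ I, so the weight update has uniform scale across directions regardless of the raw gradient's conditioning.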
Test-Time Training
score-first TTT
parameters: null
Evaluation
sliding window eval
parameters: null
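The log gives no window or stride, but the usual scheme scores a long token stream in overlapping windows, counting loss only on tokens past the overlap so every scored token sees near-full left context. A sketch of the span scheduling with illustrative sizes:

```python
# Sliding-window evaluation spans: each (start, end, score_from) tuple
# means "run the model on tokens [start, end) but count loss only from
# score_from onward". Window and stride are illustrative, not from the log.
def eval_spans(n_tokens, window=1024, stride=512):
    spans, start = [], 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        score_from = start if start == 0 else min(start + window - stride, end)
        spans.append((start, end, score_from))
        if end == n_tokens:
            break
        start += stride
    return spans

spans = eval_spans(2000)
```

Scored regions tile the stream exactly once, so summed loss over scored tokens yields an unbiased val_bpb.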
Regularization
skip gates
parameters: null
Novel Contributions
- Clean ablation showing int5 GPTQ underperforms int6 at the same artifact size
- Measured the int5 quantization penalty at roughly 2x the int6 penalty
- Tested whether the parameter budget freed by int5 could fund a wider (4.8x) MLP
- Showed the wider MLP slows training enough that the extra capacity does not recoup the quantization penalty
- Concluded that int5 GPTQ is not viable for this competition under the current setup