PR #1987

open

Record: MHA Path + 1855 9-hparam Stack + PR #1948 + PR #1855 (val_bpb = 1.06184, 3-seed)

by TimS-ml
val_bpb: 1.0618
Architecture: Transformer
Optimizer: Adam
Artifact Size: ~15.84 MB

Training Techniques

Architecture
MHA
Converted KV=4 GQA to KV=8 MHA, making key/value heads match query heads.
parameters: {"num_kv_heads":8}
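A minimal numpy sketch of the KV-head expansion, assuming the conversion simply duplicates each grouped KV head for its query group (head_dim and d_model below are illustrative, not from the PR):

```python
import numpy as np

def gqa_to_mha_kv(w_kv, num_kv_heads, num_q_heads):
    """Expand a grouped KV projection so every query head gets its own
    KV head, by repeating each KV head once per query in its group.

    w_kv: (num_kv_heads * head_dim, d_model) stacked KV projection rows.
    """
    assert num_q_heads % num_kv_heads == 0
    group = num_q_heads // num_kv_heads
    head_dim = w_kv.shape[0] // num_kv_heads
    heads = w_kv.reshape(num_kv_heads, head_dim, -1)
    expanded = np.repeat(heads, group, axis=0)  # h0,h0,h1,h1,... per group
    return expanded.reshape(num_q_heads * head_dim, -1)

# PR setting: KV=4 GQA -> KV=8 MHA (group size 2)
w = np.random.randn(4 * 16, 64)
w_mha = gqa_to_mha_kv(w, num_kv_heads=4, num_q_heads=8)
```

The duplicated heads leave the function computed by attention unchanged at conversion time; they only untie the heads so later training or quantization can treat them independently.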
LeakyReLU
Used LeakyReLU squared activation with slope 0.3 in the MLP.
parameters: {"slope":0.3}
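A sketch of the activation, assuming "LeakyReLU squared" means an elementwise square of the LeakyReLU output (a sign-preserving y*|y| variant is also plausible; the PR does not say which):

```python
import numpy as np

def leaky_relu_squared(x, slope=0.3):
    # LeakyReLU followed by an elementwise square; slope=0.3 is the
    # setting the PR's sweep selected. Note plain squaring maps the
    # negative branch to small positive values.
    y = np.where(x > 0, x, slope * x)
    return y * y

out = leaky_relu_squared(np.array([2.0, -1.0]))
```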
depth recurrence
Included depth recurrence in layers L3-5.
parameters: {"layers":"L3-5","repeats":2}
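One way to read the setting above, assuming repeats=2 means the L3-5 block executes twice in total, is as an execution schedule over 1-indexed layers:

```python
def depth_recurrent_schedule(num_layers, recur=(3, 5), repeats=2):
    """Return the 1-indexed order in which layers run when the layers
    in `recur` form a block executed `repeats` times in sequence.
    (Whether repeats counts total passes or extra passes is an
    assumption; total passes is used here.)"""
    lo, hi = recur
    order, i = [], 1
    while i <= num_layers:
        if i == lo:
            order.extend(list(range(lo, hi + 1)) * repeats)
            i = hi + 1
        else:
            order.append(i)
            i += 1
    return order

order = depth_recurrent_schedule(8)  # [1,2, 3,4,5, 3,4,5, 6,7,8]
```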
parallel residual lanes
Added parallel residual lanes in later layers.
parameters: {"layers":"L8+"}
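A sketch of one parallel-residual block, assuming the GPT-J-style layout where the attention and MLP lanes read the same normalized input and both add into the residual stream:

```python
import numpy as np

def parallel_residual_block(x, attn, mlp, norm):
    # Both lanes read the same normalized input; their outputs are
    # summed into the residual in parallel rather than sequentially
    # (an assumption about what "parallel residual lanes" means here).
    h = norm(x)
    return x + attn(h) + mlp(h)

# toy lanes to show the dataflow
x = np.array([1.0, 2.0])
out = parallel_residual_block(x, attn=lambda h: 2 * h, mlp=lambda h: 3 * h,
                              norm=lambda h: h)
```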
weight tying
Tied the input embedding and output unembedding weights.
parameters: null
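Weight tying in its usual form, sketched with numpy (vocab and model sizes are illustrative): the unembedding is the same array as the embedding table, so no separate output matrix counts against the artifact budget.

```python
import numpy as np

vocab, d_model = 1000, 64
embedding = np.random.randn(vocab, d_model)
lm_head = embedding  # same object, no copy: one matrix stored, not two

def logits(h):
    # output projection reuses the embedding rows as unembedding vectors
    return h @ lm_head.T
```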
SmearGate
Applied BOS-safe SmearGate.
parameters: null
Gated Attention
Used sparse attention gating with int8 gate quantization.
parameters: {"gate_scale":0.5}
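A sketch of the int8 gate quantization with the PR's gate_scale=0.5; symmetric round-to-nearest and the int8 clip range are assumptions, as is how the dequantized gate multiplies the attention output:

```python
import numpy as np

def quantize_gate_int8(gate, scale=0.5):
    """Fake-quantize gate values to int8 with a fixed scale and
    return both the int8 codes and the dequantized floats."""
    q = np.clip(np.round(gate / scale), -128, 127).astype(np.int8)
    return q, q.astype(np.float32) * scale

gate = np.array([0.26, -0.9, 1.0])
q, deq = quantize_gate_int8(gate)
# the dequantized gate would then scale the attention output elementwise,
# with near-zero gates giving the sparsity
```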
XSA
Used XSA11 architecture variant.
parameters: {"layers":11}
Quantization
GPTQ
bits: 6
scope: all attn and MLP weights
GPTQ
bits: 7
scope: token embeddings
int8
bits: 8
scope: attention gate weights
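The bit budgets above can be illustrated with a plain symmetric per-tensor uniform quantizer; real GPTQ additionally solves per-column rounding against the layer Hessian, which this sketch omits:

```python
import numpy as np

def fake_quantize(w, bits):
    """Symmetric per-tensor uniform quantization to `bits` bits.
    Max absolute rounding error is scale/2 for in-range values."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.linspace(-1, 1, 64).reshape(8, 8)
w6 = fake_quantize(w, 6)   # attn/MLP weight budget
w7 = fake_quantize(w, 7)   # token-embedding budget (finer grid)
```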
Compression
lrzip pergroup
level: null
Test-Time Training
LoRA TTT
parameters: {"rank":80,"batch_size":16,"num_phases":3,"prefix_docs":2500}
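The LoRA side of the TTT setup, sketched with the PR's rank=80; zero-initializing B makes the adapted weight equal the base weight before any test-time steps. The alpha scaling and init scale are assumptions, and batch_size, num_phases, and prefix_docs are training-loop settings not modeled here:

```python
import numpy as np

def init_lora(d_out, d_in, rank=80, seed=0):
    """LoRA factors for test-time training: A small random, B zero,
    so W + (alpha/rank) * B @ A starts exactly at W."""
    rng = np.random.default_rng(seed)
    A = 0.01 * rng.standard_normal((rank, d_in))
    B = np.zeros((d_out, rank))
    return A, B

def adapted(W, A, B, alpha=1.0):
    rank = A.shape[0]
    return W + (alpha / rank) * (B @ A)

W = np.random.default_rng(1).standard_normal((32, 64))
A, B = init_lora(32, 64)
```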
Optimizer
Adam
weight_decay: 0.5
momentum: null
other_params: {"beta2":0.99}
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85}
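Assuming warmdown_frac=0.85 means the final 85% of steps are spent in a linear decay to zero (with constant LR before that), the schedule can be sketched as:

```python
def lr_warmdown(step, total_steps, base_lr, warmdown_frac=0.85):
    """Constant LR, then linear decay to 0 over the last
    warmdown_frac of training (this reading of warmdown_frac
    is an assumption)."""
    start = int(total_steps * (1 - warmdown_frac))
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - start)
```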
Regularization
weight decay
parameters: {"value":0.5}
LN scale
parameters: null

Novel Contributions

  • MHA conversion from KV=4 GQA to KV=8 MHA while staying within the artifact cap
  • Porting the PR #1855 9-hyperparameter tuning stack into the submission
  • LeakyReLU squared slope sweep identifying 0.3 as the best setting
  • GPTQ reverse-Cholesky plus triangular solve path for faster Hinv computation
  • Using lrzip pergroup compression to recover additional byte budget
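The Hinv contribution above can be sketched as computing H^{-1} from a Cholesky factor via triangular solves instead of a direct matrix inverse; the reverse-ordering detail of the PR is omitted in this minimal version:

```python
import numpy as np

def hinv_via_cholesky(H):
    """Invert a symmetric positive-definite Hessian H = L L^T by
    solving triangular systems for L^-1, then forming
    H^-1 = L^-T L^-1 (avoids a general-purpose matrix inverse)."""
    L = np.linalg.cholesky(H)
    Linv = np.linalg.solve(L, np.eye(H.shape[0]))  # triangular system
    return Linv.T @ Linv

# toy SPD Hessian of the kind GPTQ builds from calibration activations
X = np.random.default_rng(0).standard_normal((32, 8))
H = X.T @ X + 1e-3 * np.eye(8)
Hinv = hinv_via_cholesky(H)
```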