PR #1987

open

Record: MHA Path + 1855 9-hparam Stack + PR #1948 + PR #1855 (val_bpb = 1.06184, 3-seed)

by TimS-ml
val_bpb: 1.0618
Architecture: Transformer
Optimizer: Adam
Artifact Size: ~15.84 MB

Training Techniques

Architecture
MHA
Converted KV=4 GQA to KV=8 MHA, making key/value heads match query heads.
parameters: {"num_kv_heads":8}
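A minimal numpy sketch of the KV-head expansion, assuming the conversion simply duplicates each grouped KV head for its query group (head_dim and d_model below are illustrative, not from the PR):

```python
import numpy as np

def gqa_to_mha_kv(w_kv, num_kv_heads, num_q_heads):
    """Expand a grouped KV projection so every query head gets its own
    KV head, by repeating each KV head once per query in its group.

    w_kv: (num_kv_heads * head_dim, d_model) stacked KV projection rows.
    """
    assert num_q_heads % num_kv_heads == 0
    group = num_q_heads // num_kv_heads
    head_dim = w_kv.shape[0] // num_kv_heads
    heads = w_kv.reshape(num_kv_heads, head_dim, -1)
    expanded = np.repeat(heads, group, axis=0)  # h0,h0,h1,h1,... per group
    return expanded.reshape(num_q_heads * head_dim, -1)

# PR setting: KV=4 GQA -> KV=8 MHA (group size 2)
w = np.random.randn(4 * 16, 64)
w_mha = gqa_to_mha_kv(w, num_kv_heads=4, num_q_heads=8)
```

The duplicated heads leave the function computed by attention unchanged at conversion time; they only untie the heads so later training or quantization can treat them independently.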
LeakyReLU
Used LeakyReLU squared activation with slope 0.3 in the MLP.
parameters: {"slope":0.3}
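A sketch of the activation, assuming "LeakyReLU squared" means an elementwise square of the LeakyReLU output (a sign-preserving y*|y| variant is also plausible; the PR does not say which):

```python
import numpy as np

def leaky_relu_squared(x, slope=0.3):
    # LeakyReLU followed by an elementwise square; slope=0.3 is the
    # setting the PR's sweep selected. Note plain squaring maps the
    # negative branch to small positive values.
    y = np.where(x > 0, x, slope * x)
    return y * y

out = leaky_relu_squared(np.array([2.0, -1.0]))
```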
depth recurrence
Included depth recurrence in layers L3-5.
parameters: {"layers":"L3-5","repeats":2}
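One way to read the setting above, assuming repeats=2 means the L3-5 block executes twice in total, is as an execution schedule over 1-indexed layers:

```python
def depth_recurrent_schedule(num_layers, recur=(3, 5), repeats=2):
    """Return the 1-indexed order in which layers run when the layers
    in `recur` form a block executed `repeats` times in sequence.
    (Whether repeats counts total passes or extra passes is an
    assumption; total passes is used here.)"""
    lo, hi = recur
    order, i = [], 1
    while i <= num_layers:
        if i == lo:
            order.extend(list(range(lo, hi + 1)) * repeats)
            i = hi + 1
        else:
            order.append(i)
            i += 1
    return order

order = depth_recurrent_schedule(8)  # [1,2, 3,4,5, 3,4,5, 6,7,8]
```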
parallel residual lanes
Added parallel residual lanes in later layers.
parameters: {"layers":"L8+"}
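A sketch of one parallel-residual block, assuming the GPT-J-style layout where the attention and MLP lanes read the same normalized input and both add into the residual stream:

```python
import numpy as np

def parallel_residual_block(x, attn, mlp, norm):
    # Both lanes read the same normalized input; their outputs are
    # summed into the residual in parallel rather than sequentially
    # (an assumption about what "parallel residual lanes" means here).
    h = norm(x)
    return x + attn(h) + mlp(h)

# toy lanes to show the dataflow
x = np.array([1.0, 2.0])
out = parallel_residual_block(x, attn=lambda h: 2 * h, mlp=lambda h: 3 * h,
                              norm=lambda h: h)
```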
weight tying
Tied the input embedding and output unembedding weights.
parameters: null
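Weight tying in its usual form, sketched with numpy (vocab and model sizes are illustrative): the unembedding is the same array as the embedding table, so no separate output matrix counts against the artifact budget.

```python
import numpy as np

vocab, d_model = 1000, 64
embedding = np.random.randn(vocab, d_model)
lm_head = embedding  # same object, no copy: one matrix stored, not two

def logits(h):
    # output projection reuses the embedding rows as unembedding vectors
    return h @ lm_head.T
```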
SmearGate
Applied BOS-safe SmearGate.
parameters: null
Gated Attention
Used sparse attention gating with int8 gate quantization.
parameters: {"gate_scale":0.5}
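A sketch of the int8 gate quantization with the PR's gate_scale=0.5; symmetric round-to-nearest and the int8 clip range are assumptions, as is how the dequantized gate multiplies the attention output:

```python
import numpy as np

def quantize_gate_int8(gate, scale=0.5):
    """Fake-quantize gate values to int8 with a fixed scale and
    return both the int8 codes and the dequantized floats."""
    q = np.clip(np.round(gate / scale), -128, 127).astype(np.int8)
    return q, q.astype(np.float32) * scale

gate = np.array([0.26, -0.9, 1.0])
q, deq = quantize_gate_int8(gate)
# the dequantized gate would then scale the attention output elementwise,
# with near-zero gates giving the sparsity
```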
XSA
Used XSA11 architecture variant.
parameters: {"layers":11}
Quantization
GPTQ
bits: 6
scope: all attn and MLP weights
GPTQ
bits: 7
scope: token embeddings
int8
bits: 8
scope: attention gate weights
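The bit budgets above can be illustrated with a plain symmetric per-tensor uniform quantizer; real GPTQ additionally solves per-column rounding against the layer Hessian, which this sketch omits:

```python
import numpy as np

def fake_quantize(w, bits):
    """Symmetric per-tensor uniform quantization to `bits` bits.
    Max absolute rounding error is scale/2 for in-range values."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.linspace(-1, 1, 64).reshape(8, 8)
w6 = fake_quantize(w, 6)   # attn/MLP weight budget
w7 = fake_quantize(w, 7)   # token-embedding budget (finer grid)
```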
Compression
lrzip pergroup
level: null
Test-Time Training
LoRA TTT
parameters: {"rank":80,"batch_size":16,"num_phases":3,"prefix_docs":2500}
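The LoRA side of the TTT setup, sketched with the PR's rank=80; zero-initializing B makes the adapted weight equal the base weight before any test-time steps. The alpha scaling and init scale are assumptions, and batch_size, num_phases, and prefix_docs are training-loop settings not modeled here:

```python
import numpy as np

def init_lora(d_out, d_in, rank=80, seed=0):
    """LoRA factors for test-time training: A small random, B zero,
    so W + (alpha/rank) * B @ A starts exactly at W."""
    rng = np.random.default_rng(seed)
    A = 0.01 * rng.standard_normal((rank, d_in))
    B = np.zeros((d_out, rank))
    return A, B

def adapted(W, A, B, alpha=1.0):
    rank = A.shape[0]
    return W + (alpha / rank) * (B @ A)

W = np.random.default_rng(1).standard_normal((32, 64))
A, B = init_lora(32, 64)
```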
Optimizer
Adam
weight_decay: 0.5
momentum: null
other_params: {"beta2":0.99}
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85}
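Assuming warmdown_frac=0.85 means the final 85% of steps are spent in a linear decay to zero (with constant LR before that), the schedule can be sketched as:

```python
def lr_warmdown(step, total_steps, base_lr, warmdown_frac=0.85):
    """Constant LR, then linear decay to 0 over the last
    warmdown_frac of training (this reading of warmdown_frac
    is an assumption)."""
    start = int(total_steps * (1 - warmdown_frac))
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - start)
```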
Regularization
weight decay
parameters: {"value":0.5}
LN scale
parameters: null

Novel Contributions

  • MHA conversion from KV=4 GQA to KV=8 MHA while staying within the artifact cap
  • Porting the PR #1855 9-hyperparameter tuning stack into the submission
  • LeakyReLU squared slope sweep identifying 0.3 as the best setting
  • GPTQ reverse-Cholesky plus triangular solve path for faster Hinv computation
  • Using lrzip pergroup compression to recover additional byte budget
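The Hinv contribution above can be sketched as computing H^{-1} from a Cholesky factor via triangular solves instead of a direct matrix inverse; the reverse-ordering detail of the PR is omitted in this minimal version:

```python
import numpy as np

def hinv_via_cholesky(H):
    """Invert a symmetric positive-definite Hessian H = L L^T by
    solving triangular systems for L^-1, then forming
    H^-1 = L^-T L^-1 (avoids a general-purpose matrix inverse)."""
    L = np.linalg.cholesky(H)
    Linv = np.linalg.solve(L, np.eye(H.shape[0]))  # triangular system
    return Linv.T @ Linv

# toy SPD Hessian of the kind GPTQ builds from calibration activations
X = np.random.default_rng(0).standard_normal((32, 8))
H = X.T @ X + 1e-3 * np.eye(8)
Hinv = hinv_via_cholesky(H)
```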