val_bpb: 1.0702
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.31 MB
Training Techniques
Architecture
Value Residual
Replaces the dense attention value projection with a pair-geometric value projection over the normalized hidden state, using paired halves and signed/summed features.
parameters: {"layers":11,"dimensions":512,"heads":8,"kv_heads":4}
GQA
Uses grouped query attention with a 2:1 ratio of query heads to KV heads (8 query heads sharing 4 KV heads), inherited from the PR #1855 stack.
parameters: {"ratio":"2:1"}
XSA
Applies XSA in all 11 layers.
parameters: {"layers":11}
Partial RoPE
Uses partial RoPE plus layerwise LN scale.
parameters: null
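A hedged sketch of partial RoPE, which rotates only a leading fraction of each head's dimensions and passes the rest through unchanged; the rotary fraction is not reported in the card, so 0.5 below is a placeholder:

```python
import torch

def partial_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                 rotary_frac: float = 0.5) -> torch.Tensor:
    """Rotate only the first `rotary_frac` of each head's dims; pass the rest through.

    x:        (batch, heads, seq, head_dim)
    cos, sin: (seq, rot_dim // 2) precomputed rotary tables
    """
    rot_dim = int(x.size(-1) * rotary_frac)
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]
    x1, x2 = x_rot.chunk(2, dim=-1)
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```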
LeakyReLU
Uses a squared LeakyReLU activation with negative slope 0.5 (LeakyReLU(x, 0.5)^2) as the fused MLP activation.
parameters: {"slope":0.5}
SmearGate
Inherits the sparse attention gate / SmearGate components from the PR #1855 stack.
parameters: null
Quantization
GPTQ
bits: null
scope: inherited block weights
LQER
bits: 4
scope: block weights
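A rough sketch of the LQER idea for 4-bit block weights: quantize the weight and store a low-rank factorization of the quantization error. The rank, the per-channel scaling, and the round-to-nearest quantizer below are assumptions (the card pairs LQER with GPTQ for the actual quantization):

```python
import torch

def lqer_factorize(w: torch.Tensor, bits: int = 4, rank: int = 32):
    """Quantize a weight matrix and keep a low-rank correction of the error.

    Returns (w_q, scale, A, B) with W ~= w_q * scale + A @ B.
    """
    qmax = 2 ** (bits - 1) - 1                       # symmetric int4 range: [-8, 7]
    scale = w.abs().amax(dim=1, keepdim=True) / qmax # per-output-channel scale
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    err = w - w_q * scale                            # quantization error
    u, s, vh = torch.linalg.svd(err, full_matrices=False)
    A = u[:, :rank] * s[:rank]                       # low-rank factors of the error
    B = vh[:rank, :]
    return w_q.to(torch.int8), scale, A, B
```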
Test-Time Training
LoRA TTT
parameters: {"rank":80}
Regularization
layerwise LN scale
parameters: null
weight decay
parameters: {"value":0.5}
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85,"warmup_steps":20}
Optimizer
Muon
weight_decay: 0.5
momentum: 0.9
other_params: {"beta2":0.99,"matrix_lr":0.026,"min_lr":0.1}
Novel Contributions
- Pair-geometric value projection replaces the dense attention value matrix W_v.
- Collapsed pair-geometric rule reduces to learned per-dimension coefficients on the two hidden halves.
- Maintains the accepted PR #1855 stack while altering only the value projection path.
- Reports a smaller artifact size after removing the stored dense W_v matrix (a rough parameter count follows below).
- Includes an absolute-difference variant, though it was not used in the submitted runs.
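A rough parameter-count illustration of the artifact saving claimed above, assuming a full d_model x d_model dense W_v in fp32; with the 4 KV heads reported in the architecture, the dense value matrix, and hence the saving, would be roughly half this:

```python
d_model = 512
dense_wv = d_model * d_model        # 262,144 params (~1.05 MB in fp32) for a full W_v
pair_geo = 2 * (d_model // 2)       # 512 per-dimension coefficients (~2 KB in fp32)
```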