val_bpb: 1.0702
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.31 MB
Training Techniques
Architecture
Value Residual
Replaces the dense attention value projection with a pair-geometric value projection over the normalized hidden state, using paired halves and signed/summed features.
parameters: {"layers":11,"dimensions":512,"heads":8,"kv_heads":4}
GQA
Uses grouped query attention with a 2:1 ratio of query heads to KV heads (8 query heads sharing 4 KV heads), inherited from the PR #1855 stack.
parameters: {"ratio":"2:1"}
XSA
Applies XSA in all 11 layers.
parameters: {"layers":11}
Partial RoPE
Uses partial RoPE plus layerwise LN scale.
parameters: null
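A hedged sketch of partial RoPE, which rotates only a leading fraction of each head's dimensions and passes the rest through unchanged; the rotary fraction is not reported in the card, so 0.5 below is a placeholder:

```python
import torch

def partial_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                 rotary_frac: float = 0.5) -> torch.Tensor:
    """Rotate only the first `rotary_frac` of each head's dims; pass the rest through.

    x:        (batch, heads, seq, head_dim)
    cos, sin: (seq, rot_dim // 2) precomputed rotary tables
    """
    rot_dim = int(x.size(-1) * rotary_frac)
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]
    x1, x2 = x_rot.chunk(2, dim=-1)
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```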
LeakyReLU
Uses a squared LeakyReLU activation with negative slope 0.5 (LeakyReLU(x, 0.5)^2) as the fused MLP activation.
parameters: {"slope":0.5}
SmearGate
Inherits the sparse attention gate / SmearGate components from the PR #1855 stack.
parameters: null
Quantization
GPTQ
bits: null
scope: inherited block weights
LQER
bits: 4
scope: block weights
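A rough sketch of the LQER idea for 4-bit block weights: quantize the weight and store a low-rank factorization of the quantization error. The rank, the per-channel scaling, and the round-to-nearest quantizer below are assumptions (the card pairs LQER with GPTQ for the actual quantization):

```python
import torch

def lqer_factorize(w: torch.Tensor, bits: int = 4, rank: int = 32):
    """Quantize a weight matrix and keep a low-rank correction of the error.

    Returns (w_q, scale, A, B) with W ~= w_q * scale + A @ B.
    """
    qmax = 2 ** (bits - 1) - 1                       # symmetric int4 range: [-8, 7]
    scale = w.abs().amax(dim=1, keepdim=True) / qmax # per-output-channel scale
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    err = w - w_q * scale                            # quantization error
    u, s, vh = torch.linalg.svd(err, full_matrices=False)
    A = u[:, :rank] * s[:rank]                       # low-rank factors of the error
    B = vh[:rank, :]
    return w_q.to(torch.int8), scale, A, B
```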
Test-Time Training
LoRA TTT
parameters: {"rank":80}
Regularization
layerwise LN scale
parameters: null
weight decay
parameters: {"value":0.5}
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85,"warmup_steps":20}
Optimizer
Muon
weight_decay: 0.5
momentum: 0.9
other_params: {"beta2":0.99,"matrix_lr":0.026,"min_lr":0.1}
Novel Contributions
- Pair-geometric value projection replaces the dense attention value matrix W_v.
- Collapsed pair-geometric rule reduces to learned per-dimension coefficients on the two hidden halves.
- Maintains the accepted PR #1855 stack while altering only the value projection path.
- Reports a smaller artifact size after removing the stored dense W_v matrix (a rough parameter count follows below).
- Includes an absolute-difference variant, though it was not used in the submitted runs.
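A rough parameter-count illustration of the artifact saving claimed above, assuming a full d_model x d_model dense W_v in fp32; with the 4 KV heads reported in the architecture, the dense value matrix, and hence the saving, would be roughly half this:

```python
d_model = 512
dense_wv = d_model * d_model        # 262,144 params (~1.05 MB in fp32) for a full W_v
pair_geo = 2 * (d_model // 2)       # 512 per-dimension coefficients (~2 KB in fp32)
```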