PR #827
openRecord: LeakyReLU² + XSA4 + LN Scale + Partial RoPE — val_bpb 1.3999
by Programmerryoki
val_bpb
1.3999
Architecture
Transformer
Optimizer
—
Artifact Size
~13.5 MB
Training Techniques
Quantization
GPTQ-lite
bits: 6
scope: all weights
Architecture
XSA
Exclusive self-attention applied to the last 4 layers; each token's own-value contribution is subtracted from the attention output so tokens attend more to context.
parameters: {"layers":4}
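A minimal sketch of the idea, assuming the simplest reading of "subtracts self-value": compute standard causal attention, then remove each token's own attention-weighted value. The exact formulation in the PR may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def xsa(q, k, v):
    """Exclusive self-attention: causal attention minus each token's
    own-value term a_ii * v_i, so the output is driven by context tokens.
    (Assumed form; the record only says "subtracts self-value".)"""
    T, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    # causal mask: position i may not attend to j > i
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    a = softmax(scores, axis=-1)
    out = a @ v
    return out - a.diagonal()[:, None] * v
```

Note that the first token attends only to itself under the causal mask, so its XSA output is exactly zero.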
Partial RoPE
Rotary position encoding applied only to part of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
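A sketch of partial RoPE with the record's 16-of-64 split, assuming the rotated dimensions are the leading ones (the record does not say which dimensions are rotated):

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16):
    """Apply rotary position embedding to the first `rot_dims` of the
    head dimension, leaving the remaining dims untouched.
    x: (T, head_dim), pos: (T,) integer positions."""
    half = rot_dims // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    theta = pos[:, None] * freqs[None, :]          # (T, half)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., :half], x[..., half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)
```

The unrotated 48 dimensions pass through unchanged, which leaves the model position-agnostic channels to work with.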
BigramHash
Bigram hashing component used in the model.
parameters: {"buckets":1536}
SmearGate
SmearGate enabled in the architecture.
parameters: null
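The record gives no parameters or description for SmearGate, so this whole form is an assumption: a learned sigmoid gate that "smears" each position with the previous position's activation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smear_gate(x, w):
    """Blend each position with the previous one via a learned gate:
    out_t = x_t + sigmoid(x_t @ w) * x_{t-1}.
    (Entirely an assumed form; the record only names the component.)"""
    g = sigmoid(x @ w)[:, None]                      # scalar gate per position
    prev = np.vstack([np.zeros_like(x[:1]), x[:-1]]) # shift sequence right by one
    return x + g * prev
```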
U-Net Skips
U-Net style skip connections enabled.
parameters: null
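A sketch of U-Net skips over a layer stack, assuming the common additive form (outputs of the first half are saved and added back at the mirrored layers of the second half); the record gives no parameters.

```python
import numpy as np

def unet_forward(x, layers):
    """Run a layer stack with U-Net style skips: layer i < n/2 pushes its
    output, layer i >= n/2 adds back the mirrored skip before running.
    (Additive combination is an assumption.)"""
    n = len(layers)
    half = n // 2
    skips = []
    for i, layer in enumerate(layers):
        if i >= half and skips:
            x = x + skips.pop()   # last-pushed skip mirrors this layer
        x = layer(x)
        if i < half:
            skips.append(x)
    return x
```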
MLP3x
MLP widened to 2× with LeakyReLU(0.5)^2 activation.
parameters: {"multiplier":2}
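A sketch of the widened MLP with the squared-LeakyReLU activation. Squaring LeakyReLU(0.5) literally gives x² on the positive side and (0.5x)² on the negative side, so negative inputs keep a nonzero gradient, unlike relu(x)²; the layer widths here are illustrative.

```python
import numpy as np

def sq_leaky_relu(x, slope=0.5):
    """LeakyReLU(slope) followed by squaring: x**2 for x >= 0,
    (slope*x)**2 for x < 0, so the negative side keeps gradient
    2*slope**2*x instead of relu(x)**2's flat zero."""
    return np.square(np.where(x >= 0.0, x, slope * x))

def mlp(x, w_in, w_out):
    """2x-wide MLP block (hidden dim = 2 * model dim, per the record's
    multiplier) with the squared-LeakyReLU activation."""
    return sq_leaky_relu(x @ w_in) @ w_out
```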
Weight Averaging
EMA
parameters: {"decay":0.997}
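Weight EMA with the record's decay of 0.997 can be sketched as a shadow copy updated after each step (parameters shown as a dict of numpy arrays for illustration):

```python
import numpy as np

class EMA:
    """Exponential moving average of model weights:
    shadow = decay * shadow + (1 - decay) * current."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        for k, v in params.items():
            self.shadow[k] = self.decay * self.shadow[k] + (1 - self.decay) * v
```

Evaluation then runs with `shadow` in place of the live weights.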
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
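A sketch of the layerwise LN scale: each layer's normalized output is multiplied by 1/sqrt(layer+1), damping deeper layers' contributions. Where exactly the scale is applied is an assumption; the record gives only the formula.

```python
import numpy as np

def scaled_layernorm(x, layer_idx, eps=1e-5):
    """LayerNorm whose output is scaled by 1/sqrt(layer_idx + 1),
    so layer 0 is unscaled and layer 3 is halved."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) / np.sqrt(layer_idx + 1)
```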
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
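A warmdown schedule with the record's 3500 steps, assuming the usual form (constant LR, then linear decay to zero over the final warmdown window):

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3500):
    """Flat LR, then linear 'warmdown' to zero over the last
    warmdown_steps. (Linear shape is an assumption; the record gives
    only the step count.)"""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```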
Compression
zstd
level: 22
Other
other
LeakyReLU(0.5)^2 activation replacing relu(x)^2 to preserve negative gradient flow and reduce dead neurons.
parameters: null
other
GPTQ-lite clip search over multiple clip percentiles per weight row to minimize reconstruction MSE.
parameters: {"clip_percentiles":[0.9999,0.99995,0.99999,0.999995,1]}
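The clip search loop matches the record's percentile list; the symmetric int6 quantization details around it are assumptions. For each weight row, try each clip percentile, quantize, and keep the clip with the lowest reconstruction MSE:

```python
import numpy as np

def quantize_row(row, bits=6,
                 clip_percentiles=(0.9999, 0.99995, 0.99999, 0.999995, 1.0)):
    """Per-row quantization with a search over clip percentiles,
    keeping the clip that minimizes reconstruction MSE.
    (Symmetric int6 with levels ±31 is an assumed detail.)"""
    levels = 2 ** (bits - 1) - 1                     # 31 for 6 bits
    best = None
    for p in clip_percentiles:
        clip = np.quantile(np.abs(row), p)
        scale = clip / levels if clip > 0 else 1.0
        q = np.clip(np.round(row / scale), -levels, levels)
        mse = np.mean((q * scale - row) ** 2)
        if best is None or mse < best[0]:
            best = (mse, q.astype(np.int8), scale)
    return best[1], best[2]
```

Since p = 1 (no clipping) is in the candidate set, the search can never do worse than plain max-abs scaling.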
Novel Contributions
- LeakyReLU(0.5)^2 activation
- Exclusive self-attention (XSA) in the last 4 layers
- Layerwise LN scaling by 1/sqrt(layer+1)
- Partial RoPE using 16 of 64 head dimensions
- GPTQ-lite clip search for quantization
- Int6 QAT with zstd-22 compression