PR #2068
Record: PR1797Base + HadamardRotation + ValueResidualLearning - val_bpb 1.06172
by jayaram1125 · View on GitHub
val_bpb
1.0617
Architecture
Transformer
Optimizer
—
Artifact Size
~15.97 MB
Training Techniques
Architecture
Value Residual
Blends the value tensor from layer 0 into later attention layers during training using a learned per-layer mixing scalar.
parameters: {"layers":11}
SmearGate
Smear gate mechanism inherited from the PR#1797 base stack.
parameters: {"window":12}
Gated Attention
Sparse/gated attention components are enabled in the run configuration.
parameters: null
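The record does not specify which gated-attention variant is enabled, so the following is a generic, hedged sketch of one common formulation: a per-channel sigmoid gate applied to the attention output before the residual add.

```python
import torch

class GatedAttentionOutput(torch.nn.Module):
    """Generic sketch of output gating; not necessarily the PR's variant."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate_proj = torch.nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # gate computed from the block input, applied to the attention output
        return torch.sigmoid(self.gate_proj(x)) * attn_out
```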
Quantization
GPTQ
bits: 6
scope: weight matrices
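For orientation, the snippet below shows only the symmetric 6-bit integer grid that GPTQ quantizes onto, using plain round-to-nearest with one scale per output row; real GPTQ additionally applies Hessian-aware error compensation column by column, which is omitted here.

```python
import torch

def quantize_6bit_per_channel(W: torch.Tensor):
    # Round-to-nearest onto a signed 6-bit grid, one scale per output row.
    # Full GPTQ uses the same grid but corrects rounding error using
    # second-order (Hessian) information; that step is not shown here.
    qmax = 2 ** (6 - 1) - 1                                # 31
    scale = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / qmax
    Q = torch.clamp(torch.round(W / scale), -qmax - 1, qmax)
    return Q.to(torch.int8), scale                         # dequant: Q * scale
```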
Other
Hadamard Rotation
Hadamard rotation applied as a post-training quantization pre-processing step to reduce outlier columns and improve GPTQ compression.
parameters: {"seed":12648430}
Compression
brotli
level: null
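Since the record leaves the brotli level unspecified (level: null), the sketch below defaults to quality 11, brotli's maximum and a common choice for one-shot artifact packing; the file paths are placeholders.

```python
import brotli  # pip install brotli

def compress_artifact(src: str, dst: str, quality: int = 11) -> None:
    # One-shot brotli compression of a serialized model artifact.
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(brotli.compress(data, quality=quality))
```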
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Regularization
logit softcap
parameters: {"value":30}
Novel Contributions
- Hadamard rotation before GPTQ quantization to reduce outlier columns and improve int6 compression
- Value Residual Learning with learned blending of layer-0 values into later attention layers
- Combination of PR#1797 base stack with SmearGate and LQER asym components