val_bpb: 1.2767
Architecture: Transformer
Optimizer: —
Artifact Size: 15858193 bytes (≈15.1 MiB)
Training Techniques

Quantization: mixed int5/int6
  bits: null
  scope: MLP weights (Int5), attention weights (Int6)
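The mixed-precision scheme stores MLP weights at 5 bits and attention weights at 6 bits; per the contributions list, the saved bytes funded a 10th transformer layer. The report does not specify the quantization scheme, so the sketch below assumes simple symmetric per-tensor quantization; the function names and example weights are illustrative.

```python
def quantize(weights, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers.
    (Assumed scheme; the report only states the bit widths.)"""
    qmax = 2 ** (bits - 1) - 1                 # 15 for int5, 31 for int6
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from quantized values."""
    return [v * scale for v in q]

# Hypothetical mixed-precision policy matching the card's scope field:
mlp_w = [0.31, -0.07, 0.92, -0.55]
attn_w = [0.12, -0.88, 0.44, 0.03]
mlp_q, mlp_scale = quantize(mlp_w, bits=5)     # MLP weights -> int5
attn_q, attn_scale = quantize(attn_w, bits=6)  # attention weights -> int6
```

Each weight then costs 5 or 6 bits instead of 8 (int8) or 32 (float32), which is where the artifact-size headroom for the extra layer would come from.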
Architecture: BigramHash
  Increased the BigramHash bucket count to 10,240 to reduce hash collisions for consecutive token pairs.
  parameters: {"buckets": 10240}
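A BigramHash maps each consecutive token pair to a bucket in an embedding table; with more buckets, fewer distinct pairs share a slot. A minimal sketch, assuming a multiplicative mixing hash (the constants below are illustrative, not the model's actual hash function):

```python
NUM_BUCKETS = 10240  # raised to this size to cut pair collisions

def bigram_bucket(prev_token, cur_token, num_buckets=NUM_BUCKETS):
    """Map a consecutive token pair (prev, cur) to a bucket index.
    Mixing constants are an assumption for illustration."""
    h = (prev_token * 1000003 + cur_token) * 2654435761
    return (h & 0xFFFFFFFF) % num_buckets
```

By a birthday-style argument, the expected number of colliding pairs among N active bigrams scales roughly with N^2 / buckets, so growing the table directly reduces how often unrelated pairs share an embedding.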
Attention-Residuals
  Removes standard layer-by-layer residual connections; instead, the model keeps a rolling history of previous layer outputs and uses a learned query with softmax-weighted scoring over this history.
  parameters: null
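Instead of adding only the previous layer's output (the usual x + f(x) residual), every earlier layer output is scored against a learned query and combined by softmax weights. A minimal sketch, assuming dot-product scoring and a per-layer learned query vector (both assumptions; the card does not give the scoring function):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_residual(history, query):
    """Softmax-weighted mix over all previous layer outputs.

    history: list of layer-output vectors (oldest first)
    query:   learned vector; dot-product scoring is an assumption
    Returns a convex combination of the history vectors, used in
    place of the standard residual stream input to the next layer.
    """
    scores = [sum(q * h for q, h in zip(query, vec)) for vec in history]
    weights = softmax(scores)
    dim = len(history[0])
    return [sum(w * vec[i] for w, vec in zip(weights, history))
            for i in range(dim)]
```

In use, each layer would append its output to `history` before the next layer reads `attention_residual(history, query)`; how the current layer's own output is merged in is not specified by the card.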
Weight Averaging: SWA
  parameters: {"type": "highly selective"}
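Standard SWA averages all checkpoints from the tail of training; "highly selective" suggests averaging only checkpoints that pass some criterion. A minimal sketch, assuming the criterion is keeping the top fraction of checkpoints by validation loss (the selection rule, `keep_frac`, and the dict-of-lists checkpoint format are all assumptions):

```python
def selective_swa(checkpoints, val_losses, keep_frac=0.25):
    """Average only the best checkpoints by validation loss.

    checkpoints: list of dicts mapping param name -> list of floats
    val_losses:  one validation loss per checkpoint
    keep_frac:   fraction of checkpoints to include (assumed knob)
    """
    k = max(1, int(len(checkpoints) * keep_frac))
    best = sorted(range(len(checkpoints)), key=lambda i: val_losses[i])[:k]
    return {
        name: [sum(checkpoints[i][name][j] for i in best) / k
               for j in range(len(checkpoints[0][name]))]
        for name in checkpoints[0]
    }
```

Compared with vanilla SWA, filtering out poor checkpoints keeps the average from being dragged toward bad regions of the loss surface, at the cost of averaging fewer points.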
Novel Contributions
- Mixed Int5/Int6 quantization allowing addition of a 10th transformer layer
- Expanded BigramHash to 10,240 buckets to reduce hash collisions
- Highly selective Stochastic Weight Averaging (SWA) strategy
- Attention-Residuals mechanism replacing standard residual connections with a rolling history of layer outputs and learned-query softmax weighting