PR #632

open

non-record: 10-Layer + BigramHash + SWA + Attention-Residuals

by AtomChen0425
val_bpb
1.2767
Architecture
Transformer
Optimizer
Artifact Size
15858193 bytes

Training Techniques

Quantization
mixed int5/int6
bits: null
scope: MLP weights (Int5), attention weights (Int6)
Architecture
BigramHash
Increased BigramHash buckets to 10,240 to reduce hash collisions for consecutive token pairs
parameters: {"buckets":10240}
Attention-Residuals
Removes standard layer-by-layer residual connections; instead, the model keeps a rolling history of previous layer outputs and combines them using Softmax-weighted scores computed from a learned query over this history
parameters: null
Weight Averaging
SWA
parameters: {"type":"highly selective"}
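The BigramHash bucket expansion above can be sketched as follows. This is a hypothetical illustration of hashing consecutive token pairs into 10,240 buckets; the multiplier constant and function names are assumptions, not the PR's actual implementation.

```python
# Illustrative sketch of BigramHash bucketing: map each consecutive
# token pair to one of 10,240 buckets. More buckets means fewer
# distinct bigrams colliding into the same bucket.
NUM_BUCKETS = 10_240  # value from the PR's parameters

def bigram_bucket(prev_token: int, cur_token: int,
                  num_buckets: int = NUM_BUCKETS) -> int:
    """Hash a (prev, cur) token pair into a fixed bucket index.
    The odd multiplier is an arbitrary illustrative choice."""
    return (prev_token * 1_000_003 + cur_token) % num_buckets

def bigram_buckets(tokens: list[int]) -> list[int]:
    """Bucket index for every consecutive pair in a token sequence."""
    return [bigram_bucket(a, b) for a, b in zip(tokens, tokens[1:])]
```

Any deterministic pair hash works here; the point of the change is only that a larger modulus spreads the same set of bigrams over more buckets.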

Novel Contributions

  • Mixed Int5/Int6 quantization allowing addition of a 10th transformer layer
  • Expanded BigramHash to 10,240 buckets to reduce hash collisions
  • Highly selective Stochastic Weight Averaging (SWA) strategy
  • Attention-Residuals mechanism replacing standard residual connections with a rolling history and learned query Softmax weighting
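The Attention-Residuals contribution above can be sketched as a small NumPy example. This is a minimal interpretation of the description (rolling history of layer outputs, learned query, Softmax-weighted combination); the shapes, per-tensor vs per-position scoring, and how the mixture is fed forward are assumptions, not the PR's exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_residual(history, query):
    """Replace the single skip connection with a softmax-weighted
    mixture over the rolling history of previous layer outputs.
    history: (L, d) stack of earlier layer outputs
    query:   (d,) learned query vector (assumed per-layer here)
    returns: (d,) convex combination of the stored outputs"""
    scores = history @ query           # (L,) one score per stored output
    weights = softmax(scores)          # convex weights over the history
    return weights @ history           # (d,) weighted sum

rng = np.random.default_rng(0)
hist = rng.standard_normal((4, 8))     # e.g. outputs of 4 earlier layers, d=8
q = rng.standard_normal(8)
mix = attention_residual(hist, q)      # fed into the next layer in place
                                       # of the usual x + f(x) residual
```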
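The mixed Int5/Int6 quantization can be illustrated with a simple symmetric per-tensor scheme: 5-bit integers for MLP weights, 6-bit for attention weights. The rounding and scaling details here are assumptions for illustration; the PR does not specify its exact quantizer.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers.
    Illustrative only; the PR's exact int5/int6 scheme is not given."""
    qmax = 2 ** (bits - 1) - 1                    # 15 for int5, 31 for int6
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Mixed precision: MLP weights at 5 bits, attention weights at 6 bits,
# shrinking the artifact enough to fit an extra transformer layer.
mlp_w = np.random.default_rng(1).standard_normal((4, 4)).astype(np.float32)
q5, s5 = quantize_symmetric(mlp_w, bits=5)
recon = dequantize(q5, s5)
```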