PR #632

open

non-record: 10-Layer + BigramHash + SWA + Attention-Residuals

by AtomChen0425
val_bpb
1.2767
Architecture
Transformer
Optimizer
Artifact Size
15858193 bytes

Training Techniques

Quantization
mixed int5/int6
bits: null
scope: MLP weights (Int5), attention weights (Int6)
Architecture
BigramHash
Increased BigramHash buckets to 10,240 to reduce hash collisions for consecutive token pairs
parameters: {"buckets":10240}
Attention-Residuals
Removes standard layer-by-layer residual connections; instead, the model keeps a rolling history of previous layer outputs and combines them using Softmax-weighted scores computed from a learned query over this history
parameters: null
Weight Averaging
SWA
parameters: {"type":"highly selective"}
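The BigramHash bucket expansion above can be sketched as follows. This is a hypothetical illustration of hashing consecutive token pairs into 10,240 buckets; the multiplier constant and function names are assumptions, not the PR's actual implementation.

```python
# Illustrative sketch of BigramHash bucketing: map each consecutive
# token pair to one of 10,240 buckets. More buckets means fewer
# distinct bigrams colliding into the same bucket.
NUM_BUCKETS = 10_240  # value from the PR's parameters

def bigram_bucket(prev_token: int, cur_token: int,
                  num_buckets: int = NUM_BUCKETS) -> int:
    """Hash a (prev, cur) token pair into a fixed bucket index.
    The odd multiplier is an arbitrary illustrative choice."""
    return (prev_token * 1_000_003 + cur_token) % num_buckets

def bigram_buckets(tokens: list[int]) -> list[int]:
    """Bucket index for every consecutive pair in a token sequence."""
    return [bigram_bucket(a, b) for a, b in zip(tokens, tokens[1:])]
```

Any deterministic pair hash works here; the point of the change is only that a larger modulus spreads the same set of bigrams over more buckets.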

Novel Contributions

  • Mixed Int5/Int6 quantization allowing addition of a 10th transformer layer
  • Expanded BigramHash to 10,240 buckets to reduce hash collisions
  • Highly selective Stochastic Weight Averaging (SWA) strategy
  • Attention-Residuals mechanism replacing standard residual connections with a rolling history and learned query Softmax weighting
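The Attention-Residuals contribution above can be sketched as a small NumPy example. This is a minimal interpretation of the description (rolling history of layer outputs, learned query, Softmax-weighted combination); the shapes, per-tensor vs per-position scoring, and how the mixture is fed forward are assumptions, not the PR's exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_residual(history, query):
    """Replace the single skip connection with a softmax-weighted
    mixture over the rolling history of previous layer outputs.
    history: (L, d) stack of earlier layer outputs
    query:   (d,) learned query vector (assumed per-layer here)
    returns: (d,) convex combination of the stored outputs"""
    scores = history @ query           # (L,) one score per stored output
    weights = softmax(scores)          # convex weights over the history
    return weights @ history           # (d,) weighted sum

rng = np.random.default_rng(0)
hist = rng.standard_normal((4, 8))     # e.g. outputs of 4 earlier layers, d=8
q = rng.standard_normal(8)
mix = attention_residual(hist, q)      # fed into the next layer in place
                                       # of the usual x + f(x) residual
```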
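The mixed Int5/Int6 quantization can be illustrated with a simple symmetric per-tensor scheme: 5-bit integers for MLP weights, 6-bit for attention weights. The rounding and scaling details here are assumptions for illustration; the PR does not specify its exact quantizer.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers.
    Illustrative only; the PR's exact int5/int6 scheme is not given."""
    qmax = 2 ** (bits - 1) - 1                    # 15 for int5, 31 for int6
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Mixed precision: MLP weights at 5 bits, attention weights at 6 bits,
# shrinking the artifact enough to fit an extra transformer layer.
mlp_w = np.random.default_rng(1).standard_normal((4, 4)).astype(np.float32)
q5, s5 = quantize_symmetric(mlp_w, bits=5)
recon = dequantize(q5, s5)
```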