PR #1687
open
Record: K_KVShare_Wider full-recipe FLA — val_bpb 1.04090 (3-seed mean)
by resouer
val_bpb
1.0409
Architecture
Transformer
Optimizer
—
Artifact Size
15,760,668 bytes
Training Techniques
Architecture
GatedDeltaNet
Model from the FLA / GatedDeltaNet family, using the K_KVShare_Wider variant.
parameters: {"kv_sharing_stride":2,"num_swa_layers":0}
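The card does not say how `kv_sharing_stride` is applied. One plausible reading, sketched below, is that consecutive layers are grouped in blocks of `stride` and every layer in a block reuses the K/V projections of the first layer in that block. The helper names `kv_owner` and `kv_sharing_map` and the 6-layer depth are hypothetical and not taken from the PR.

```python
def kv_owner(layer_idx: int, stride: int) -> int:
    """Return the layer whose K/V projections `layer_idx` reuses.

    Assumption: layers are grouped in consecutive blocks of `stride`,
    and the first layer of each block owns the shared K/V weights.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return layer_idx - (layer_idx % stride)


def kv_sharing_map(num_layers: int, stride: int) -> dict[int, int]:
    """Map every layer index to the index of its K/V owner."""
    return {i: kv_owner(i, stride) for i in range(num_layers)}


# Hypothetical depth of 6 layers with the stride listed in the card.
mapping = kv_sharing_map(num_layers=6, stride=2)
unique_kv_sets = len(set(mapping.values()))
```

Under this reading, a stride of 2 halves the number of distinct K/V projection sets, which frees parameter budget that a wider model could spend elsewhere.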
Weight Averaging
EMA + SWA
parameters: null
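Since no averaging parameters are listed, the following is only a generic sketch of the two update rules the names refer to: an exponential moving average (EMA) of the weights and a stochastic weight averaging (SWA) running mean over snapshots. Weights are flattened to plain lists of floats for illustration. The function names and the `decay` argument are assumptions, and how the PR combines the two averages is not stated.

```python
def ema_update(ema: list[float], weights: list[float], decay: float) -> list[float]:
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


def swa_update(swa: list[float], weights: list[float], num_averaged: int) -> list[float]:
    """Fold one more snapshot into a running mean of `num_averaged` snapshots."""
    return [s + (w - s) / (num_averaged + 1) for s, w in zip(swa, weights)]
```

EMA weights recent steps more heavily, while SWA gives every collected snapshot equal weight.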
Quantization
late QAT
bits: 6
scope: artifact path
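As a rough illustration of what 6-bit quantization-aware training involves, the sketch below applies symmetric per-tensor fake quantization (round to one of the integer levels in [-31, 31], then scale back to floats) and gates it on training progress. The per-tensor scaling scheme, the function names, and the `start_frac` threshold are assumptions; the card only states "late QAT" at 6 bits.

```python
def fake_quantize(weights: list[float], bits: int = 6) -> list[float]:
    """Quantize to signed `bits`-bit levels and dequantize back to floats."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)  # nothing to scale
    scale = max_abs / qmax
    out = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # clamp to the integer grid
        out.append(q * scale)
    return out


def qat_active(step: int, total_steps: int, start_frac: float = 0.9) -> bool:
    """Hypothetical 'late' schedule: enable QAT only after `start_frac` of training."""
    return step >= int(start_frac * total_steps)
```

In an actual training loop the forward pass would use the fake-quantized weights while gradients pass through to the full-precision copy, typically via a straight-through estimator.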
Novel Contributions
- K_KVShare_Wider variant using KV sharing to trade depth for width
- Fuller upstream-style FLA / GatedDeltaNet recipe
- EMA + SWA with late QAT and int6 artifact path
- Multi-seed record: validation BPB reported as a 3-seed mean
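For reference, bits per byte (BPB) is conventionally the total cross-entropy over the validation set, converted from nats to bits and divided by the number of bytes of text. The sketch below shows that conversion and the averaging over seeds. The function names are hypothetical, and the per-seed values behind the reported mean are not given in the card.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


def mean_bpb(per_seed_bpb: list[float]) -> float:
    """Arithmetic mean of the BPB values from independent seeds."""
    if not per_seed_bpb:
        raise ValueError("need at least one seed")
    return sum(per_seed_bpb) / len(per_seed_bpb)
```

Normalizing by bytes rather than tokens keeps the metric comparable across models that use different tokenizers.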