PR #1687

open

Record: K_KVShare_Wider full-recipe FLA — val_bpb 1.04090 (3-seed mean)

by resouer
val_bpb: 1.0409
Architecture: Transformer
Optimizer: not specified
Artifact Size: 15,760,668 bytes

Training Techniques

Architecture
GatedDeltaNet
FLA / GatedDeltaNet family model with K_KVShare_Wider variant.
parameters: {"kv_sharing_stride":2,"num_swa_layers":0}
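The PR summary does not spell out how `kv_sharing_stride` is applied. One plausible reading, offered here only as an assumption, is that layers are grouped in runs of `stride` and every layer in a run reuses the key/value projection owned by the first layer of that run, which frees parameter budget that can go into width. The helper names and the layer count of 6 below are hypothetical and are not taken from the PR.

```python
def kv_owner(layer_idx, stride):
    """Return the index of the layer whose K/V projection this layer reuses.

    Assumed scheme (not confirmed by the PR): layers are grouped in
    consecutive runs of `stride`, and the first layer of each run owns
    the shared K/V projection.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return (layer_idx // stride) * stride


def kv_share_map(num_layers, stride):
    """Map every layer index to the layer that owns its K/V projection."""
    return {i: kv_owner(i, stride) for i in range(num_layers)}


def unique_kv_projections(num_layers, stride):
    """Count how many distinct K/V projections the model would store."""
    return len(set(kv_share_map(num_layers, stride).values()))


if __name__ == "__main__":
    # Hypothetical 6-layer model with the stride from the PR parameters.
    print(kv_share_map(6, 2))
    print(unique_kv_projections(6, 2))
```

Under this reading, a stride of 2 halves the number of stored K/V projections relative to a stride of 1.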
Weight Averaging
EMA + SWA
parameters: null
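The entry lists no parameters for the averaging, so the following is only a minimal sketch of the two standard update rules on a flat list of weights. The decay value, the class name `SWAAverager`, and the choice to show the two rules separately are assumptions; the PR does not say how EMA and SWA are combined.

```python
def ema_update(ema, weights, decay):
    """Exponential moving average: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


class SWAAverager:
    """Uniform running mean over the weight snapshots it is given."""

    def __init__(self):
        self.avg = None
        self.count = 0

    def update(self, weights):
        self.count += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            # Incremental mean: avg += (w - avg) / n
            self.avg = [a + (w - a) / self.count
                        for a, w in zip(self.avg, weights)]
        return self.avg


if __name__ == "__main__":
    ema = [0.0, 0.0]
    swa = SWAAverager()
    # Hypothetical weight snapshots from three training steps.
    for snapshot in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]):
        ema = ema_update(ema, snapshot, decay=0.9)
        swa.update(snapshot)
    print(ema, swa.avg)
```

EMA weights recent snapshots more heavily, while SWA weights every collected snapshot equally.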
Quantization
late QAT
bits: 6
scope: artifact path
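As an illustration of what 6-bit late QAT can look like, the sketch below uses symmetric per-tensor fake quantization and switches it on only for the final part of training. The per-tensor scaling, the clamp range of [-31, 31], and the `late_fraction` parameter are assumptions; the PR states only "late QAT", 6 bits, and the artifact-path scope.

```python
def fake_quant(values, bits=6):
    """Quantize to signed `bits`-bit levels and dequantize back to floats.

    Assumes symmetric per-tensor scaling, so 6 bits gives integer
    levels in [-31, 31]. Returns (dequantized_values, scale).
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return list(values), 1.0
    scale = max_abs / qmax
    out = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))  # clamp to the int6 range
        out.append(q * scale)
    return out, scale


def qat_enabled(step, total_steps, late_fraction=0.1):
    """True once training enters the final `late_fraction` of steps (hypothetical schedule)."""
    return step >= int(total_steps * (1.0 - late_fraction))


if __name__ == "__main__":
    weights = [1.0, -0.5, 0.25, 0.0]
    dequantized, scale = fake_quant(weights)
    print(dequantized, scale)
    print(qat_enabled(100, 1000), qat_enabled(950, 1000))
```

With round-to-nearest, each dequantized value differs from its input by at most half a quantization step.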

Novel Contributions

  • K_KVShare_Wider variant using KV sharing to trade depth for width
  • Fuller upstream-style FLA / GatedDeltaNet recipe
  • EMA + SWA with late QAT and int6 artifact path
  • Multi-seed record with 3-seed mean validation BPB