PR #1705

closed

Record: K_KVShare_Wider FLA — val_bpb 1.0339 (3-seed mean)

by genji0306
val_bpb
1.0339
Architecture
Transformer
Optimizer
Artifact Size
15,870,797 bytes

Training Techniques

Architecture
GatedDeltaNet / Flash Linear Attention
K_KVShare_Wider variant of the FLA family: KV sharing is used to widen the model rather than deepen it.
parameters: {"layers":10,"model_dim":544}
Quantization
late QAT
bits: 6
scope: artifact path
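The 6-bit late-QAT entry can be illustrated with a minimal sketch. It assumes symmetric per-tensor quantization to signed codes in [-32, 31] and a hypothetical `start_fraction` that switches fake quantization on near the end of training; the PR summary does not state the actual scheme, granularity, or schedule, so every name below is illustrative.

```python
def quantize_int6(values):
    """Symmetric per-tensor quantization to signed 6-bit codes (assumed scheme)."""
    qmax = 31  # largest positive code; the signed 6-bit range is [-32, 31]
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-32, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def fake_quant_int6(values):
    """Quantize then dequantize, so the forward pass sees int6-rounded weights."""
    codes, scale = quantize_int6(values)
    return [c * scale for c in codes]


def qat_active(step, total_steps, start_fraction=0.9):
    """'Late' QAT: enable fake quantization only for the final part of training.

    start_fraction is a hypothetical knob, not a value taken from the PR.
    """
    return step >= int(start_fraction * total_steps)
```

In a real training loop the rounding step would typically be paired with a straight-through estimator so gradients still reach the full-precision weights; that part is omitted here.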
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_interval_steps":50}
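As a rough illustration of the two averaging parameters above, the sketch below applies an EMA with decay 0.997 on every step and folds a snapshot into an SWA running mean every 50 steps. Weights are plain Python lists for brevity. How the PR actually combines the EMA and SWA weights is not stated, so the two are kept separate here, and the names `ema_update` and `SWAAverager` are hypothetical.

```python
def ema_update(ema, weights, decay=0.997):
    """One exponential-moving-average step: ema <- decay*ema + (1-decay)*weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


class SWAAverager:
    """Running mean of weight snapshots taken every `interval` steps."""

    def __init__(self, interval=50):
        self.interval = interval
        self.count = 0
        self.mean = None

    def maybe_update(self, step, weights):
        """Fold in a snapshot if `step` is a multiple of the interval."""
        if step % self.interval != 0:
            return False
        self.count += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            # Incremental mean: m <- m + (w - m) / n
            self.mean = [m + (w - m) / self.count
                         for m, w in zip(self.mean, weights)]
        return True
```

A training loop would call `ema_update` and `maybe_update` after each optimizer step, then evaluate or export whichever averaged weights it has chosen to use.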
Compression
zstd
level: 22
Other
other
SP8192 tokenizer with SentencePiece lookup-table (LUT) handling in the scorer path, including corrected byte accounting for leading-space, byte, and unused tokens.
parameters: null
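To make the byte-accounting point concrete, here is a minimal sketch of how a per-token byte lookup might treat the three token types named above, and how a bits-per-byte figure follows from it. The rules are assumptions rather than the PR's exact semantics: the `▁` marker counts as one space byte, a byte-fallback piece counts as one byte, and an unused piece counts as zero. The `kind` labels and function names are hypothetical.

```python
import math

SPACE_MARKER = "\u2581"  # SentencePiece's visible whitespace marker


def piece_num_bytes(piece, kind="normal"):
    """Bytes of original text that one token accounts for (assumed rules)."""
    if kind == "unused":
        return 0  # never corresponds to text in the scored data
    if kind == "byte":
        return 1  # byte-fallback pieces such as "<0xE2>" stand for one raw byte
    # Normal piece: the marker represents a single space, not its 3-byte UTF-8 form.
    return len(piece.replace(SPACE_MARKER, " ").encode("utf-8"))


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)
```

Counting the marker by its own UTF-8 length would overstate the byte total for leading-space tokens and therefore understate bits per byte, which is the kind of error a LUT like this is meant to avoid.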

Novel Contributions

  • K_KVShare_Wider FLA family reproduction on 8xH100 SXM
  • KV sharing used to buy width rather than depth
  • EMA + SWA + late QAT int6 artifact path
  • SP8192 tokenizer setup
  • Corrected SentencePiece byte-accounting bug in scorer path