PR #1791

open

Record: K_KVShare_Wider FLA — val_bpb 1.0339 (3-seed mean)

by genji0306
val_bpb: 1.0339
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 15,883,866 bytes

Training Techniques

Architecture
GatedDeltaNet
FLA/GatedDeltaNet model family using K_KVShare_Wider configuration with KV sharing to trade depth for width.
parameters: {"layers":10,"model_dim":544,"heads":8,"head_dim":64,"kv_sharing_stride":2}
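As a rough illustration of what kv_sharing_stride=2 could mean across 10 layers, the sketch below maps each layer to the layer whose key/value projection it reuses. The grouping rule (consecutive blocks of `stride` layers, with the first layer in each block owning the projection) and the helper names `kv_owner` and `kv_sharing_map` are assumptions for illustration; the PR summary does not state how sharing is wired.

```python
# Hypothetical sketch: which layer owns the K/V projection that each layer uses.
# Assumption: layers are grouped in consecutive blocks of `stride`, and the
# first layer of each block owns the shared projection.
def kv_owner(layer: int, stride: int) -> int:
    return layer - (layer % stride)


def kv_sharing_map(num_layers: int, stride: int) -> dict[int, int]:
    return {layer: kv_owner(layer, stride) for layer in range(num_layers)}


mapping = kv_sharing_map(num_layers=10, stride=2)
unique_kv = len(set(mapping.values()))  # number of distinct K/V projections
print(mapping)
print(unique_kv)
```

Under this assumed grouping, half of the layers carry their own key/value projection, and the parameters saved on the other half are what could be spent on a larger model_dim.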
BigramHash
Uses bigram hash embeddings as part of the input representation.
parameters: {"hash_size":3072,"embedding_dim":112}
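A minimal sketch of a bigram hash embedding lookup, using the hash_size and embedding_dim values listed above. The mixing function, the sentinel used for the first position, and the random table initialization are all assumptions made for illustration, not the PR's implementation.

```python
import random

HASH_SIZE = 3072      # from the listed parameters
EMBEDDING_DIM = 112   # from the listed parameters


def bigram_bucket(prev_token: int, token: int, hash_size: int = HASH_SIZE) -> int:
    # Hypothetical mixing function: multiply, xor, then reduce modulo table size.
    return ((prev_token * 1_000_003) ^ token) % hash_size


# Stand-in for a learned table: one EMBEDDING_DIM vector per hash bucket.
rng = random.Random(0)
table = [[rng.gauss(0.0, 0.02) for _ in range(EMBEDDING_DIM)]
         for _ in range(HASH_SIZE)]


def bigram_embeddings(tokens: list[int]) -> list[list[float]]:
    # Position 0 has no predecessor; pair it with a sentinel id of 0 (assumption).
    prev = [0] + tokens[:-1]
    return [table[bigram_bucket(p, t)] for p, t in zip(prev, tokens)]


vectors = bigram_embeddings([17, 42, 42, 7])
```

Because distinct bigrams can collide in the same bucket, the table stays small (3072 rows) regardless of vocabulary size.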
TrigramHash
Uses trigram hash embeddings as part of the input representation.
parameters: null
ReLU²
MLP uses squared ReLU activation.
parameters: null
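For reference, squared ReLU on a single value can be written as follows; the function name `relu_squared` is chosen here for illustration.

```python
def relu_squared(x: float) -> float:
    # max(0, x) squared: zero for non-positive inputs, x * x otherwise.
    r = max(0.0, x)
    return r * r
```

In an MLP this would be applied elementwise between the up and down projections.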
Regularization
logit softcap
parameters: {"value":30}
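A common way to soft-cap logits is to pass them through a scaled tanh. The sketch below assumes that form with the listed value of 30; the PR summary does not say which capping function is used.

```python
import math


def softcap(logit: float, cap: float = 30.0) -> float:
    # Assumed form: cap * tanh(logit / cap). Near-identity for small logits,
    # smoothly saturating toward +/- cap for large ones.
    return cap * math.tanh(logit / cap)
```

With this form, logits well below the cap pass through almost unchanged, while extreme values are bounded in magnitude by 30.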
Optimizer
Muon
weight_decay: 0.04
momentum: 0.95
other_params: null
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_interval_steps":50}
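The two averaging schemes can be sketched side by side with the listed ema_decay and swa_interval_steps. How the PR combines them is not stated, so this sketch keeps them independent: the EMA is updated every step and the SWA averages raw weight snapshots every 50 steps. The toy "optimizer step" and the `SwaAverager` class are hypothetical.

```python
EMA_DECAY = 0.997     # from the listed parameters
SWA_INTERVAL = 50     # from the listed parameters


def ema_update(ema, weights, decay=EMA_DECAY):
    # Exponential moving average: ema <- decay * ema + (1 - decay) * weights.
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


class SwaAverager:
    """Running uniform mean over periodic weight snapshots."""

    def __init__(self):
        self.mean = None
        self.count = 0

    def add(self, weights):
        self.count += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            self.mean = [m + (w - m) / self.count
                         for m, w in zip(self.mean, weights)]


weights = [0.0, 0.0]
ema = list(weights)
swa = SwaAverager()
for step in range(1, 201):
    weights = [w + 0.01 for w in weights]  # stand-in for an optimizer step
    ema = ema_update(ema, weights)
    if step % SWA_INTERVAL == 0:
        swa.add(weights)  # assumption: SWA snapshots the raw weights
```

After 200 steps the SWA mean covers four snapshots, while the EMA weights recent steps more heavily and lags the raw weights.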
Quantization
late QAT
bits: 6
scope: artifact path
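A minimal sketch of 6-bit fake quantization with a straight-through estimator (STE), the mechanism usually meant by QAT with STE. The symmetric per-tensor scale and the level range [-31, 31] are assumptions; the PR summary gives only the bit width.

```python
def fake_quant_int6(w: list[float]) -> list[float]:
    # Assumption: symmetric per-tensor scaling onto signed levels [-31, 31].
    qmax = 31
    max_abs = max(abs(x) for x in w)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Quantize to an integer level, clamp, then dequantize back to float.
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in w]


def ste_grad(upstream: list[float]) -> list[float]:
    # Straight-through estimator: the backward pass treats rounding as the
    # identity, so the upstream gradient reaches the full-precision weights.
    return list(upstream)


w = [0.5, -1.0, 0.26]
q = fake_quant_int6(w)
```

In a "late" QAT schedule, the forward pass would switch from `w` to `q` only in the final low-learning-rate phase of training, so the weights settle near values that survive 6-bit rounding.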
Compression
zstd
level: 22

Novel Contributions

  • Independent 3-seed reproduction of K_KVShare_Wider FLA
  • KV sharing stride=2 to buy width rather than depth
  • Late Int6 QAT with STE during low-learning-rate phase
  • EMA plus SWA training recipe
  • Int6 plus zstd-22 artifact compression
  • No TTT / no SLOT / no n-gram overlay / no XSA eval