val_bpb: 1.0339
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 15,883,866 bytes
Training Techniques
Architecture
GatedDeltaNet
FLA/GatedDeltaNet model family in the K_KVShare_Wider configuration, which shares KV projections across layers to trade depth for width.
parameters: {"layers":10,"model_dim":544,"heads":8,"head_dim":64,"kv_sharing_stride":2}
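A minimal sketch of the kv_sharing_stride=2 layout. The exact sharing pattern is an assumption (the parameters only specify the stride); here the second layer of each pair reuses the first layer's K/V projections:

```python
def kv_source_layers(num_layers: int, stride: int) -> list[int]:
    """For each layer, the index of the layer whose KV projections it reuses.

    With stride=2, layers are grouped in pairs and the second layer of each
    pair shares the first layer's K/V projections, freeing parameters that
    the K_KVShare_Wider configuration spends on a wider model_dim instead
    of extra depth.
    """
    return [i - (i % stride) for i in range(num_layers)]
```

With the card's layers=10 and kv_sharing_stride=2 this yields `[0, 0, 2, 2, 4, 4, 6, 6, 8, 8]`, i.e. five distinct KV projection sets for ten layers.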
BigramHash
Uses bigram hash embeddings as part of the input representation.
parameters: {"hash_size":3072,"embedding_dim":112}
TrigramHash
Uses trigram hash embeddings as part of the input representation.
parameters: null
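A sketch of how hashed n-gram features can augment the input embedding. The polynomial rolling hash below is illustrative, not the run's actual hash function; hash_size=3072 comes from the BigramHash parameters above, and the bucket indices would be looked up in separate hash embedding tables and combined with the ordinary token embedding:

```python
def ngram_hash(tokens: tuple[int, ...], hash_size: int) -> int:
    """Mix an n-gram of token ids into a bucket index (illustrative hash)."""
    h = 0
    for t in tokens:
        h = (h * 1000003 + t) % (1 << 32)  # polynomial rolling hash
    return h % hash_size

def input_features(token_ids: list[int], hash_size: int = 3072) -> list[tuple[int, int]]:
    """Per position: (bigram bucket, trigram bucket) for the hash
    embedding tables; pad id 0 is assumed at the sequence start."""
    feats = []
    for i, t in enumerate(token_ids):
        prev1 = token_ids[i - 1] if i >= 1 else 0
        prev2 = token_ids[i - 2] if i >= 2 else 0
        bigram = ngram_hash((prev1, t), hash_size)
        trigram = ngram_hash((prev2, prev1, t), hash_size)
        feats.append((bigram, trigram))
    return feats
```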
ReLU²
MLP uses squared ReLU activation.
parameters: null
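Squared ReLU is simply ReLU followed by squaring:

```python
def relu2(x: float) -> float:
    """Squared ReLU: max(x, 0)^2 — zero for negative inputs, smooth at
    zero, and quadratic (rather than linear) for positive inputs."""
    return max(x, 0.0) ** 2
```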
Regularization
logit softcap
parameters: {"value":30}
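With value 30, logit softcapping is conventionally the tanh form; assuming this run uses the common `cap * tanh(logit / cap)` formulation:

```python
import math

def softcap(logit: float, cap: float = 30.0) -> float:
    """Smoothly bound a logit to (-cap, cap).

    Near-identity for |logit| << cap, so typical logits pass through almost
    unchanged while extreme logits are prevented from dominating the loss.
    """
    return cap * math.tanh(logit / cap)
```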
Optimizer
Muon
weight_decay: 0.04
momentum: 0.95
other_params: null
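A minimal sketch of the Muon update for a 2-D weight matrix, assuming the published recipe (momentum accumulation, Newton-Schulz orthogonalization of the update, decoupled weight decay). The pure-Python matrix helpers, the learning rate, and the non-Nesterov momentum variant are illustrative; momentum=0.95 and weight_decay=0.04 match this card:

```python
def matmul(A, B):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]

def ns_orthogonalize(G, steps=5):
    """Drive the singular values of G toward 1 via a quintic Newton-Schulz
    iteration, yielding an approximately (semi-)orthogonal update."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon write-up
    norm = sum(x * x for row in G for x in row) ** 0.5 + 1e-7
    X = [[x / norm for x in row] for row in G]
    for _ in range(steps):
        A = matmul(X, [list(r) for r in zip(*X)])  # X @ X^T
        A2 = matmul(A, A)
        B = [[b * p + c * q for p, q in zip(ra, ra2)] for ra, ra2 in zip(A, A2)]
        BX = matmul(B, X)
        X = [[a * p + q for p, q in zip(rx, rbx)] for rx, rbx in zip(X, BX)]
    return X

def muon_step(W, G, M, lr=0.02, momentum=0.95, weight_decay=0.04):
    """One Muon update: momentum buffer M, orthogonalized step, decoupled
    weight decay. lr=0.02 is an illustrative assumption."""
    M = [[momentum * m + g for m, g in zip(rm, rg)] for rm, rg in zip(M, G)]
    O = ns_orthogonalize(M)
    W = [[(1 - lr * weight_decay) * w - lr * o for w, o in zip(rw, ro)]
         for rw, ro in zip(W, O)]
    return W, M
```

Note the coefficients are tuned for speed rather than exact convergence, so the result is only approximately orthogonal.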
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_interval_steps":50}
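A sketch of maintaining both averages with ema_decay=0.997 and swa_interval_steps=50, over a flat parameter vector for simplicity. How the two averages are combined or selected for the final artifact is left unspecified by the card:

```python
class WeightAverager:
    """Keep an exponential moving average every step and a simple
    (SWA-style) running mean snapshotted every `swa_interval` steps."""

    def __init__(self, params, ema_decay=0.997, swa_interval=50):
        self.ema_decay = ema_decay
        self.swa_interval = swa_interval
        self.ema = list(params)
        self.swa = list(params)
        self.swa_count = 1

    def update(self, step, params):
        d = self.ema_decay
        # EMA: updated every optimizer step
        self.ema = [d * e + (1 - d) * p for e, p in zip(self.ema, params)]
        # SWA: uniform running mean over periodic snapshots
        if step % self.swa_interval == 0:
            self.swa_count += 1
            n = self.swa_count
            self.swa = [s + (p - s) / n for s, p in zip(self.swa, params)]
```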
Quantization
late QAT
bits: 6
scope: artifact path
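A sketch of int6 fake quantization with a straight-through estimator, assuming symmetric per-tensor scaling; the card's `bits: 6` does not specify the actual scheme:

```python
def fake_quant_int6(weights, bits=6):
    """Quantize-dequantize to signed int6 on the forward pass.

    In QAT the backward pass uses the straight-through estimator: the
    round/clamp is treated as identity, so gradients flow to the float
    weights while the forward pass only sees int6-representable values.
    Running this late, during the low-learning-rate phase, lets the
    weights settle onto the quantization grid with minimal quality loss.
    """
    qmax = 2 ** (bits - 1) - 1  # 31 for int6
    scale = max((abs(w) for w in weights), default=0.0) / qmax
    if scale == 0.0:
        return list(weights)
    q = [min(qmax, max(-qmax - 1, round(w / scale))) for w in weights]
    return [v * scale for v in q]
```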
Compression
zstd
level: 22
Novel Contributions
- Independent 3-seed reproduction of K_KVShare_Wider FLA
- KV sharing stride=2 to buy width rather than depth
- Late Int6 QAT with STE during low-learning-rate phase
- EMA plus SWA training recipe
- Int6 plus zstd-22 artifact compression
- No TTT / no SLOT / no n-gram overlay / no XSA eval