val_bpb: 1.0339
Architecture: Transformer
Optimizer: —
Artifact Size: 15,870,797 bytes
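For reference, a val_bpb figure like the one above is conventionally obtained by converting the summed validation negative log-likelihood (in nats) into bits and dividing by the UTF-8 byte count of the validation text. This is a sketch of that standard definition, not necessarily the exact harness used here:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Summed NLL in nats over a corpus, converted to bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)
```

A model that pays exactly ln(2) nats per byte scores 1.0 bpb under this definition.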
Training Techniques
Architecture: GatedDeltaNet / Flash Linear Attention
FLA-family model using the K_KVShare_Wider variant: KV sharing frees parameters that are spent on widening the model rather than deepening it.
parameters: {"layers": 10, "model_dim": 544}
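The width-for-depth trade can be seen in a simple parameter count. The sketch below uses the card's layers=10, model_dim=544 config; the sharing granularity (one K/V projection pair shared per group of layers) and square projection shapes are illustrative assumptions, not the submission's exact scheme:

```python
def attn_proj_params(model_dim: int, layers: int, kv_share_group: int = 1) -> int:
    """Parameter count of Q/K/V/O projections when K and V are shared
    within each group of kv_share_group consecutive layers."""
    q_and_o = 2 * layers * model_dim * model_dim       # per-layer Q and O
    kv_owners = -(-layers // kv_share_group)           # ceil: unique K/V pairs
    k_and_v = 2 * kv_owners * model_dim * model_dim    # shared K and V
    return q_and_o + k_and_v

baseline = attn_proj_params(544, 10)                     # no sharing
shared = attn_proj_params(544, 10, kv_share_group=2)     # K/V shared per pair
freed = baseline - shared                                # budget for extra width
```

With pairwise sharing, a quarter of the projection budget is freed and can be reinvested in a larger model_dim.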
Quantization: late QAT
bits: 6
scope: artifact path
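The core operation in late QAT is a quantize-dequantize ("fake quant") step applied to the weights near the end of training, so the final artifact can be stored at 6 bits per weight. This is a minimal sketch assuming symmetric per-tensor scaling; the card does not specify the actual scheme:

```python
def fake_quant_int6(weights: list) -> tuple:
    """Quantize-dequantize to signed 6-bit integers (range -32..31).

    Returns (dequantized_weights, int6_codes). Symmetric per-tensor
    scale is an assumption for illustration.
    """
    qmax = 31  # 2**(6 - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid div-by-zero
    codes = [max(-32, min(31, round(w / scale))) for w in weights]
    dequant = [c * scale for c in codes]
    return dequant, codes
```

During QAT the forward pass sees the dequantized weights, so the loss adapts to the 6-bit grid before the artifact is exported.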
Weight Averaging: EMA + SWA
parameters: {"ema_decay": 0.997, "swa_interval_steps": 50}
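Combining the two averages means keeping an EMA shadow updated every step and folding periodic snapshots into an equal-weight SWA mean. The sketch below uses the card's ema_decay=0.997 and swa_interval_steps=50; plain Python lists stand in for weight tensors, and `step_fn` is a stub for one optimizer update:

```python
EMA_DECAY = 0.997
SWA_INTERVAL = 50

def train_with_averaging(weights, num_steps, step_fn):
    """Maintain an EMA of the live weights every step, and an SWA
    running mean over snapshots taken every SWA_INTERVAL steps."""
    ema = list(weights)
    swa = [0.0] * len(weights)
    swa_count = 0
    for step in range(1, num_steps + 1):
        weights = step_fn(weights, step)  # one optimizer update (stub)
        # EMA: exponential moving average, updated every step
        ema = [EMA_DECAY * e + (1 - EMA_DECAY) * w
               for e, w in zip(ema, weights)]
        # SWA: equal-weight running mean of periodic snapshots
        if step % SWA_INTERVAL == 0:
            swa_count += 1
            swa = [s + (w - s) / swa_count for s, w in zip(swa, weights)]
    return ema, swa
```

How the EMA and SWA estimates are blended into the final checkpoint is not specified on the card.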
Compression: zstd
level: 22
Other
SP8192 tokenizer / SentencePiece LUT handling for the scorer path, including corrected byte-accounting semantics for leading-space, byte, and unused tokens.
parameters: null
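The byte-accounting fix concerns how many UTF-8 bytes of original text each vocabulary piece "pays for" when scoring bits-per-byte. The sketch below follows standard SentencePiece conventions (the '▁' metasymbol marks a leading space, `<0xNN>` pieces are byte-fallback tokens, control and unused pieces cover no text); the actual scorer-path logic may differ:

```python
import re

def token_byte_length(piece: str) -> int:
    """UTF-8 bytes of original text that a SentencePiece vocab piece
    accounts for in a bits-per-byte computation."""
    if re.fullmatch(r"<0x[0-9A-Fa-f]{2}>", piece):
        return 1  # byte-fallback token: exactly one raw byte
    if piece in ("<unk>", "<s>", "</s>") or piece.startswith("<unused"):
        return 0  # control/unused tokens cover no text
    # '▁' stands for a leading space: count the single space byte it
    # replaces, not the 3 UTF-8 bytes of the metasymbol itself.
    return len(piece.replace("\u2581", " ").encode("utf-8"))
```

Miscounting any of these three classes (metasymbol counted as 3 bytes, byte tokens as 0, unused tokens as nonzero) skews the bpb denominator, which is presumably the bug the card says was corrected.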
Novel Contributions
- K_KVShare_Wider FLA family reproduction on 8xH100 SXM
- KV sharing used to buy width rather than depth
- EMA + SWA + late QAT int6 artifact path
- SP8192 tokenizer setup
- Corrected SentencePiece byte-accounting bug in scorer path