PR #1705

closed

Record: K_KVShare_Wider FLA — val_bpb 1.0339 (3-seed mean)

by genji0306

val_bpb

1.0339

Architecture

Transformer

Optimizer

—

Artifact Size

15,870,797 bytes

Training Techniques

Architecture

GatedDeltaNet / Flash Linear Attention

An FLA-family model in the K_KVShare_Wider configuration, which uses KV sharing to widen the model rather than deepen it.

parameters: {"layers":10,"model_dim":544}
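The width-for-depth trade can be illustrated with a rough parameter count. This is only a sketch under stated assumptions: it counts plain square Q/K/V/O projections and ignores the gates, convolutions, and MLP of a real GatedDeltaNet layer, and the `kv_share_group` parameter and both function names are hypothetical, not taken from the PR.

```python
def attn_projection_params(model_dim, layers, kv_share_group=1):
    """Count Q/K/V/O projection weights when K/V are shared across layer groups.

    Assumption: each projection is a square model_dim x model_dim matrix.
    kv_share_group=1 means no sharing; 2 means each K/V pair serves two layers.
    """
    per_matrix = model_dim * model_dim
    qo = 2 * per_matrix * layers                # Q and O stay per-layer
    kv_sets = -(-layers // kv_share_group)      # ceil(layers / group)
    kv = 2 * per_matrix * kv_sets               # one K and one V per group
    return qo + kv


def widest_dim_within(budget, layers, kv_share_group, step=16):
    """Largest model_dim (multiple of `step`) whose projections fit the budget.

    Assumes the budget is large enough for at least model_dim == step.
    """
    dim = step
    while attn_projection_params(dim + step, layers, kv_share_group) <= budget:
        dim += step
    return dim
```

Under these assumptions, any budget that fits an unshared model of a given width fits a strictly wider model once K/V are shared, which is the sense in which sharing "buys" width.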

Quantization

late QAT

bits: 6

scope: artifact path
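As a minimal sketch of what late QAT at 6 bits could look like: weights are rounded to a symmetric 6-bit grid and immediately dequantized ("fake quantization"), and this is switched on only for the final stretch of training. The per-row scaling, the `late_steps` parameter, and the function names are assumptions for illustration; the PR does not describe its quantizer in this detail.

```python
def fake_quant_int6(row):
    """Symmetric per-row fake quantization to signed 6-bit levels in [-31, 31].

    Returns (dequantized_row, scale). Per-row scaling is an assumption.
    """
    qmax = 31  # 2**(6 - 1) - 1
    max_abs = max(abs(w) for w in row)
    if max_abs == 0.0:
        return list(row), 1.0
    scale = max_abs / qmax
    # Round to the nearest level, then clamp to the representable range.
    levels = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return [q * scale for q in levels], scale


def qat_active(step, total_steps, late_steps):
    """'Late' QAT: enable fake quantization only for the last `late_steps` steps."""
    return step >= total_steps - late_steps
```

In an actual training loop the rounding would be paired with a straight-through estimator so gradients pass through unchanged; that part is omitted here.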

Weight Averaging

EMA + SWA

parameters: {"ema_decay":0.997,"swa_interval_steps":50}
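The two averaging schemes can be sketched on plain lists of floats. EMA updates every step with decay 0.997; SWA keeps an equal-weight running mean of snapshots taken every 50 steps. How the PR combines the two averages is not stated, so this sketch keeps them separate, and the helper names are hypothetical.

```python
def ema_update(ema, weights, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


class SWA:
    """Equal-weight running mean over periodically sampled weight snapshots."""

    def __init__(self):
        self.mean = None
        self.count = 0

    def update(self, weights):
        self.count += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            # Incremental mean: m <- m + (w - m) / n
            self.mean = [m + (w - m) / self.count
                         for m, w in zip(self.mean, weights)]


def run_averaging(weight_trajectory, ema_decay=0.997, swa_interval_steps=50):
    """Apply EMA every step and SWA every `swa_interval_steps` steps.

    `weight_trajectory` is a list of weight vectors, one per training step.
    """
    ema = list(weight_trajectory[0])
    swa = SWA()
    for step, weights in enumerate(weight_trajectory, start=1):
        ema = ema_update(ema, weights, ema_decay)
        if step % swa_interval_steps == 0:
            swa.update(weights)
    return ema, swa.mean
```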

Compression

zstd

level: 22

Other

other

SP8192 tokenizer / SentencePiece LUT handling for scorer path, including corrected byte-accounting semantics for leading-space, byte, and unused tokens.

parameters: null
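Because val_bpb divides total loss by the number of bytes in the scored text, the per-token byte lookup table (LUT) directly affects the reported number. The following is one plausible reading of the corrected semantics, not the PR's actual code: the SentencePiece leading-space marker "▁" counts as one byte (a space), a byte-fallback token counts as one byte, and unused or control tokens count as zero. The `kind` labels and function names are hypothetical.

```python
import math


def piece_num_bytes(piece, kind="normal"):
    """Bytes of original text that one token accounts for (assumed semantics)."""
    if kind in ("unused", "control"):
        return 0                      # never corresponds to text bytes
    if kind == "byte":
        return 1                      # byte-fallback piece such as "<0xE2>"
    # "▁" (U+2581) stands for a single space, not its 3-byte UTF-8 encoding.
    return len(piece.replace("\u2581", " ").encode("utf-8"))


def build_byte_lut(pieces):
    """pieces: list of (piece_string, kind) pairs indexed by token id."""
    return [piece_num_bytes(p, k) for p, k in pieces]


def bits_per_byte(token_nll_nats, token_ids, lut):
    """Total NLL converted from nats to bits, divided by total text bytes."""
    total_bytes = sum(lut[t] for t in token_ids)
    return sum(token_nll_nats) / (math.log(2) * total_bytes)
```

Counting "▁" as three bytes, or counting byte-fallback pieces by the length of their "<0x..>" spelling, would inflate the denominator and make val_bpb look better than it is, which is why the accounting matters for a record claim.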

Novel Contributions

K_KVShare_Wider FLA family reproduction on 8xH100 SXM
KV sharing used to buy width rather than depth
EMA + SWA + late QAT int6 artifact path
SP8192 tokenizer setup
Corrected SentencePiece byte-accounting bug in scorer path