PR #1791

open

Record: K_KVShare_Wider FLA — val_bpb 1.0339 (3-seed mean)

val_bpb

1.0339

Architecture

Hybrid

Optimizer

Muon

Artifact Size

15,883,866 bytes

Training Techniques

Architecture

GatedDeltaNet

GatedDeltaNet model from the FLA family in the K_KVShare_Wider configuration, which uses KV sharing to trade depth for width.

parameters: {"layers":10,"model_dim":544,"heads":8,"head_dim":64,"kv_sharing_stride":2}
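
The summary does not say how kv_sharing_stride is applied. A minimal sketch, assuming that layers are grouped in consecutive runs of `stride` and that every layer in a run reuses the key/value projections of the first one (the function name `kv_source_layer` is hypothetical):

```python
LAYERS = 10
KV_SHARING_STRIDE = 2


def kv_source_layer(layer_idx: int, stride: int = KV_SHARING_STRIDE) -> int:
    """Return the index of the layer whose K/V projections this layer reuses.

    Assumption: sharing groups are consecutive runs of `stride` layers.
    """
    return (layer_idx // stride) * stride


sources = [kv_source_layer(i) for i in range(LAYERS)]
print(sources)  # [0, 0, 2, 2, 4, 4, 6, 6, 8, 8]

# Only the distinct source layers would own K/V projection weights.
owners = sorted(set(sources))
print(len(owners))  # 5
```

Under this reading, half of the layers carry no K/V projection weights of their own, which would free parameter budget for a larger model_dim.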

BigramHash

Uses bigram hash embeddings as part of the input representation.

parameters: {"hash_size":3072,"embedding_dim":112}
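
A hedged sketch of how a bigram hash lookup could work with these parameters. The hash function, its constant, and the padding rule for the first position are illustrative assumptions, not taken from the PR:

```python
HASH_SIZE = 3072       # number of hash buckets (from the parameters above)
EMBEDDING_DIM = 112    # width of each bucket's embedding (from the parameters above)


def bigram_bucket(prev_token: int, token: int, hash_size: int = HASH_SIZE) -> int:
    """Map a (previous token, current token) pair to a bucket index.

    The mixing constant is an arbitrary choice for illustration.
    """
    return ((prev_token * 1000003) ^ token) % hash_size


def bigram_buckets(tokens: list[int]) -> list[int]:
    """Bucket index per position; position 0 is paired with a pad id of 0 (assumption)."""
    prev_tokens = [0] + tokens[:-1]
    return [bigram_bucket(p, t) for p, t in zip(prev_tokens, tokens)]


buckets = bigram_buckets([17, 942, 5, 17, 942])
# Each bucket index would select one row of a (HASH_SIZE, EMBEDDING_DIM) table.
table_params = HASH_SIZE * EMBEDDING_DIM
print(table_params)  # 344064
```

Identical bigrams always land in the same bucket, while unrelated bigrams may collide; the small table keeps the cost at 344,064 parameters.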

TrigramHash

Uses trigram hash embeddings as part of the input representation.

parameters: null

ReLU²

The MLP uses a squared ReLU activation.

parameters: null
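
For reference, squared ReLU is relu(x) squared. A scalar sketch (the function name is illustrative):

```python
def relu_squared(x: float) -> float:
    """Squared ReLU: zero for negative inputs, x**2 otherwise."""
    r = max(x, 0.0)
    return r * r


print(relu_squared(-2.0))  # 0.0
print(relu_squared(3.0))   # 9.0
```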

Regularization

logit softcap

parameters: {"value":30}
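
The PR does not give the softcap formula. A minimal sketch, assuming the common tanh form `cap * tanh(logit / cap)`:

```python
import math

SOFTCAP = 30.0


def softcap(logit: float, cap: float = SOFTCAP) -> float:
    """Smoothly bound a logit to the open interval (-cap, cap).

    Near zero the function is close to the identity; large values saturate.
    """
    return cap * math.tanh(logit / cap)


small = softcap(1.0)     # close to 1.0
large = softcap(1000.0)  # saturates near 30.0
```

Small logits pass through almost unchanged, while extreme logits cannot exceed the cap in magnitude.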

Optimizer

Muon

weight_decay: 0.04

momentum: 0.95

other_params: null

Weight Averaging

EMA + SWA

parameters: {"ema_decay":0.997,"swa_interval_steps":50}
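
How the EMA and SWA averages are combined is not stated here. The sketch below only illustrates the two update rules with the listed parameters, using flat lists of floats as stand-ins for model weights; the class and function names are hypothetical:

```python
EMA_DECAY = 0.997
SWA_INTERVAL_STEPS = 50


def ema_update(ema: list[float], weights: list[float], decay: float = EMA_DECAY) -> list[float]:
    """One exponential-moving-average step, applied after every optimizer step."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


class SWAAverager:
    """Running mean of weight snapshots taken every `interval` steps."""

    def __init__(self, interval: int = SWA_INTERVAL_STEPS) -> None:
        self.interval = interval
        self.count = 0
        self.mean: list[float] | None = None

    def maybe_update(self, step: int, weights: list[float]) -> None:
        if step % self.interval != 0:
            return
        self.count += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            self.mean = [m + (w - m) / self.count for m, w in zip(self.mean, weights)]


# Toy run: the single "weight" simply equals the step number.
swa = SWAAverager()
ema = [0.0]
for step in range(1, 201):
    weights = [float(step)]
    ema = ema_update(ema, weights)
    swa.maybe_update(step, weights)
```

In the toy run, SWA averages the snapshots taken at steps 50, 100, 150, and 200, while the EMA lags behind the latest weights because of the high decay.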

Quantization

late QAT

bits: 6

scope: artifact path
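
"Late QAT" usually means that fake quantization is switched on only near the end of training; the schedule and quantizer used in this PR are not given. A sketch of one 6-bit fake-quantization step, assuming a symmetric signed range of [-31, 31] and one scale per row (both are assumptions):

```python
QMAX = 31  # symmetric signed 6-bit range [-31, 31] (assumption)


def row_scale(row: list[float], qmax: int = QMAX) -> float:
    """Per-row scale so that the largest magnitude maps to qmax."""
    max_abs = max(abs(v) for v in row)
    return max_abs / qmax if max_abs > 0.0 else 1.0


def fake_quant_int6(row: list[float], qmax: int = QMAX) -> list[float]:
    """Quantize to 6-bit integer levels, then dequantize back to floats."""
    scale = row_scale(row, qmax)
    levels = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return [q * scale for q in levels]


row = [0.5, -1.0, 0.25, 0.0, 0.013]
quantized = fake_quant_int6(row)
max_error = max(abs(a - b) for a, b in zip(row, quantized))
```

The round-trip error per element is bounded by half a quantization step, which is what the model learns to tolerate during the QAT phase.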

Compression

zstd

level: 22