PR #354

open

[Non-record] MLA + SmearGate + BigramHash + SWA — pre-quant 1.2838 bpb

by Skrisps26
val_bpb
1.2838
Architecture
Transformer
Optimizer
Muon
Artifact Size
14.449 MB

Training Techniques

Architecture
MLA
Multi-Head Latent Attention with reduced-rank KV projection to improve parameter efficiency.
parameters: {"kv_rank":128,"num_heads":8,"num_kv_heads":4}
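A minimal sketch of the MLA attention described above, using the PR's kv_rank=128, num_heads=8, and num_kv_heads=4. The model width, head_dim, weight names, and the omission of RoPE are illustrative assumptions, not from the PR: a shared down-projection compresses the KV stream to a rank-128 latent (the part a KV cache would store), and small per-head up-projections expand it back for attention.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, kv_rank = 256, 128            # kv_rank=128 as in the PR
num_heads, num_kv_heads, head_dim = 8, 4, 32
group = num_heads // num_kv_heads      # each KV head serves 2 query heads

W_dkv = rng.standard_normal((d_model, kv_rank)) / np.sqrt(d_model)   # shared compressor
W_uk = rng.standard_normal((kv_rank, num_kv_heads * head_dim)) / np.sqrt(kv_rank)
W_uv = rng.standard_normal((kv_rank, num_kv_heads * head_dim)) / np.sqrt(kv_rank)
W_q = rng.standard_normal((d_model, num_heads * head_dim)) / np.sqrt(d_model)

def mla(x):
    T = x.shape[0]
    c_kv = x @ W_dkv                                        # (T, kv_rank) latent
    q = (x @ W_q).reshape(T, num_heads, head_dim)
    k = (c_kv @ W_uk).reshape(T, num_kv_heads, head_dim)
    v = (c_kv @ W_uv).reshape(T, num_kv_heads, head_dim)
    k = np.repeat(k, group, axis=1)                         # GQA: share KV heads
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("thd,shd->hts", q, k) / np.sqrt(head_dim)
    scores += np.triu(np.full((T, T), -np.inf), k=1)        # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return np.einsum("hts,shd->thd", w, v).reshape(T, num_heads * head_dim)

x = rng.standard_normal((16, d_model))
print(mla(x).shape)  # (16, 256)
```

The parameter saving comes from caching and projecting the 128-dim latent instead of full per-head keys and values.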
SmearGate
SmearGate MLP using relu^2 gating.
parameters: {"mlp_mult":3}
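The PR does not spell out SmearGate's internals. One plausible reading, sketched below, gates a blend of each position with its predecessor (the "smear") before a relu² MLP with the stated 3x expansion; the smear interpretation and all weight names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, mlp_mult = 64, 3                 # mlp_mult=3 as in the PR
d_hidden = mlp_mult * d_model

W_in = rng.standard_normal((d_model, d_hidden)) / np.sqrt(d_model)
W_out = rng.standard_normal((d_hidden, d_model)) / np.sqrt(d_hidden)
w_gate = rng.standard_normal(d_model) / np.sqrt(d_model)   # per-position smear gate

def relu2(z):
    return np.maximum(z, 0.0) ** 2        # relu^2 activation

def smear_gate_mlp(x):
    # "Smear": blend each position toward its predecessor via a sigmoid gate
    # (an interpretation of the name; the PR does not spell this out).
    g = 1.0 / (1.0 + np.exp(-(x @ w_gate)))     # (T,) gate values in (0, 1)
    prev = np.roll(x, 1, axis=0)
    prev[0] = x[0]                               # nothing before position 0
    smeared = x + g[:, None] * (prev - x)
    return relu2(smeared @ W_in) @ W_out         # relu^2 MLP, 3x expansion

x = rng.standard_normal((8, d_model))
print(smear_gate_mlp(x).shape)  # (8, 64)
```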
BigramHash
BigramHash embeddings using hashed bigram buckets.
parameters: {"buckets":10240,"dim":128}
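A minimal sketch of hashed bigram embeddings with the PR's 10240 buckets and 128 dimensions. The hash mixing constant, padding at position 0, and initialization scale are assumptions; the idea is to give each (previous, current) token pair a learned vector without a full vocab² table.

```python
import numpy as np

BUCKETS, DIM = 10240, 128   # from the PR parameters

rng = np.random.default_rng(2)
bigram_table = rng.standard_normal((BUCKETS, DIM)).astype(np.float32) * 0.02

def bigram_bucket(prev_tok, tok):
    # Hash the (previous, current) token pair into one of 10240 buckets.
    # The multiplier is an arbitrary mixing prime, not from the PR.
    return (prev_tok * 1000003 + tok) % BUCKETS

def bigram_embed(tokens):
    prev = [tokens[0]] + list(tokens[:-1])   # pad position 0 with itself
    idx = [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
    return bigram_table[idx]                 # (T, 128), added to token embeddings

print(bigram_embed([5, 17, 17, 9]).shape)  # (4, 128)
```

At 10240 × 128 parameters the table is cheap relative to the rest of the model, and hash collisions simply share a bucket.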
Weight Averaging
SWA (stochastic weight averaging)
parameters: {"start_frac":0.4,"every":50}
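With start_frac=0.4 and every=50, snapshots begin at 40% of training and the running equal-weight average is updated every 50 steps. A sketch with scalar parameters standing in for tensors:

```python
class SWA:
    """Running equal-weight average of checkpoints (start_frac=0.4, every=50)."""

    def __init__(self, total_steps, start_frac=0.4, every=50):
        self.start = int(start_frac * total_steps)
        self.every = every
        self.avg = None
        self.n = 0

    def update(self, step, params):
        if step < self.start or (step - self.start) % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = dict(params)
        else:
            for k, v in params.items():
                self.avg[k] += (v - self.avg[k]) / self.n   # incremental mean

swa = SWA(total_steps=1000)
for step in range(1000):
    swa.update(step, {"w": float(step)})
print(swa.n, swa.avg["w"])  # 12 675.0
```

The averaged weights replace the final-step weights at evaluation time.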
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: null
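A sketch of a Muon-style update with the PR's momentum 0.99 and weight decay 0.04. The Newton-Schulz coefficients follow the public Muon implementation; the learning rate, plain (non-Nesterov) momentum, and decoupled weight decay are assumptions here.

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Odd quintic iteration that pushes the singular values of G toward 1,
    # approximately orthogonalizing the momentum buffer.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    # momentum=0.99 and weight_decay=0.04 as in the PR; lr is a placeholder.
    buf[:] = momentum * buf + grad           # heavy-ball momentum buffer
    update = newton_schulz(buf)
    return W - lr * (update + weight_decay * W)

rng = np.random.default_rng(3)
W = rng.standard_normal((32, 64))
buf = np.zeros_like(W)
W2 = muon_step(W, rng.standard_normal(W.shape), buf)
print(W2.shape)  # (32, 64)
```

Muon is typically applied only to 2-D hidden-layer weight matrices, with embeddings and scalars handled by a standard optimizer.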
Quantization
mixed int5/int6
bits: null
scope: MLP and attention
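The PR applies mixed int5/int6 quantization over MLP and attention weights. A minimal symmetric per-tensor scheme, assuming round-to-nearest and one scale per tensor (the rounding mode and granularity are assumptions); the zstd level-22 pass below then compresses the packed integers.

```python
import numpy as np

def quantize_symmetric(w, bits):
    # Symmetric per-tensor quantization to signed `bits`-bit integers.
    qmax = 2 ** (bits - 1) - 1            # 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(4)
w = rng.standard_normal((64, 64)).astype(np.float32)
for bits in (5, 6):                       # mixed int5/int6 across tensors
    q, s = quantize_symmetric(w, bits)
    err = np.abs(dequantize(q, s) - w).max()
    print(bits, float(err) <= 0.5 * s + 1e-6)
```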
Compression
zstd
level: 22
Evaluation
sliding-window eval
parameters: {"stride":64}
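With stride=64 from the PR, each evaluation window scores only its final 64 tokens, so every token is scored exactly once with (near-)maximal preceding context. The context length below is a placeholder, not from the PR:

```python
def sliding_windows(n_tokens, context=512, stride=64):
    # Yields (window_start, first_scored_token, window_end) spans.
    # Tokens before first_scored_token are context only; stride=64 as in the PR.
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - context)
        end = min(pos + stride, n_tokens)
        spans.append((start, pos, end))
        pos = end
    return spans

print(sliding_windows(200, context=128, stride=64))
# [(0, 0, 64), (0, 64, 128), (64, 128, 192), (128, 192, 200)]
```

Summing the scored-token losses over all spans and dividing by total bytes yields the reported bits-per-byte.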

Novel Contributions

  • Combines MLA with kv_rank=128 for parameter-efficient attention
  • Introduces SmearGate MLP with relu^2 gating and mlp_mult=3
  • Uses BigramHash embeddings with 10240 buckets and 128-dimensional embeddings
  • Applies SWA from 40% of training onward, updating the average every 50 steps
  • Uses Muon optimizer with momentum 0.99 and weight decay 0.04
  • Employs mixed int5/int6 quantization with zstd-22 compression
  • Evaluates with sliding-window inference using stride 64