val_bpb: 1.3003
Architecture: Transformer
Optimizer: Muon
Artifact Size: —
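val_bpb is validation bits per byte: the model's summed cross-entropy (in nats) converted to base-2 and normalized by the byte length of the raw text. A minimal sketch of that conversion; the byte and loss totals below are placeholders, not numbers from this run.

```python
import math

def bits_per_byte(total_loss_nats, total_bytes):
    """Convert summed cross-entropy (nats) over a validation set
    into bits per byte of the underlying raw text."""
    return total_loss_nats / math.log(2) / total_bytes

# Placeholder totals: summed loss in nats over a known number of raw bytes
print(bits_per_byte(9.013e5, 1e6))
```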
Training Techniques
Architecture: KV head count
  Reduced key-value heads from 4 to 2 (GQA 8:2) to save parameters and improve throughput.
  parameters: {"num_kv_heads": 2}
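With grouped-query attention at 8:2, each of the 2 key-value heads is shared by a group of 4 query heads. A minimal NumPy sketch of that sharing, assuming 8 query heads (from the 8:2 ratio); sequence length and head dimension are placeholders.

```python
import numpy as np

def gqa_attention(q, k, v, num_q_heads=8, num_kv_heads=2):
    """Grouped-query attention: each KV head is repeated so that a group of
    query heads (here 8 // 2 = 4) attends against the same keys/values.
    q: (num_q_heads, T, d); k, v: (num_kv_heads, T, d)."""
    group = num_q_heads // num_kv_heads
    k = np.repeat(k, group, axis=0)          # (num_q_heads, T, d)
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)     # (H, T, T)
    scores -= scores.max(axis=-1, keepdims=True)        # stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                                        # (H, T, d)

T, d = 4, 16                                 # placeholder sizes
rng = np.random.default_rng(0)
q = rng.normal(size=(8, T, d))
k = rng.normal(size=(2, T, d))
v = rng.normal(size=(2, T, d))
out = gqa_attention(q, k, v)
print(out.shape)
```

Only the 2 KV heads' projections are stored and cached, which is where the parameter and throughput savings come from.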
Optimizer: Muon
  weight_decay: null
  momentum: 0.97
  other_params: {"matrix_lr": 0.048, "scalar_lr": 0.03}
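Muon applies SGD-style momentum to matrix parameters and then orthogonalizes the update with a Newton-Schulz iteration before stepping. A rough sketch of one such step using this run's matrix_lr=0.048 and momentum=0.97; the quintic coefficients follow the public Muon reference implementation, and this is not the exact training code (e.g. it omits the transpose handling for wide matrices).

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G (push its singular values toward 1)
    with the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from Muon's reference code
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(W, grad, buf, lr=0.048, momentum=0.97):
    """One Muon-style update: momentum accumulation, orthogonalized step."""
    buf = momentum * buf + grad
    W = W - lr * newton_schulz(buf)
    return W, buf

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
g = rng.normal(size=(64, 64))
W2, buf = muon_step(W, g, np.zeros_like(W))
print(W2.shape)
```

Scalar (non-matrix) parameters are handled separately, which is what the distinct scalar_lr=0.03 refers to.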
LR Schedule: warmdown
  parameters: {"warmdown_iters": 600}
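A warmdown schedule holds the learning rate flat and then decays it linearly to zero over the final iterations. A minimal sketch with this run's warmdown_iters=600; the total iteration count is a placeholder, and linear decay is an assumption (warmdown is typically linear, but the exact shape is not stated here).

```python
def lr_scale(step, total_iters, warmdown_iters=600):
    """Multiplier on the base LR: 1.0 for most of training, then a linear
    'warmdown' to zero over the final warmdown_iters steps."""
    flat_until = total_iters - warmdown_iters
    if step < flat_until:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)

# total_iters=5000 is a placeholder; only warmdown_iters comes from the config
print(lr_scale(0, 5000), lr_scale(4700, 5000), lr_scale(5000, 5000))
```

Shortening the warmdown (one of the contributions below) trades a longer flat phase for a steeper final decay.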
Initialization: QK_GAIN_INIT
  Lowered the QK gain initialization to 1.35 to reduce initial attention sharpness.
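One way to read this: a learnable per-head gain multiplies the query-key logits, and initializing it lower makes early attention distributions flatter (less peaked softmax). The sketch below is an assumption about where the gain is applied; only the initial value 1.35 comes from the config.

```python
import numpy as np

def init_qk_gain(num_heads, qk_gain_init=1.35):
    """Per-head scalar gain on the q.k logits, initialized to 1.35.
    A lower starting gain means softer attention early in training."""
    return np.full(num_heads, qk_gain_init, dtype=np.float32)

def attn_logits(q, k, gain):
    # q, k: (heads, T, d). Smaller gain -> flatter pre-softmax logits.
    # Applying the gain directly to the scaled dot product is an assumption.
    d = q.shape[-1]
    return gain[:, None, None] * (q @ k.transpose(0, 2, 1)) / np.sqrt(d)

gain = init_qk_gain(8)
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4, 16))
k = rng.normal(size=(8, 4, 16))
print(attn_logits(q, k, gain).shape)
```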
Quantization: fp16
  bits: 16
  scope: embeddings
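Exporting the token embeddings in fp16 rather than int8 (noted in the contributions below) doubles their storage but avoids the rounding error of 8-bit quantization. A small comparison sketch; the embedding shape and the per-tensor int8 scheme used for comparison are placeholders, not this run's actual export code.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 64)).astype(np.float32)  # vocab x dim (placeholder sizes)

# fp16 export: 2 bytes/param, no quantization scale to store
emb_fp16 = emb.astype(np.float16)

# int8 export for comparison: 1 byte/param plus a per-tensor scale, coarser rounding
scale = np.abs(emb).max() / 127.0
emb_int8 = np.round(emb / scale).astype(np.int8)
recon_int8 = emb_int8.astype(np.float32) * scale

err_fp16 = np.abs(emb - emb_fp16.astype(np.float32)).max()
err_int8 = np.abs(emb - recon_int8).max()
print(err_fp16, err_int8)
```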
Novel Contributions
- Systematic hyperparameter tuning of the baseline 9x512 architecture
- Reduced NUM_KV_HEADS from 4 to 2
- Adjusted Muon and scalar learning rates
- Shortened warmdown schedule
- Increased Muon momentum
- Lowered QK gain initialization
- Exported token embeddings in fp16 instead of int8
- Focused on throughput-neutral changes to improve BPB under wallclock constraints