PR #2095

open

Non-record: AttnGate + MiniRecur + EMA (1.1613 BPB)

by bharadwaj1098

val_bpb

1.1613

Architecture

Transformer

Optimizer

Muon

Artifact Size

14.87 MB

Training Techniques

Architecture

GQA

Grouped-query attention with 8 attention heads and 4 KV heads.

parameters: {"heads":8,"kv_heads":4}
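As a rough illustration of the 8-head / 4-KV-head layout, the sketch below maps each query head to the KV head it would share. It assumes contiguous grouping (query heads 0 and 1 share KV head 0, and so on), which is the common convention but is not confirmed by the PR summary; `kv_head_for` is a hypothetical helper name.

```python
def kv_head_for(q_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Return the index of the KV head that a given query head reads from.

    Assumes contiguous grouping; the PR's actual layout is not stated.
    """
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # query heads per shared KV head
    return q_head // group_size


mapping = [kv_head_for(q) for q in range(8)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

With these settings the K and V projections are half the size they would be under full multi-head attention, since two query heads share each KV head.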

MLP3x

Transformer with MLP blocks widened by a 3x multiplier.

parameters: {"multiplier":3}

depth recurrence

Mini depth recurrence applied to selected layers, each running twice.

parameters: {"layers":[4,5],"repeats":2}
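One way to read these parameters is as a layer execution schedule in which layers 4 and 5 are each applied twice with the same weights. The sketch below assumes each repeated layer runs back to back (4, 4, 5, 5); the PR may instead loop the pair as a block (4, 5, 4, 5). The total layer count is not given in the summary, so `num_layers=8` is purely a placeholder, and `execution_order` is a hypothetical helper.

```python
def execution_order(num_layers, repeated=(4, 5), repeats=2):
    """List layer indices in the order they would run in the forward pass.

    Assumption: a repeated layer is applied `repeats` times consecutively,
    reusing the same weights. The PR's actual ordering is not stated.
    """
    order = []
    for layer in range(num_layers):
        times = repeats if layer in repeated else 1
        order.extend([layer] * times)
    return order


# num_layers=8 is a hypothetical value used only for illustration.
print(execution_order(8))  # [0, 1, 2, 3, 4, 4, 5, 5, 6, 7]
```

The point of the technique is that the effective depth grows by two layers while the parameter count, and therefore the artifact size, stays unchanged.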

parallel residuals

GPT-J-style parallel residual connections, starting from layer 5.

parameters: {"start_layer":5}
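The difference between the two residual layouts can be sketched with scalar stand-ins for the sublayers. In the GPT-J form, attention and MLP both read the same normalized input and their outputs are summed into the residual stream. Whether `start_layer` is 0-indexed and inclusive is an assumption here, and `block`, `sequential_block`, and `parallel_block` are hypothetical names.

```python
def sequential_block(x, attn, mlp, norm):
    """Standard pre-norm block: the MLP sees the attention output."""
    x = x + attn(norm(x))
    return x + mlp(norm(x))


def parallel_block(x, attn, mlp, norm):
    """GPT-J-style block: attention and MLP share one normalized input."""
    h = norm(x)
    return x + attn(h) + mlp(h)


def block(layer_idx, x, attn, mlp, norm, start_layer=5):
    # Assumption: layers with index >= start_layer use the parallel form.
    if layer_idx >= start_layer:
        return parallel_block(x, attn, mlp, norm)
    return sequential_block(x, attn, mlp, norm)


# Toy scalar "sublayers" so the two forms can be compared by hand.
attn = lambda h: 2 * h
mlp = lambda h: h + 1
norm = lambda h: h

print(block(4, 1.0, attn, mlp, norm))  # 7.0 (sequential)
print(block(5, 1.0, attn, mlp, norm))  # 5.0 (parallel)
```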

Gated Attention

Per-head attention additions including q_norm_scale, k_norm_scale, k_gain, and head_gate.

parameters: {"q_norm_scale":true,"k_norm_scale":true,"k_gain":true,"head_gate":true}
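The summary does not define these four components. As one plausible reading of `head_gate` only, the sketch below scales each head's output by a sigmoid of a learned per-head scalar before the heads are combined. The sigmoid form, the scalar-per-head shape, and the name `apply_head_gate` are all assumptions, not the PR's confirmed implementation.

```python
import math


def apply_head_gate(head_outputs, gate_logits):
    """Scale each head's output vector by sigmoid(gate_logit).

    head_outputs: one list of floats per head.
    gate_logits:  one learned scalar per head (hypothetical parameter).
    """
    if len(head_outputs) != len(gate_logits):
        raise ValueError("need exactly one gate logit per head")
    gated = []
    for out, logit in zip(head_outputs, gate_logits):
        gate = 1.0 / (1.0 + math.exp(-logit))  # value in (0, 1)
        gated.append([gate * v for v in out])
    return gated


# A logit of 0 gives a gate of 0.5, halving that head's contribution.
print(apply_head_gate([[2.0, 4.0]], [0.0]))  # [[1.0, 2.0]]
```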

Optimizer

Muon

weight_decay: 0.025

momentum: null

other_params: {"variant":"MuonEq-R-style","matrix_lr":0.05,"embed_lr":0.8}

Weight Averaging

EMA

parameters: {"decay":0.9965}
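With a decay of 0.9965, each EMA step moves the average by only 0.35% of the gap to the current weight, which is below bf16 resolution for many values. The toy sketch below simulates this in pure Python: it rounds the stored EMA to bf16 or fp32 after every update and shows that the bf16 copy can stall completely. The constant target weight of 1.5 and the helper names are illustrative assumptions, not taken from the PR.

```python
import struct


def to_fp32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round a finite float to bfloat16 (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias
    bits &= 0xFFFF0000                   # keep the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def ema_step(ema, weight, decay, store):
    """One EMA update, with the result rounded to the storage dtype."""
    return store(decay * ema + (1.0 - decay) * weight)


def run(store, steps=1000, decay=0.9965):
    ema = 1.0
    for _ in range(steps):
        ema = ema_step(ema, 1.5, decay, store)  # toy constant weight
    return ema


ema_bf16 = run(to_bf16)
ema_fp32 = run(to_fp32)
print(ema_bf16)  # stays at 1.0: every update rounds back down
print(ema_fp32)  # close to 1.5
```

Near 1.0, adjacent bf16 values are 0.0078125 apart, so an update of 0.00175 is rounded away on every step. Keeping the EMA state in fp32, as the Novel Contributions list describes, avoids this stall.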

Evaluation

sliding window eval

parameters: {"stride":64}
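A sliding-window evaluation with stride 64 can be sketched as a list of overlapping windows in which only tokens not scored by an earlier window count toward the loss, so most tokens are scored with nearly a full context behind them. The evaluation context length is not given in the summary (`eval_length` is null), so `context` below is a hypothetical parameter, as is the helper name `sliding_windows`.

```python
def sliding_windows(n_tokens, context, stride=64):
    """Return (begin, end, score_from) triples covering n_tokens.

    The model would see tokens [begin, end) and be scored only on
    [score_from, end), so each token is scored exactly once.
    """
    if not 0 < stride <= context:
        raise ValueError("need 0 < stride <= context")
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        windows.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows


# Small numbers so the pattern is visible; stride=64 works the same way.
windows = sliding_windows(n_tokens=10, context=4, stride=2)
print(windows)  # [(0, 4, 0), (2, 6, 4), (4, 8, 6), (6, 10, 8)]
```

A smaller stride gives each scored token more context at the cost of more forward passes.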

Sequence Length

sequence_length

train_length: 2048

eval_length: null

LR Schedule

warmdown

parameters: {"warmdown_iters":400,"warmup_steps":20}
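These two parameters suggest a trapezoidal schedule: a short ramp up, a flat phase, then a ramp down over the last 400 iterations. The sketch below assumes both ramps are linear and that warmdown ends at zero. The total step count is not given in the summary, so `total_steps=1000` is a placeholder, and `lr_scale` is a hypothetical helper that returns a multiplier on the base learning rate.

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_iters=400):
    """Multiplier applied to the base learning rate at a given step.

    Assumptions: linear warmup, flat middle, linear warmdown to zero.
    """
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    steps_left = total_steps - step
    if steps_left < warmdown_iters:
        return max(steps_left, 0) / warmdown_iters
    return 1.0


# total_steps=1000 is a hypothetical value used only for illustration.
for step in (0, 19, 500, 800, 1000):
    print(step, lr_scale(step, total_steps=1000))
```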

Quantization

mixed int5/int6/int8

bits: null

scope: MLP/attn/embeddings
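To show what the different bit widths trade off, the sketch below applies symmetric round-to-nearest quantization to a small weight list at 5, 6, and 8 bits. The summary does not say which component gets which width, nor whether scales are per tensor, per row, or per group, so the single per-list scale here is an assumption and `quantize` / `dequantize` are hypothetical helpers.

```python
def quantize(values, bits):
    """Symmetric quantization to signed integers of the given bit width.

    Assumption: one scale for the whole list (per-tensor style).
    Returns (integer codes, scale).
    """
    qmax = 2 ** (bits - 1) - 1  # 15 for int5, 31 for int6, 127 for int8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


weights = [0.5, -1.0, 0.25, 0.03]
for bits in (5, 6, 8):
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(bits, codes, worst)
```

Fewer bits mean a coarser grid and a larger worst-case rounding error, in exchange for a smaller artifact before compression.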

Compression

zstd

level: 22

Novel Contributions

Combination of kv4_both attention-gate stack with mini depth recurrence and EMA.
Systematic attention-component ablation showing head_gate as the most useful addition and k_gain as mostly flat.
Engineering fix for bf16 EMA truncation by storing EMA state in fp32.
3-seed 8×H100 confirmation of a reproducible non-record result with low variance.