val_bpb
1.1613
Architecture
Transformer
Optimizer
Muon
Artifact Size
14.87 MB
Training Techniques
Architecture
GQA
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
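As a rough illustration of what 8 attention heads sharing 4 KV heads means, the sketch below maps each query head to the KV head it would read from. It assumes the common contiguous grouping (adjacent query heads share a KV head); the function name `kv_head_for_query` is hypothetical and not taken from the submission.

```python
def kv_head_for_query(q_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Return the KV head index used by a given query head.

    Assumes contiguous grouping: query heads are split into `kv_heads`
    equally sized groups, and each group shares one KV head.
    """
    if heads % kv_heads != 0:
        raise ValueError("heads must be divisible by kv_heads")
    group_size = heads // kv_heads  # 2 query heads per KV head here
    return q_head // group_size


# Mapping for the configuration listed above (heads=8, kv_heads=4).
mapping = [kv_head_for_query(h) for h in range(8)]
```

With this layout the KV projections and cache are half the size they would be with one KV head per query head.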
MLP3x
MLP blocks widened with a 3x multiplier.
parameters: {"multiplier":3}
depth recurrence
Mini depth recurrence applied to selected layers (4 and 5), each of which runs twice.
parameters: {"layers":[4,5],"repeats":2}
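One way to read these parameters is as an expanded execution order in which the selected layers are visited more than once with the same weights. The sketch below assumes each selected layer is repeated in place (4, 4, 5, 5); looping the pair as a block (4, 5, 4, 5) is an equally plausible reading that the card does not settle. The total layer count `num_layers` is a hypothetical value chosen for illustration.

```python
from typing import List, Sequence


def execution_order(num_layers: int,
                    layers: Sequence[int] = (4, 5),
                    repeats: int = 2) -> List[int]:
    """Expand a layer stack so that selected layers run `repeats` times.

    Assumption: each selected layer is repeated in place, reusing its weights.
    """
    selected = set(layers)
    order: List[int] = []
    for idx in range(num_layers):
        count = repeats if idx in selected else 1
        order.extend([idx] * count)
    return order


# num_layers=8 is a made-up depth, not a value from the card.
order = execution_order(num_layers=8)
```

The effective depth grows by two forward passes while the parameter count, and therefore the artifact size, stays unchanged.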
parallel residuals
GPT-J-style parallel residual connections, starting from layer 5.
parameters: {"start_layer":5}
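The difference between a sequential and a GPT-J-style parallel block can be sketched with toy scalar functions standing in for the attention, MLP, and normalization sublayers. The helper names are hypothetical, and treating `start_layer` as a 0-based index is an assumption.

```python
from typing import Callable

Fn = Callable[[float], float]


def uses_parallel_residual(layer_idx: int, start_layer: int = 5) -> bool:
    """Assumes 0-based layer indices; layers >= start_layer are parallel."""
    return layer_idx >= start_layer


def block_forward(x: float, attn: Fn, mlp: Fn, norm: Fn, parallel: bool) -> float:
    if parallel:
        # Parallel form: both branches read the same normalized input
        # and their outputs are added to the residual stream together.
        h = norm(x)
        return x + attn(h) + mlp(h)
    # Sequential form: the MLP sees the output of the attention sublayer.
    x = x + attn(norm(x))
    return x + mlp(norm(x))


# Toy stand-ins for the sublayers, chosen only to make the arithmetic visible.
double = lambda h: 2.0 * h
triple = lambda h: 3.0 * h
identity = lambda h: h

parallel_out = block_forward(1.0, double, triple, identity, parallel=True)
sequential_out = block_forward(1.0, double, triple, identity, parallel=False)
```

In the parallel form the two branches do not depend on each other, so they can be computed concurrently.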
Gated Attention
Per-head attention additions including q_norm_scale, k_norm_scale, k_gain, and head_gate.
parameters: {"q_norm_scale":true,"k_norm_scale":true,"k_gain":true,"head_gate":true}
Optimizer
Muon
weight_decay: 0.025
momentum: null
other_params: {"variant":"MuonEq-R-style","matrix_lr":0.05,"embed_lr":0.8}
Weight Averaging
EMA
parameters: {"decay":0.9965}
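For reference, a minimal sketch of a weight EMA with this decay is shown below, assuming the standard update `ema = decay * ema + (1 - decay) * weight` applied once per step. The function name and the flat-list representation of weights are illustrative only.

```python
from typing import List


def ema_update(ema: List[float], weights: List[float],
               decay: float = 0.9965) -> List[float]:
    """One EMA step over a flat list of parameters (standard form, assumed)."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


# Rough averaging horizon implied by the decay: about 1 / (1 - decay) steps.
horizon = 1.0 / (1.0 - 0.9965)

# A single step from zero toward a weight of 1.0 moves the average by (1 - decay).
first_step = ema_update([0.0], [1.0])
```

A decay of 0.9965 corresponds to an averaging horizon of roughly 286 steps.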
Evaluation
sliding window eval
parameters: {"stride":64}
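A common way to implement strided evaluation is to slide a fixed-length context over the token stream and score only the tokens that the previous window did not already cover. The sketch below follows that scheme; the window length is a free parameter here because the card lists `eval_length` as null, and the function name is hypothetical.

```python
from typing import List, Tuple


def sliding_windows(n_tokens: int, window: int,
                    stride: int = 64) -> List[Tuple[int, int, int]]:
    """Return (begin, end, score_from) spans.

    Each window feeds tokens [begin, end) to the model as context but only
    counts the loss on tokens [score_from, end), so every token is scored
    exactly once, with as much left context as the window allows.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans


# Small numbers so the spans are easy to inspect by hand.
spans = sliding_windows(n_tokens=10, window=4, stride=2)
scored = sum(end - score_from for _, end, score_from in spans)
```

A smaller stride gives each scored token more context at the cost of more forward passes.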
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":400,"warmup_steps":20}
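The schedule parameters can be illustrated with a learning-rate multiplier that ramps up over the warmup steps, holds at 1.0, and decays over the final warmdown iterations. The linear shapes and the `total_steps` value are assumptions; the card gives only the two step counts.

```python
def lr_scale(step: int, total_steps: int,
             warmup_steps: int = 20, warmdown_iters: int = 400) -> float:
    """Multiplier applied to the base learning rate at a given step.

    Assumes linear warmup, a flat plateau, and linear warmdown to zero.
    """
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    warmdown_start = total_steps - warmdown_iters
    if step >= warmdown_start:
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0


# total_steps=1000 is a hypothetical run length used only for this example.
samples = {s: lr_scale(s, total_steps=1000) for s in (0, 19, 500, 800, 1000)}
```

The multiplier would scale each parameter group's base rate, for example the `matrix_lr` and `embed_lr` values listed under the optimizer.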
Quantization
mixed int5/int6/int8
bits: null
scope: MLP/attn/embeddings
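To make the bit widths concrete, the sketch below quantizes a single row of weights to a signed integer grid of 5, 6, or 8 bits. Symmetric per-row absmax scaling is an assumption, and the card does not state which tensors receive which bit width, so no such mapping is shown.

```python
from typing import List, Tuple


def quantize_row(row: List[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric quantization of one row to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1           # 15 for int5, 31 for int6, 127 for int8
    amax = max(abs(v) for v in row)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return q, scale


def dequantize_row(q: List[int], scale: float) -> List[float]:
    return [v * scale for v in q]


row = [0.1, -0.5, 0.25, 1.0]  # made-up weights for illustration
results = {}
for bits in (5, 6, 8):
    q, scale = quantize_row(row, bits)
    restored = dequantize_row(q, scale)
    max_err = max(abs(a - b) for a, b in zip(row, restored))
    results[bits] = (q, scale, max_err)
```

The worst-case rounding error per element is half a quantization step, so lower bit widths trade accuracy for a smaller artifact.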
Compression
zstd
level: 22
Novel Contributions
- Combination of the kv4_both attention-gate stack with mini depth recurrence and EMA.
- Systematic attention-component ablation showing head_gate as the most useful addition and k_gain as mostly flat.
- Engineering fix for bf16 EMA truncation by storing EMA state in fp32.
- 3-seed 8×H100 confirmation of a reproducible non-record result with low variance.
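The bf16 EMA issue noted in the list above can be reproduced in a toy setting. With a decay of 0.9965, each update moves the average by about 0.35% of the gap to the current weight, which can fall below half the spacing between adjacent bf16 values, so the stored average never changes. The sketch simulates bf16 and fp32 storage with the standard library; the starting value, target, and step count are made up, and round-to-nearest-even is assumed for the bf16 conversion.

```python
import struct


def round_to_bf16(x: float) -> float:
    """Round a value to bfloat16 precision (round-to-nearest-even, assumed)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # rounding bias on the dropped bits
    bits &= 0xFFFF0000                    # keep sign, exponent, 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def round_to_fp32(x: float) -> float:
    """Round a Python float (double) to float32 precision."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


def run_ema(round_fn, start=0.5, target=1.0, decay=0.9965, steps=1000):
    """Run an EMA toward a constant target, storing state via `round_fn`."""
    ema = round_fn(start)
    for _ in range(steps):
        ema = round_fn(decay * ema + (1.0 - decay) * target)
    return ema


ema_bf16 = run_ema(round_to_bf16)  # stored in simulated bf16
ema_fp32 = run_ema(round_to_fp32)  # stored in simulated fp32
```

In this toy run the bf16 average stays pinned at its starting value while the fp32 average tracks the target, which is consistent with keeping EMA state in fp32.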