PR #2071
Record: SP8192 + Headwise Gate + EMA 0.990 + Small Batch (1.0066 BPB, 3-seed)
by jamesEmerson112
val_bpb
1.0066
Architecture
Transformer
Optimizer
MuonEq-R
Artifact Size
~15.97 MB
Training Techniques
Architecture
Gated Attention
Per-head sigmoid gate applied to the attention output after FA3+XSA; the Q projection is widened by gate_dim to produce the gate logits.
parameters: {"gate_dim":null,"per_head":true}
weight tying
Tied input and output embeddings.
parameters: null
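In PyTorch this is the usual tied-embedding pattern (class and names illustrative):

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Input embedding and output head share one weight matrix."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # weight tying
```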
depth recurrence
Layers repeated in an encoder/decoder recurrence pattern.
parameters: {"layers":[3,4,5]}
XSA
Uses XSA attention across all layers.
parameters: {"layers":11}
Partial RoPE
Applies RoPE to a subset of dimensions.
parameters: {"dimensions":"16/64"}
LeakyReLU
Uses LeakyReLU squared in the MLP.
parameters: {"squared":true,"slope":0.5}
Weight Averaging
EMA
parameters: {"decay":0.99}
Optimizer
MuonEq-R
weight_decay: 0.095
momentum: null
other_params: null
AdamW
weight_decay: 0.02
momentum: null
other_params: {"scope":"embeddings/scalars"}
Quantization
GPTQ
bits: 6
scope: matrices and embeddings
GPTQ
bits: 6
scope: token embeddings
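The record uses GPTQ; as a much simpler stand-in, the 6-bit storage format can be illustrated with plain round-to-nearest groupwise quantization (GPTQ additionally compensates rounding error using second-order statistics of the calibration activations):

```python
import torch

def quantize_rtn_6bit(w: torch.Tensor, group_size: int = 128):
    """Round-to-nearest symmetric 6-bit groupwise quantization (illustrative only).

    Assumes w is (out_features, in_features) with in_features divisible by group_size.
    """
    out_f, in_f = w.shape
    w_g = w.reshape(out_f, in_f // group_size, group_size)
    scale = w_g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 31.0  # 6-bit symmetric range
    q = torch.clamp(torch.round(w_g / scale), -31, 31)
    return q.to(torch.int8), scale  # packed storage would use 6 bits per value
```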
Compression
Brotli
level: 11
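Compressing the serialized artifact is a one-liner with the Python brotli bindings (file paths illustrative):

```python
import brotli

with open("checkpoint.bin", "rb") as f:
    data = f.read()
with open("checkpoint.bin.br", "wb") as f:
    f.write(brotli.compress(data, quality=11))
```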
Test-Time Training
LoRA TTT
parameters: {"rank":96,"learning_rate":0.0001,"weight_decay":1}
Evaluation
score-first TTT
parameters: {"eval_seq_len":2048,"chunk_based":true}
Sequence Length
sequence_length
train_length: 196608
eval_length: 2048
Regularization
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: null
Novel Contributions
- Headwise gated attention with per-head sigmoid gating after FA3+XSA
- EMA decay 0.990 found to be optimal for the limited training window
- Small batch training (196K tokens) to increase optimizer steps within the same wall-clock budget
- 6-bit embedding quantization to fit the full technique stack under the 16 MB limit