PR #2071
Record: SP8192 + Headwise Gate + EMA 0.990 + Small Batch (1.0066 BPB, 3-seed)
by jamesEmerson112
val_bpb
1.0066
Architecture
Transformer
Optimizer
MuonEq-R
Artifact Size
~15.97 MB
Training Techniques
Architecture
Gated Attention
Per-head sigmoid gate applied to the attention output after FA3+XSA; the Q projection is widened by gate_dim to produce the gate logits.
parameters: {"gate_dim":null,"per_head":true}
weight tying
Tied input and output embeddings.
parameters: null
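In PyTorch this is the usual tied-embedding pattern (class and names illustrative):

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Input embedding and output head share one weight matrix."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # weight tying
```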
depth recurrence
Layers repeated in an encoder/decoder recurrence pattern.
parameters: {"layers":[3,4,5]}
XSA
Uses XSA attention across all layers.
parameters: {"layers":11}
Partial RoPE
Applies RoPE to a subset of dimensions.
parameters: {"dimensions":"16/64"}
LeakyReLU
Uses LeakyReLU squared in the MLP.
parameters: {"squared":true,"slope":0.5}
Weight Averaging
EMA
parameters: {"decay":0.99}
Optimizer
MuonEq-R
weight_decay: 0.095
momentum: null
other_params: null
AdamW
weight_decay: 0.02
momentum: null
other_params: {"scope":"embeddings/scalars"}
Quantization
GPTQ
bits: 6
scope: matrices and embeddings
GPTQ
bits: 6
scope: token embeddings
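The record uses GPTQ; as a much simpler stand-in, the 6-bit storage format can be illustrated with plain round-to-nearest groupwise quantization (GPTQ additionally compensates rounding error using second-order statistics of the calibration activations):

```python
import torch

def quantize_rtn_6bit(w: torch.Tensor, group_size: int = 128):
    """Round-to-nearest symmetric 6-bit groupwise quantization (illustrative only).

    Assumes w is (out_features, in_features) with in_features divisible by group_size.
    """
    out_f, in_f = w.shape
    w_g = w.reshape(out_f, in_f // group_size, group_size)
    scale = w_g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 31.0  # 6-bit symmetric range
    q = torch.clamp(torch.round(w_g / scale), -31, 31)
    return q.to(torch.int8), scale  # packed storage would use 6 bits per value
```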
Compression
Brotli
level: 11
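Compressing the serialized artifact is a one-liner with the Python brotli bindings (file paths illustrative):

```python
import brotli

with open("checkpoint.bin", "rb") as f:
    data = f.read()
with open("checkpoint.bin.br", "wb") as f:
    f.write(brotli.compress(data, quality=11))
```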
Test-Time Training
LoRA TTT
parameters: {"rank":96,"learning_rate":0.0001,"weight_decay":1}
Evaluation
score-first TTT
parameters: {"eval_seq_len":2048,"chunk_based":true}
Sequence Length
sequence_length
train_length: 196608
eval_length: 2048
Regularization
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: null
Novel Contributions
- Headwise gated attention with per-head sigmoid gating after FA3+XSA
- EMA decay 0.990 found to be optimal for the limited training window
- Small batch training (196K tokens) to increase optimizer steps within the same wall-clock budget
- 6-bit embedding quantization to fit the full technique stack under the 16 MB limit