PR #2005
Record: SP8192 + Headwise Gated Attention + Legal TTT (1.0805 BPB, 3-seed)
by jamesEmerson112
val_bpb
1.0805
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.74 MB
Training Techniques
Architecture
weight tying
Tied input and output embeddings.
parameters: null
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
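A minimal sketch of the head grouping implied by these parameters: with 8 query heads and 4 KV heads, each KV head is shared by a contiguous group of 2 query heads. The mapping function is illustrative, not the PR's implementation.

```python
# Grouped query attention head mapping (from the record: 8 query heads, 4 KV heads).
HEADS = 8
KV_HEADS = 4
GROUP_SIZE = HEADS // KV_HEADS  # each KV head serves 2 query heads

def kv_head_for(query_head: int) -> int:
    """Return the index of the KV head whose keys/values this query head attends with."""
    return query_head // GROUP_SIZE

# Query heads 0-1 share KV head 0, heads 2-3 share KV head 1, and so on.
mapping = {h: kv_head_for(h) for h in range(HEADS)}
```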
depth recurrence
Recurrent reuse of layers 3-5 to create virtual depth.
parameters: {"layers":[3,4,5]}
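One plausible reading of this entry, sketched below: the block of layers 3-5 is executed more than once per forward pass, adding virtual depth without new parameters. The total layer count and the recurrence count of 2 are assumptions; the record only names layers 3-5.

```python
NUM_LAYERS = 8  # hypothetical total depth
PASSES = 2      # hypothetical recurrence count; the record does not state it

def layer_schedule():
    """Sequence of layer indices executed in one forward pass:
    layers 3-5 run as a repeated block, all other layers run once."""
    schedule = list(range(0, 3))            # layers 0-2 once
    schedule += [3, 4, 5] * PASSES          # recurrent block repeated
    schedule += list(range(6, NUM_LAYERS))  # remaining layers once
    return schedule
```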
U-Net skip connections
Sigmoid-gated skip connections bridging encoder and decoder paths.
parameters: null
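A sketch of one sigmoid-gated skip connection under assumed semantics: an encoder activation is blended into the decoder path through a learned sigmoid gate. A scalar gate logit is used here for illustration; the gate's actual shape is not stated in the record.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated_skip(decoder_x, encoder_x, gate_logit: float):
    """Add a sigmoid-gated encoder activation to the decoder stream.
    gate_logit is a learned scalar in this sketch (an assumption)."""
    g = sigmoid(gate_logit)
    return [d + g * e for d, e in zip(decoder_x, encoder_x)]
```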
XSA
Exclusive Self-Attention applied across all layers.
parameters: null
Partial RoPE
Rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
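The 16/64 split can be sketched as rotating only the first 16 of 64 head dimensions and passing the rest through unchanged (NoPE on the remainder). The rotation layout (interleaved pairs) and base frequency are standard RoPE conventions assumed here, not details from the PR.

```python
import math

ROPE_DIMS = 16  # from the record: rotary applied to 16 of 64 head dims
HEAD_DIM = 64

def partial_rope(x, pos, base=10000.0):
    """Apply rotary position embedding to the first ROPE_DIMS dims of a
    head vector x (length HEAD_DIM); remaining dims are unrotated."""
    out = list(x)
    half = ROPE_DIMS // 2
    for i in range(half):
        theta = pos / (base ** (2 * i / ROPE_DIMS))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```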
LeakyReLU
LeakyReLU-squared (LeakyReLU²) MLP activation.
parameters: {"activation":"LeakyReLU^2"}
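One plausible reading of LeakyReLU², sketched below: apply LeakyReLU, then square. The negative-slope value is an assumption, and the actual implementation may square in a sign-preserving way; the record does not say.

```python
def leaky_relu_sq(x: float, slope: float = 0.01) -> float:
    """LeakyReLU followed by squaring (the slope value is an assumption)."""
    y = x if x >= 0 else slope * x
    return y * y
```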
Gated Attention
Headwise gated attention with a post-attention sigmoid gate per head.
parameters: {"gate_dim":null}
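A minimal sketch of the headwise gate: each head's attention output is scaled by its own sigmoid gate before the output projection. Per-head scalar gate logits are used here for illustration; the record leaves gate_dim unspecified.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def headwise_gate(head_outputs, gate_logits):
    """Scale each head's post-attention output by its own sigmoid gate.
    head_outputs: list of per-head vectors; gate_logits: one scalar per head
    (a simplifying assumption about the gate's shape)."""
    return [[sigmoid(g) * v for v in head]
            for head, g in zip(head_outputs, gate_logits)]
```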
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 7
scope: embeddings
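For intuition about the bit widths above, here is a plain round-to-nearest b-bit quantizer as a simplified stand-in: GPTQ proper adds Hessian-aware error compensation column by column, which this sketch omits. Symmetric per-tensor scaling is also an assumption.

```python
def quantize_rtn(weights, bits):
    """Round-to-nearest symmetric b-bit quantization (a stand-in, NOT GPTQ:
    GPTQ additionally compensates rounding error using second-order
    information). Returns (dequantized weights, integer codes)."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 31 for 6 bits
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return [v * scale for v in q], q
```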
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"row_normalized":true,"newton_schulz_steps":5}
AdamW
weight_decay: 0.02
momentum: null
other_params: {"scope":"embeddings and scalars"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
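The EMA update with the recorded decay is a one-liner per parameter; this sketch applies it to a flat list of weights.

```python
def ema_update(avg, new, decay=0.9965):
    """Exponential moving average of weights: avg <- decay*avg + (1-decay)*new.
    decay=0.9965 is the value recorded in this entry."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, new)]
```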
Evaluation
sliding window eval
parameters: {"stride":64}
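A sketch of stride-64 sliding-window evaluation under common semantics: each window advances by the stride and only the tokens past the previous scoring point are scored, so every scored token gets near-full left context. The window length is an assumption; only the stride comes from the record.

```python
def eval_windows(n_tokens, window=1024, stride=64):
    """Return (start, end, score_from) triples: score tokens in
    [score_from, end) using context [start, end). window=1024 is an
    assumed context length; stride=64 is from the record."""
    windows = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        windows.append((start, end, pos))
        pos = end
    return windows
```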
Test-Time Training
score-first TTT
parameters: {"chunk_size":32000,"epochs_per_chunk":3,"learning_rate":0.005,"momentum":0.9}
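A sketch of the score-first loop under assumed semantics: each 32k-token chunk is scored with the current weights before the model trains on it, so evaluation never sees weights already fit to the scored text. `score` and `train_step` are hypothetical stand-ins for model calls.

```python
CHUNK_SIZE = 32000        # from the record
EPOCHS_PER_CHUNK = 3      # from the record

def ttt_schedule(tokens, score, train_step):
    """Score each chunk first, then adapt on it for EPOCHS_PER_CHUNK epochs.
    (lr=0.005, momentum=0.9 per the record would live inside train_step.)"""
    total_bits = 0.0
    for i in range(0, len(tokens), CHUNK_SIZE):
        chunk = tokens[i:i + CHUNK_SIZE]
        total_bits += score(chunk)          # score first...
        for _ in range(EPOCHS_PER_CHUNK):   # ...then train on the chunk
            train_step(chunk)
    return total_bits
```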
LR Schedule
warmdown
parameters: {"final_fraction":0.72,"lr_end":0}
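A sketch of the warmdown schedule, reading final_fraction=0.72 as the fraction of total steps spent linearly decaying to lr_end=0; that reading is an assumption, since the record does not define the field.

```python
def warmdown_lr(step, total_steps, base_lr, final_fraction=0.72, lr_end=0.0):
    """Hold base_lr, then decay linearly to lr_end over the final
    final_fraction of training steps (interpretation assumed)."""
    decay_steps = round(total_steps * final_fraction)
    start = total_steps - decay_steps
    if step < start:
        return base_lr
    t = (step - start) / decay_steps
    return base_lr + t * (lr_end - base_lr)
```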
Regularization
logit softcap
parameters: {"value":30}
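The softcap value of 30 suggests the common tanh form, sketched below; the tanh formulation is an assumption about this PR's exact formula.

```python
import math

def softcap(logit: float, cap: float = 30.0) -> float:
    """Smoothly bound a logit to (-cap, cap): cap * tanh(logit / cap).
    Near zero this is approximately the identity; large logits saturate."""
    return cap * math.tanh(logit / cap)
```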
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
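The per-layer scale in this entry is a direct formula; the sketch assumes 0-indexed layers, which the record does not state.

```python
import math

def ln_scale(layer: int) -> float:
    """Per-layer LN scale 1/sqrt(layer+1), damping deeper layers'
    contribution (layer is 0-indexed here; an assumption)."""
    return 1.0 / math.sqrt(layer + 1)
```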
Novel Contributions
- Headwise gated attention: a post-attention sigmoid gate applied per head after FA3+XSA.
- A systematic survey of 29 papers and an ablation study of 40+ experiments on techniques transferable to the 36M-parameter regime.
- Discovery of an EMA decay scaling law at short training durations.
- Documentation of multiple negative-result techniques that failed to transfer at Parameter Golf scale.