PR #1880
open
Record: AttnOutGate + SmearGate + Softcap 15 — val_bpb 1.07750 (3-seed mean)
by Meirzhan05
val_bpb
1.0775
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.99 MB
Training Techniques
Architecture
AttnOutGate
Per-head data-dependent gate on SDPA output before output projection.
parameters: {"heads":8,"width":12,"layers":11}
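A minimal sketch of a per-head output gate of the kind described above, in plain Python. The gate's input (the first `width` channels of the token's features), the sigmoid nonlinearity, and the `gate_weights` layout are assumptions made for illustration; the PR's actual parameterization may differ.

```python
import math

def attn_out_gate(x, head_outputs, gate_weights, width=12):
    """Scale each head's attention output by a gate computed from the token's input.

    x:            token input features (list of floats, length >= width)
    head_outputs: one output vector per head (list of lists of floats)
    gate_weights: one weight vector of length `width` per head (hypothetical layout)
    """
    gate_in = x[:width]  # assumption: the gate reads only the first `width` channels
    gated = []
    for head_out, w in zip(head_outputs, gate_weights):
        logit = sum(wi * xi for wi, xi in zip(w, gate_in))
        gate = 1.0 / (1.0 + math.exp(-logit))  # sigmoid, so the gate lies in (0, 1)
        gated.append([gate * v for v in head_out])
    return gated  # the gated heads would then go to the output projection
```

With 8 heads and a width of 12 this adds 96 gate weights per layer under these assumptions.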
SmearGate
Forward-1-token residual mixer at the embedding lane with BOS masking to prevent cross-document leakage.
parameters: {"width":12}
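The mixer above can be sketched as follows. This is an illustration only: the gate is assumed to be a sigmoid of a linear function of the current token's first `width` embedding channels, and `gate_w` and `bos_id` are hypothetical names. What the sketch is meant to show is the BOS mask, which stops the last token of one document from being mixed into the first token of the next.

```python
import math

def smear_gate(embeds, token_ids, gate_w, bos_id, width=12):
    """Add a gated copy of the previous token's embedding to each token's embedding."""
    out = [list(embeds[0])]  # position 0 has no predecessor
    for t in range(1, len(embeds)):
        cur, prev = embeds[t], embeds[t - 1]
        if token_ids[t] == bos_id:
            # BOS mask: a new document starts here, so nothing is carried over.
            out.append(list(cur))
            continue
        logit = sum(w * c for w, c in zip(gate_w, cur[:width]))
        gate = 1.0 / (1.0 + math.exp(-logit))  # assumed sigmoid gate
        out.append([c + gate * p for c, p in zip(cur, prev)])
    return out
```

The previous token's original embedding is used rather than its already-mixed value, so information travels forward by exactly one position.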
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
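With 8 query heads and 4 KV heads, each KV head serves two query heads. The helper below illustrates one common mapping, in which contiguous query heads share a KV head; the grouping convention is an assumption, not something the PR summary states.

```python
def kv_head_for(query_head, heads=8, kv_heads=4):
    """Return the index of the KV head that a given query head attends with."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # query heads per KV head (2 here)
    return query_head // group_size
```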
depth recurrence
Layers 3-5 are looped multiple times to create virtual layers.
parameters: {"layers":[3,4,5],"loops":3,"virtual_layers":17}
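The virtual layer count follows from the parameters: 11 physical layers, with the 3-layer block run 3 times instead of once, gives 11 + 3 × (3 − 1) = 17. The sketch below spells out one possible execution order. It assumes 0-indexed layers and that the block is repeated as a unit; the summary does not say which convention the PR uses.

```python
def virtual_layer_order(num_layers=11, looped=(3, 4, 5), loops=3):
    """Return the sequence of physical layer indices visited in one forward pass."""
    before = list(range(0, looped[0]))
    after = list(range(looped[-1] + 1, num_layers))
    return before + list(looped) * loops + after
```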
XSA
XSA applied across all layers.
parameters: {"layers":11}
U-Net skip connections
U-Net style skip connections with learnable gates.
parameters: null
LeakyReLU
MLP uses LeakyReLU squared activation.
parameters: {"slope":0.5}
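As a scalar illustration, assuming "LeakyReLU squared" means squaring the LeakyReLU output (so negative inputs also produce a positive activation):

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU with the given negative slope, followed by squaring."""
    y = x if x >= 0 else slope * x
    return y * y
```

With a slope of 0.5, a negative input contributes a quarter of what a positive input of the same magnitude does.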
Regularization
logit softcap
parameters: {"cap":15}
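A logit softcap is commonly implemented as a scaled tanh, which is the form assumed here; the summary gives only the cap value. Small logits pass through almost unchanged and large ones saturate at ±cap.

```python
import math

def softcap(logit, cap=15.0):
    """Smoothly bound a logit to the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)
```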
Weight Averaging
EMA
parameters: {"decay":0.9965}
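The standard EMA update with this decay, shown on a flat list of weights for illustration:

```python
def ema_update(ema, weights, decay=0.9965):
    """Move the averaged weights a fraction (1 - decay) toward the current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]
```

With a decay of 0.9965, each step blends in 0.35% of the current weights.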
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"5_step_NS":true}
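A pure-Python sketch of a 5-step Newton-Schulz orthogonalization, for illustration only. The quintic coefficients are those of the public Muon reference implementation and are an assumption here, since the summary records only that 5 steps are used. The helper names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def newton_schulz(g, steps=5, coeffs=(3.4445, -4.7750, 2.0315), eps=1e-7):
    """Push the singular values of g toward 1 without computing an SVD."""
    a, b, c = coeffs  # assumed coefficients, not taken from the PR summary
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]  # Frobenius-normalize first
    for _ in range(steps):
        xxt = matmul(x, transpose(x))
        xxt2 = matmul(xxt, xxt)
        poly = [[b * p + c * q for p, q in zip(rp, rq)]
                for rp, rq in zip(xxt, xxt2)]
        px = matmul(poly, x)
        # x <- a*x + (b*(x x^T) + c*(x x^T)^2) x
        x = [[a * xv + pv for xv, pv in zip(rx, rp)] for rx, rp in zip(x, px)]
    return x
```

The result is only approximately orthogonal: with these coefficients the singular values land in a band around 1 rather than converging to it exactly.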
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeds/scalars"}
SGD
weight_decay: null
momentum: 0.9
other_params: {"used_for":"TTT"}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"chunk_size":32000}
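The ordering that "score-first" implies can be sketched as below: each chunk is scored with weights that have not yet been trained on it, and only afterwards used for adaptation. `score_fn` and `train_fn` are hypothetical callables standing in for the model's evaluation and for one training epoch (SGD with learning rate 0.005 and momentum 0.9, per the parameters listed here); `score_fn` is assumed to return the summed loss over the chunk.

```python
def score_first_ttt(tokens, score_fn, train_fn, chunk_size=32000, epochs=3):
    """Return the mean per-token loss under score-then-adapt test-time training."""
    total_loss, total_count = 0.0, 0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # 1) Score first, so the reported loss never benefits from training on this chunk.
        total_loss += score_fn(chunk)
        total_count += len(chunk)
        # 2) Then adapt on the already-scored chunk.
        for _ in range(epochs):
            train_fn(chunk)
    return total_loss / total_count
```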
Quantization
GPTQ
bits: null
scope: full model
Compression
lzma
level: null
brotli
level: 11
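As a rough illustration of measuring a compressed artifact, the sketch below uses Python's standard-library `lzma` module with its default preset, since no lzma level is recorded. Brotli is not in the standard library and is left out. The helper name and the payload are hypothetical.

```python
import lzma

def compressed_size_bytes(payload: bytes) -> int:
    """Return the size in bytes of the payload after lzma compression (default preset)."""
    return len(lzma.compress(payload))
```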
Evaluation
sliding window eval
parameters: null
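No parameters are recorded for the sliding-window evaluation, so the sketch below uses hypothetical `window` and `stride` arguments. It shows the usual scheme: windows overlap, and each one scores only the tokens no earlier window has scored, so most tokens are scored with close to a full window of context.

```python
def sliding_windows(num_tokens, window, stride):
    """Return (ctx_start, score_start, end) triples; tokens in [score_start, end) are scored."""
    if not 0 < stride <= window:
        raise ValueError("need 0 < stride <= window")
    spans = []
    scored_upto = 0
    while scored_upto < num_tokens:
        # The first window scores everything it covers; later ones score `stride` new tokens.
        step = window if scored_upto == 0 else stride
        end = min(scored_upto + step, num_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, scored_upto, end))
        scored_upto = end
    return spans
```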
LR Schedule
warmdown
parameters: {"warmdown_pct":72}
linear warmup
parameters: {"warmup_steps":20}
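The two schedule entries can be combined into a single learning-rate multiplier. The sketch assumes a step-based schedule, a linear decay to zero over the final 72% of training, and a hypothetical `total_steps` argument; the actual run may define the warmdown differently (for example by wall-clock time).

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_pct=72):
    """Return the multiplier applied to the base learning rate at a given step."""
    warmdown_steps = total_steps * warmdown_pct // 100
    warmdown_start = total_steps - warmdown_steps
    if step < warmup_steps:
        return (step + 1) / warmup_steps  # linear warmup
    if warmdown_steps > 0 and step >= warmdown_start:
        return max(0.0, (total_steps - step) / warmdown_steps)  # linear warmdown
    return 1.0
```

For a 1000-step run under these assumptions, the multiplier reaches 1.0 at step 19, holds until step 280, and then decays linearly to 0 at step 1000.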
Novel Contributions
- AttnOutGate per-head gating on attention outputs
- SmearGate forward-1-token residual mixing with BOS masking
- Lower logit softcap from 30 to 15
- Three additive zero-cost modifications combined to reach a new record