PR #2134

open

Record candidate: PR #2130 base + GATE_WINDOW=8

by codemath3000
val_bpb: 1.0567
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16MB

Training Techniques

Architecture
SmearGate
BOS-fixed SmearGate with SparseAttnGate; this submission changes the SparseAttnGate window from 12 to 8.
parameters: {"gate_window":8,"scale":0.5}
XSA
XSA applied across all layers.
parameters: {"layers":11}
Partial RoPE
Partial rotary positional embeddings.
parameters: {"dimensions":16}
depth recurrence
Layers 3-5 are looped recurrently.
parameters: {"layers":[3,4,5],"frac":0.35}
Gated Attention
Sparse attention gating used in the model.
parameters: {"enabled":true}
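For readers unfamiliar with the Partial RoPE entry above, here is a minimal sketch of rotating only the first 16 dimensions of a per-head vector and passing the rest through unchanged. The function name `partial_rope`, the frequency base of 10000, the half-split pairing of dimensions, and the head size of 64 are all assumptions for illustration; the PR summary only states `"dimensions":16`, and the submission's actual layout may differ.

```python
import math

def partial_rope(vec, position, rot_dims=16, base=10000.0):
    """Rotate the first rot_dims entries of vec; leave the rest untouched.

    Assumption: dimension i is paired with dimension i + rot_dims // 2
    (half-split layout). Interleaved pairing is an equally common choice.
    """
    half = rot_dims // 2
    out = list(vec)
    for i in range(half):
        # Each pair gets its own rotation frequency.
        theta = position * base ** (-2.0 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

# Hypothetical 64-dimensional head vector at token position 5.
head = [1.0] * 64
rotated = partial_rope(head, position=5)
```

Only the first 16 entries of `rotated` depend on the position; the remaining 48 carry no positional signal, which is the point of the partial variant.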
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"lr":0.028,"matrix_params":true}
Adam
weight_decay: null
momentum: null
other_params: {"beta2":0.99,"embedding_and_scalars":true}
Weight Averaging
EMA
parameters: {"decay":0.9965}
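As an illustration of the EMA entry above, the following sketch applies the standard exponential moving average update with the listed decay of 0.9965. The dictionary-of-floats representation and the name `ema_update` are hypothetical stand-ins; a real training loop would apply the same formula to each parameter tensor.

```python
def ema_update(ema, weights, decay=0.9965):
    """Blend the current weights into the running average, in place."""
    for name, w in weights.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * w
    return ema

# Toy example: one scalar "weight" held constant at 1.0.
weights = {"w": 1.0}
ema = {"w": 0.0}
for _ in range(1000):
    ema_update(ema, weights)
# With a constant target of 1.0 and a zero start, ema["w"] equals
# 1 - decay**n after n updates.
```

With a decay of 0.9965, each update moves the average 0.35% of the way toward the current weights, so the average effectively spans the last few hundred steps.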
Quantization
GPTQ
bits: 6
scope: matrices
GPTQ
bits: 7
scope: embeddings
LQER
bits: 4
scope: asymmetric rank-4
Test-Time Training
LoRA TTT
parameters: {"rank":80,"learning_rate":0.00008,"beta2":0.99,"weight_decay":2,"phases":1,"prefix_docs":2500,"score_first":true}
Regularization
logit softcap
parameters: {"asym_logit_rescale":true,"init":30}
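The logit softcap entry above can be sketched as follows, assuming the common tanh form in which logits are squashed smoothly into the range (-cap, cap). The cap of 30 is taken from `"init":30`, which suggests the value may be a learnable parameter initialized at 30; this sketch treats it as a fixed constant and does not model `asym_logit_rescale`, whose definition is not given in the summary.

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap) using tanh.

    Assumption: the tanh form; the submission's exact formula is not
    stated in the PR summary.
    """
    return cap * math.tanh(logit / cap)

small = softcap(1.0)     # nearly unchanged for logits well below the cap
large = softcap(1000.0)  # saturates at the cap
```

Small logits pass through almost linearly, while extreme logits are clamped, which limits how confident any single prediction can become.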
Sequence Length
sequence_length
train_length: null
eval_length: 2560
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85}
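One plausible reading of `warmdown_frac` above is that the learning rate is held constant for the first 15% of training and then decays linearly to zero over the final 85%. The sketch below implements that reading; the linear shape, the decay to exactly zero, the function name `lr_at`, and the step count of 1000 are assumptions, while the base learning rate of 0.028 is the Muon `lr` listed in the Optimizer section.

```python
def lr_at(step, total_steps, base_lr=0.028, warmdown_frac=0.85):
    """Constant learning rate, then linear decay over the final fraction."""
    warmdown_start = round(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - warmdown_start)

# Hypothetical run of 1000 steps: warmdown begins at step 150.
schedule = [lr_at(s, 1000) for s in range(1001)]
```

Under these assumptions the rate is 0.028 through step 150, about 0.014 at step 575 (the midpoint of the warmdown), and 0 at step 1000.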

Novel Contributions

  • Reduced SparseAttnGate window from 12 to 8 as the only change versus PR #2130.
  • Direct ablation of GATE_WINDOW on the PR #2130 stack while keeping the rest of the training and inference pipeline identical.
  • Full validation coverage on the target validation shard with per-seed logs.