PR #2134

open

Record candidate: PR #2130 base + GATE_WINDOW=8

by codemath3000

val_bpb

1.0567

Architecture

Transformer

Optimizer

Muon

Artifact Size

16MB

Training Techniques

Architecture

SmearGate

BOS-fixed SmearGate with SparseAttnGate; submission changes SparseAttnGate window from 12 to 8.

parameters: {"gate_window":8,"scale":0.5}

XSA

XSA applied across all layers.

parameters: {"layers":11}

Partial RoPE

Partial rotary positional embeddings.

parameters: {"dimensions":16}
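
As a rough illustration of what "partial" means here, the sketch below rotates only the first 16 dimensions of a head vector and passes the remaining dimensions through unchanged. The function name, the base of 10000, and the half-split pairing convention are assumptions for illustration and are not taken from the PR.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; leave the rest untouched.

    Assumed convention: entry i is paired with entry i + rotary_dims // 2.
    """
    assert rotary_dims % 2 == 0 and rotary_dims <= len(vec)
    half = rotary_dims // 2
    out = list(vec)
    for i in range(half):
        # Rotation angle for pair i (assumed standard RoPE frequency schedule).
        theta = position * base ** (-2.0 * i / rotary_dims)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * cos_t - b * sin_t
        out[i + half] = a * sin_t + b * cos_t
    return out
```

Because each pair undergoes a plain 2D rotation, the vector norm is preserved, and the dimensions beyond the first 16 carry no positional signal.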

depth recurrence

Layers 3-5 are looped recurrently.

parameters: {"layers":[3,4,5],"frac":0.35}
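
One way to picture the looping is as a layer execution order in which the block 3-5 is visited more than once. The sketch below assumes an 11-layer stack (taken from the XSA entry), 0-based layer indices, and two passes through the looped block; the helper name and the pass count are hypothetical, and the `frac` parameter is not modeled because the card does not say what it controls.

```python
def layer_schedule(num_layers=11, loop_layers=(3, 4, 5), num_loops=2):
    """Return the order in which layers run when `loop_layers` is repeated.

    `num_loops` is an assumed value; the source does not state it.
    """
    order = []
    first, last = loop_layers[0], loop_layers[-1]
    i = 0
    while i < num_layers:
        if i == first:
            # Run the looped block `num_loops` times, reusing the same weights.
            order.extend(list(loop_layers) * num_loops)
            i = last + 1
        else:
            order.append(i)
            i += 1
    return order

print(layer_schedule())  # [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
```

The looped layers add compute depth without adding parameters, which matters when the artifact size is capped.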

Gated Attention

Sparse attention gating used in the model.

parameters: {"enabled":true}

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"lr":0.028,"matrix_params":true}

Adam

weight_decay: null

momentum: null

other_params: {"beta2":0.99,"embedding_and_scalars":true}
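
The two optimizer entries imply a split of the parameters: matrices go to Muon, embeddings and scalars go to Adam. A minimal sketch of such a routing rule follows. The parameter names, the shapes, and the "embed" substring check are hypothetical, and treating 1-D tensors as Adam parameters is an assumption rather than something the card states.

```python
def split_params(shapes):
    """Split parameter names into (muon_names, adam_names).

    `shapes` maps a parameter name to its shape tuple.
    """
    muon, adam = [], []
    for name, shape in shapes.items():
        is_matrix = len(shape) == 2
        is_embedding = "embed" in name  # hypothetical naming convention
        if is_matrix and not is_embedding:
            muon.append(name)   # 2-D weight matrices
        else:
            adam.append(name)   # embeddings, vectors, scalars
    return muon, adam

# Hypothetical parameter table, for illustration only.
example = {
    "tok_embed.weight": (1024, 64),
    "blocks.0.attn.qkv.weight": (64, 192),
    "blocks.0.attn.gate_scale": (),
    "blocks.0.norm.weight": (64,),
}
muon_names, adam_names = split_params(example)
```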

Weight Averaging

EMA

parameters: {"decay":0.9965}
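
For reference, an exponential moving average of the weights with this decay can be sketched as below. The function name and the flat-list representation of the weights are assumptions made to keep the example self-contained.

```python
def ema_update(ema, weights, decay=0.9965):
    """Blend the current weights into the running average, element by element."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

# Usage: start the average at the initial weights, then update after each step.
ema = [0.0, 0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0, 2.0])
```

With a decay of 0.9965, the average has an effective horizon of roughly 1 / (1 - 0.9965) ≈ 286 steps.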

Quantization

GPTQ

bits: 6

scope: matrices

GPTQ

bits: 7

scope: embeddings

LQER

bits: 4

scope: asymmetric rank-4

Test-Time Training

LoRA TTT

parameters: {"rank":80,"learning_rate":0.00008,"beta2":0.99,"weight_decay":2,"phases":1,"prefix_docs":2500,"score_first":true}
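
The LoRA part of this entry can be illustrated with the usual low-rank update, where a frozen weight matrix W is augmented by the product of two small matrices B and A. The sketch below uses plain Python lists and tiny shapes. The helper names and the `scale` argument are assumptions, and the test-time training loop itself (prefix documents, score-first ordering) is not modeled.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, scale=1.0):
    """Compute W x + scale * B (A x).

    W: (out, in) frozen weights, A: (rank, in), B: (out, rank).
    Only A and B would be trained at test time.
    """
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]

# Rank-1 toy example; the submission lists rank 80.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[1.0], [0.0]]
y = lora_forward(W, A, B, [2.0, 3.0])
```

When B is initialized to zeros, as is common for LoRA, the adapted layer starts out identical to the base layer.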

Regularization

logit softcap

parameters: {"asym_logit_rescale":true,"init":30}
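
A common form of logit softcapping squashes each logit through a scaled tanh, so that outputs stay within plus or minus the cap. The sketch below uses the listed initial value of 30 as the cap. It assumes the standard symmetric form and does not model the `asym_logit_rescale` option, whose definition is not given here.

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to the interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)
```

Small logits pass through nearly unchanged (the function is close to the identity near zero), while very large logits saturate at the cap.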

Sequence Length

sequence_length

train_length: null

eval_length: 2560

LR Schedule

warmdown

parameters: {"warmdown_frac":0.85}
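
A warmdown fraction of 0.85 suggests the learning rate is held at its peak for the first 15% of training and then decayed over the remaining 85%. The sketch below assumes a linear decay to zero; the function name and the decay shape are assumptions, not details from the PR.

```python
def lr_multiplier(step, total_steps, warmdown_frac=0.85):
    """Return the factor applied to the base learning rate at `step`."""
    warmdown_start = total_steps * (1.0 - warmdown_frac)
    if step < warmdown_start:
        return 1.0  # constant phase before warmdown begins
    remaining = total_steps - step
    # Linear decay from 1.0 at warmdown_start to 0.0 at total_steps (assumed shape).
    return max(0.0, remaining / (total_steps - warmdown_start))
```

Under this sketch, the effective Muon learning rate at a given step would be 0.028 multiplied by `lr_multiplier(step, total_steps)`.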

Novel Contributions

Reduced SparseAttnGate window from 12 to 8 as the only change versus PR #2130.
Direct ablation of GATE_WINDOW on the PR #2130 stack while keeping the rest of the training and inference pipeline identical.
Full validation coverage on the target validation shard with per-seed logs.