PR #2131

open

Record candidate: PR #2014 base + GATE_WINDOW=8

by codemath3000

val_bpb

1.0576

Architecture

Transformer

Optimizer

Muon

Artifact Size

<=16MB

Training Techniques

Architecture

SmearGate

BOS-fixed smear gate used in the attention stack.

parameters: null

Gated Attention

Sparse attention gate with a configurable recent-token window.

parameters: {"gate_window":8,"scale":0.5}

XSA

XSA applied across all layers.

parameters: {"layers":11}

Partial RoPE

Partial rotary position embeddings.

parameters: {"dimensions":16}
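As a rough illustration of what partial rotary embeddings mean, the sketch below rotates only the first 16 entries of a per-head vector and passes the rest through unchanged. The head width of 64, the frequency base of 10000, and the "rotate-half" pairing are assumptions made for this example; the PR summary only states the number of rotary dimensions.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec` by position-dependent
    angles; entries past `rotary_dims` are returned unchanged.

    `base` and the pairing of entry i with entry i + rotary_dims // 2 are
    assumptions for illustration, not taken from the PR.
    """
    assert rotary_dims % 2 == 0 and rotary_dims <= len(vec)
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        # Each pair gets its own rotation frequency.
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

# Hypothetical 64-wide head vector; only entries 0..15 are rotated.
head = [float(i + 1) for i in range(64)]
rotated = partial_rope(head, position=5)
```

Because each pair is rotated rather than scaled, the vector norm is preserved, and the untouched tail carries position-independent content.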

Depth recurrence

Layers 3-5 are looped with recurrence enabled partway through training.

parameters: {"layers":[3,4,5],"frac":0.35}
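One way to picture the looped layers is as a schedule of layer indices that changes once recurrence switches on. The sketch below assumes zero-based layer indices, an 11-layer stack (matching the XSA entry), that `frac` is the fraction of training after which looping starts, and a hypothetical `num_loops` of 2; none of these interpretations is confirmed by the PR summary.

```python
def layer_order(step, total_steps, num_layers=11,
                loop_layers=(3, 4, 5), frac=0.35, num_loops=2):
    """Return the sequence of layer indices executed at a training step.

    Before `frac` of training has elapsed, every layer runs once. Afterwards,
    the `loop_layers` block is repeated `num_loops` times (hypothetical value).
    """
    if step / total_steps < frac:
        return list(range(num_layers))
    first, last = loop_layers[0], loop_layers[-1]
    order = list(range(first))                 # layers before the loop
    for _ in range(num_loops):
        order.extend(loop_layers)              # repeated block
    order.extend(range(last + 1, num_layers))  # layers after the loop
    return order

early = layer_order(step=0, total_steps=100)
late = layer_order(step=50, total_steps=100)
```

Looping reuses the same weights, so the effective depth grows without increasing the parameter count or the artifact size.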

Parallel decoder

The parallel lane begins at layer 8, and the final lane is averaged.

parameters: {"start_layer":8}

Quantization

GPTQ

bits: 6

scope: matrix weights

int7

bits: 7

scope: embeddings

mixed int6/int8

bits: null

scope: model weights
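To give a sense of what a 6-bit weight budget implies, here is a minimal round-to-nearest sketch with one scale per row. This is deliberately not GPTQ, which additionally compensates rounding error using second-order statistics; the symmetric range and per-row scale are assumptions chosen for simplicity.

```python
def quantize_symmetric(row, bits=6):
    """Round-to-nearest symmetric quantization of one weight row.

    Simplified stand-in for illustration only; GPTQ itself is more involved.
    Returns integer codes in [-qmax, qmax] and the float scale.
    """
    qmax = 2 ** (bits - 1) - 1          # 31 for 6 bits, 63 for 7 bits
    peak = max(abs(x) for x in row)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

codes, scale = quantize_symmetric([0.5, -1.0, 0.25, 0.0], bits=6)
approx = dequantize(codes, scale)
```

With this scheme the reconstruction error per weight is bounded by half the scale, which is why fewer bits are typically reserved for tensors that tolerate coarser steps.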

Optimizer

Muon

weight_decay: 0.5

momentum: null

other_params: {"beta2":0.99,"matrix_lr":0.026,"min_lr":0.1}

Weight Averaging

EMA

parameters: {"decay":0.9965}
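The EMA entry can be read as the standard exponential moving average over model weights. The sketch below shows one update on a flat list of floats; treating the parameters as a flat list and updating once per optimizer step are assumptions for illustration.

```python
def ema_update(avg, params, decay=0.9965):
    """Blend the running average toward the current parameters.

    new_avg = decay * avg + (1 - decay) * params, applied elementwise.
    """
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

# Hypothetical two-parameter model: the average moves 0.35% of the way
# toward the live weights on each update.
avg = [0.0, 1.0]
avg = ema_update(avg, [1.0, 1.0])
```

With a decay of 0.9965, the average has an effective horizon of about 1 / (1 - 0.9965), or roughly 286 updates.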

Compression

pergroup

level: null

Evaluation

stride-based eval

parameters: {"stride":1536}
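A stride-based evaluation is commonly implemented as overlapping windows in which each token is scored exactly once, with later windows reusing earlier tokens only as context. The sketch below assumes that scheme, with the window length taken from the eval_length of 3072 reported in this summary; the PR's actual windowing may differ.

```python
def eval_windows(num_tokens, eval_length=3072, stride=1536):
    """Yield (window_start, window_end, score_start) tuples.

    Tokens in [score_start, window_end) are scored in that window; tokens in
    [window_start, score_start) serve only as context. Assumes
    stride <= eval_length so that no token is skipped.
    """
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + eval_length, num_tokens)
        yield start, end, scored_upto
        scored_upto = end
        start += stride

# Hypothetical 7000-token validation stream.
windows = list(eval_windows(7000))
```

With a stride of half the window length, every scored token after the first window has at least 1536 tokens of preceding context.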

Test-Time Training

LoRA TTT

parameters: {"rank":80,"learning_rate":0.0001,"local_lr_mult":0.75,"mask":"no_qv","short_score_first":true,"short_doc_len":2000,"short_chunk_size":24,"prefix_docs":2500,"num_phases":1}

Sequence Length

sequence_length

train_length: 3072

eval_length: 3072

LR Schedule

warmdown

parameters: {"warmdown_frac":0.85}
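A warmdown schedule of this kind holds the learning rate at its peak and then decays it over the final portion of training. The sketch below assumes a linear decay over the last 85% of steps and interprets the `min_lr` of 0.1 from the optimizer settings as a fraction of the peak rate; both the decay shape and that interpretation are assumptions.

```python
def lr_scale(step, total_steps, warmdown_frac=0.85, min_lr=0.1):
    """Return the multiplier applied to the peak learning rate at `step`.

    Constant at 1.0 until the warmdown begins, then linear down to `min_lr`
    (assumed to be a fraction of the peak) at the final step.
    """
    progress = step / total_steps
    warmdown_start = 1.0 - warmdown_frac
    if progress <= warmdown_start:
        return 1.0
    t = (progress - warmdown_start) / warmdown_frac  # 0 -> 1 across warmdown
    return 1.0 - t * (1.0 - min_lr)

# Multiplier on the matrix_lr of 0.026 at a few points in a 1000-step run.
schedule = [0.026 * lr_scale(s, 1000) for s in (0, 150, 575, 1000)]
```

Under these assumptions the rate stays at its peak for only the first 15% of training and reaches one tenth of the peak at the end.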

Regularization

weight decay

parameters: {"value":0.5}

Novel Contributions

Changes only GATE_WINDOW from 12 to 8 relative to PR #2014.
Keeps the PR #2014 stack otherwise byte-for-byte identical for a clean ablation.
Uses full validation coverage with val_tokens matching target_tokens in all reported seeds.
Applies score-first short-doc phased LoRA TTT on the CaseOps/SP8192 stack.