PR #2136

open

Record candidate: PR #2130 base + GATE_WINDOW=8 + GPTQ_CALIBRATION_BATCHES=32

by codemath3000

val_bpb

1.0567
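The page does not say how val_bpb is computed. A minimal sketch, assuming the conventional definition (summed validation negative log-likelihood in nats, converted to bits and divided by the number of raw bytes); the function name and the example numbers are hypothetical:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) into bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # nats -> bits is a division by ln(2); then normalize by byte count.
    return total_nll_nats / (math.log(2) * total_bytes)

# Hypothetical input: 1000 bytes whose summed loss is exactly 1000 * ln(2) nats,
# i.e. one bit per byte.
example = bits_per_byte(1000 * math.log(2), 1000)
```

Normalizing by bytes rather than tokens keeps the metric comparable across tokenizers.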

Architecture

Transformer

Optimizer

Muon

Artifact Size

—

Training Techniques

Architecture

SmearGate

A fixed BOS smear gate combined with a sparse attention gate.

parameters: {"scale":0.5,"gate_window":8}

XSA

Cross/self-attention style architectural component applied across all layers.

parameters: {"layers":11}

Partial RoPE

Rotary position embeddings applied to only a subset of dimensions; the remaining dimensions carry no positional rotation.

parameters: {"dimensions":16}
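A minimal sketch of partial RoPE, assuming the 16 rotary dimensions are the first 16 entries of each head vector and that adjacent entries are paired. The head size of 64, the base of 10000, and the pairing layout are assumptions, not details from the PR:

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; pass the rest through."""
    if rotary_dims % 2 or rotary_dims > len(vec):
        raise ValueError("rotary_dims must be even and at most len(vec)")
    out = list(vec)
    for i in range(0, rotary_dims, 2):
        # Standard RoPE frequency for the pair starting at index i.
        theta = position * base ** (-i / rotary_dims)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * cos_t - y * sin_t
        out[i + 1] = x * sin_t + y * cos_t
    return out

head = [1.0] * 64  # hypothetical 64-dimensional head vector
rotated = partial_rope(head, position=5)
```

Entries 16 and above are returned unchanged, so only the rotated slice encodes position.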

depth recurrence

Layers 3-5 are looped recurrently.

parameters: {"layers":[3,4,5],"frac":0.35}
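A sketch of what looping layers 3-5 could look like as an execution order. The total of 11 layers is taken from the XSA entry, the loop count of 2 is a hypothetical choice, and the "frac" parameter is not modeled because the page does not define it:

```python
def layer_schedule(num_layers=11, looped=(3, 4, 5), num_loops=2):
    """Return the order in which layer indices run, repeating the looped block."""
    first, last = looped[0], looped[-1]
    order = list(range(first))                 # layers before the loop
    for _ in range(num_loops):
        order.extend(looped)                   # looped block, run num_loops times
    order.extend(range(last + 1, num_layers))  # layers after the loop
    return order

schedule = layer_schedule()
```

Under these assumptions the looped layers reuse the same weights on each pass, which adds depth without adding parameters.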

GQA

Grouped-query attention with fewer KV heads than query heads.

parameters: {"query_heads":8,"kv_heads":4}
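With 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads. A minimal sketch of that mapping, assuming consecutive query heads share a KV head (a common convention, not confirmed by the PR); the function name is hypothetical:

```python
def kv_head_for_query(query_head, query_heads=8, kv_heads=4):
    """Map a query head index to the KV head it reads from."""
    if query_heads % kv_heads:
        raise ValueError("query_heads must be a multiple of kv_heads")
    group_size = query_heads // kv_heads  # 2 query heads per KV head here
    return query_head // group_size

mapping = [kv_head_for_query(q) for q in range(8)]
```

Halving the number of KV heads halves the key/value projection parameters and the KV cache relative to full multi-head attention.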

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"matrix_lr":0.028,"beta2":0.99,"adam_on_embeddings_and_scalars":true}

Weight Averaging

EMA

parameters: {"decay":0.9965}
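A minimal sketch of an EMA weight update with the listed decay of 0.9965. Treating the weights as a flat list, starting the average at zero, and updating once per step are simplifications for illustration:

```python
def ema_update(ema_weights, weights, decay=0.9965):
    """Blend the running average with the current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

ema = [0.0, 0.0]            # hypothetical starting average
for _ in range(3):          # three updates toward constant weights
    ema = ema_update(ema, [1.0, 2.0])
```

With this decay each step moves the average 0.35% of the way toward the current weights, so the average reflects roughly the last few hundred steps.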

Quantization

GPTQ

bits: 6

scope: matrices

int7

bits: 7

scope: embeddings

LQER asymmetric rank-4

bits: null

scope: all
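To make the bit widths concrete, here is a sketch of plain symmetric round-to-nearest quantization. This is not GPTQ, which additionally uses calibration batches to compensate rounding error, and it is not LQER; the per-list scale and the example weights are assumptions:

```python
def quantize_symmetric(values, bits):
    """Round-to-nearest symmetric quantization with one scale for the whole list."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits, 63 for 7 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]

weights = [0.31, -0.12, 0.05, -0.31]  # hypothetical weights
codes6, scale6 = quantize_symmetric(weights, bits=6)
restored = dequantize(codes6, scale6)
```

Moving from 6 to 7 bits roughly doubles the number of levels, which is one plausible reason to give embeddings the wider format.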

Test-Time Training

LoRA TTT

parameters: {"rank":80,"learning_rate":0.00008,"beta2":0.99,"weight_decay":2,"num_phases":1,"prefix_docs":2500,"score_first":true}
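A toy sketch of the low-rank update that LoRA adds to a frozen weight. The PR lists rank 80; the example uses rank 1 and 2x2 matrices for readability, and the helper names, scale, and zero-initialized up-projection are illustrative assumptions:

```python
def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight
A = [[1.0, 1.0]]              # rank-1 down-projection (1x2)
B = [[0.0], [0.0]]            # up-projection (2x1), zero-initialized
x = [2.0, 3.0]

y_initial = lora_forward(W, A, B, x)  # adapter contributes nothing yet
B[0][0] = 0.5                         # stand-in for one adaptation step
y_adapted = lora_forward(W, A, B, x)
```

Because the up-projection starts at zero, the adapted model initially matches the base model, and test-time updates only move the small A and B matrices.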

Regularization

logit softcap

parameters: {"method":"AsymLogit Rescale","init":30,"global_ttt":true}
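The page does not describe the "AsymLogit Rescale" method. As a point of reference only, this sketch shows the common symmetric tanh softcap, assuming the listed init of 30 plays the role of the cap value:

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly squash a logit into the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)

small = softcap(1.0)      # nearly unchanged for small logits
large = softcap(1000.0)   # saturates at the cap for large logits
```

An asymmetric variant would presumably treat positive and negative logits differently, but that detail is not given here.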

weight decay

parameters: {"value":2}

Other

other

N-gram tilt using only the strictly causal token channel; the within-word and word-start channels are disabled.

parameters: {"token_order":16,"token_threshold":0.8,"token_boost":2.625}

Sequence Length

sequence_length

train_length: null

eval_length: 2560

Novel Contributions

Reduced GATE_WINDOW from 12 to 8
Increased GPTQ_CALIBRATION_BATCHES from 16 to 32
Isolated two-knob ablation on top of the PR #2130 stack
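A sketch of how the two knobs could be overridden for this ablation, assuming GATE_WINDOW and GPTQ_CALIBRATION_BATCHES are read as environment variables (the page does not say how they are set). The defaults are the pre-change values listed above, and `read_knobs` is a hypothetical helper:

```python
import os

# Defaults are the pre-change values listed under Novel Contributions.
DEFAULTS = {"GATE_WINDOW": 12, "GPTQ_CALIBRATION_BATCHES": 16}

def read_knobs(env=None):
    """Read both knobs from a mapping, falling back to the defaults."""
    env = os.environ if env is None else env
    return {name: int(env.get(name, default)) for name, default in DEFAULTS.items()}

baseline = read_knobs({})
candidate = read_knobs({"GATE_WINDOW": "8", "GPTQ_CALIBRATION_BATCHES": "32"})

# Record exactly which knobs differ, as (old, new) pairs.
changed = {k: (baseline[k], candidate[k]) for k in DEFAULTS if baseline[k] != candidate[k]}
```

Logging the diff against the baseline makes it easy to confirm that only these two settings changed relative to the PR #2130 stack.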