PR #2130

open

Record candidate: 1.05670 BPB — token-only n-gram tilt + AsymLogit + #2060 levers + NUM_PHASES=1

by TanishGudise

val_bpb

1.0567
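For readers unfamiliar with the metric, bits per byte (BPB) is the summed negative log-likelihood converted from nats to bits and divided by the number of raw bytes scored. A minimal sketch, assuming the evaluator accumulates NLL in nats; the function name is hypothetical and not taken from the PR:

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits.
    return total_nll_nats / (math.log(2) * total_bytes)
```

Lower is better: a model that spends exactly one bit on every byte scores 1.0.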

Architecture

Transformer

Optimizer

Muon

Artifact Size

15.95 MB

Training Techniques

Architecture

SmearGate

BOS-fixed smear gate and sparse attention gating used in the base stack.

parameters: {"gate_window":12,"scale":0.5}

XSA

XSA applied across all layers.

parameters: {"layers":11}

Partial RoPE

Partial rotary position embeddings used.

parameters: {"dimensions":16}
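Partial RoPE rotates only a leading slice of each head vector and passes the remaining dimensions through unchanged. The sketch below assumes 16 rotary dimensions (from the parameters above), adjacent-pair rotation, and a frequency base of 10000; the pairing convention and base are assumptions, not details stated in the PR.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; leave the rest as-is."""
    if rotary_dims % 2 != 0 or rotary_dims > len(vec):
        raise ValueError("rotary_dims must be even and no larger than len(vec)")
    out = list(vec)
    for i in range(rotary_dims // 2):
        # One rotation angle per pair; the frequency decays with the pair index.
        theta = position * base ** (-2.0 * i / rotary_dims)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * cos_t - b * sin_t
        out[2 * i + 1] = a * sin_t + b * cos_t
    return out
```

Because each pair undergoes a pure rotation, the norm of the rotated slice is preserved, and dimensions past `rotary_dims` carry no positional signal.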

depth recurrence

Layers 3-5 are looped recurrently.

parameters: {"layers":[3,4,5],"frac":0.35}
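One way to read the looped block is as a layer execution order in which layers 3 to 5 run more than once per forward pass. The sketch assumes 0-based indices, 11 layers in total (the XSA entry above lists 11), and a hypothetical `num_loops` of 2. The meaning of `frac` is not stated in the summary and is not modeled here.

```python
def layer_schedule(num_layers=11, looped=(3, 4, 5), num_loops=2):
    """Return the order in which layer indices run when `looped` is repeated."""
    order = []
    i = 0
    while i < num_layers:
        if i == looped[0]:
            # Run the whole looped block `num_loops` times, then skip past it.
            order.extend(list(looped) * num_loops)
            i = looped[-1] + 1
        else:
            order.append(i)
            i += 1
    return order
```

Under these assumptions `layer_schedule()` runs 14 layer applications using the weights of 11 layers, which is how recurrence adds depth without adding parameters.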

Gated Attention

Sparse attention gate used in the model.

parameters: {"scale":0.5,"window":12}

Quantization

GPTQ

bits: 6

scope: matrices

GPTQ

bits: 7

scope: embeddings

mixed int6/int7

bits: null

scope: model weights

GPTQ

bits: null

scope: mixed quantization with LQER asymmetric rank-4
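To illustrate what the 6-bit and 7-bit widths mean for storage, here is plain symmetric round-to-nearest quantization of one weight row. This is deliberately not GPTQ, which additionally compensates rounding error using second-order statistics, and it does not model the LQER low-rank correction. The function names are hypothetical.

```python
def quantize_row(row, bits):
    """Symmetric round-to-nearest quantization of one row to signed `bits`-bit codes."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits, 63 for 7 bits
    max_abs = max(abs(x) for x in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]
```

The per-element reconstruction error is at most half a quantization step, so the extra bit for embeddings halves the worst-case error at the same dynamic range.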

Optimizer

Muon

weight_decay: 2

momentum: null

other_params: {"matrix_lr":0.028,"beta2":0.99}

Weight Averaging

EMA

parameters: {"decay":0.9965}
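EMA weight averaging keeps a shadow copy of the parameters that moves a small fraction of the way toward the live weights after each step. A minimal sketch using the decay listed above, with parameters represented as a flat list of floats for illustration:

```python
def ema_update(avg, params, decay=0.9965):
    """Return the updated shadow weights: decay * avg + (1 - decay) * params."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```

With a decay of 0.9965 each step contributes 0.35% of the live weights, which corresponds to an averaging horizon of roughly 1 / (1 - 0.9965), or about 286 steps.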

Test-Time Training

score-first TTT

parameters: {"rank":80,"learning_rate":0.00008,"num_phases":1,"prefix_docs":2500}
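The defining property of score-first TTT is ordering: each chunk is scored with the weights as they stand, and only afterwards do those tokens contribute a gradient update. The toy below shows that ordering with a single trainable logit over binary tokens. It is a sketch of the control flow only; the rank-80 adapter, the prefix-document phase, and the real learning rate from the parameters above are not modeled, and the `lr` used here is hypothetical.

```python
import math


def score_first_ttt(chunks, lr=0.5):
    """Score each chunk before updating on it; return (mean NLL in nats, final logit)."""
    w = 0.0  # single logit: p(token == 1) = sigmoid(w)
    total_nll, count = 0.0, 0
    for chunk in chunks:
        if not chunk:
            continue
        p = 1.0 / (1.0 + math.exp(-w))
        # 1) Score first: the loss is recorded with pre-update weights.
        for t in chunk:
            total_nll += -math.log(p if t == 1 else 1.0 - p)
        count += len(chunk)
        # 2) Then adapt: a gradient step on the chunk that was just scored.
        grad = sum(p - t for t in chunk) / len(chunk)
        w -= lr * grad
    return total_nll / count, w
```

Since no token is scored by weights that have already seen it, the reported loss remains a valid prediction loss while later chunks still benefit from adaptation.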

Regularization

logit softcap

parameters: {"asymmetric":true,"softcap_pos":null,"softcap_neg":null}
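An asymmetric softcap can be sketched as a tanh squashing whose bound depends on the sign of the logit. The summary lists both cap values as null, so the 30 and 15 used in the checks are placeholders, and the tanh form itself is an assumption about how the cap is applied.

```python
import math


def asym_softcap(logit, cap_pos, cap_neg):
    """Squash `logit` into (-cap_neg, cap_pos) with a sign-dependent tanh cap."""
    cap = cap_pos if logit >= 0 else cap_neg
    return cap * math.tanh(logit / cap)
```

Near zero the function is approximately the identity on both sides, so the two caps only differ in how strongly they limit large positive versus large negative logits.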

weight decay

parameters: {"muon_huber":true,"delta":0.1}
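The `muon_huber` flag with `delta` 0.1 suggests a Huber-shaped penalty in place of the usual quadratic one. Under that assumption, the decay gradient is proportional to the weight for small weights and constant in magnitude beyond `delta`. How this term is combined with the Muon update is not stated in the summary, so the sketch covers only the per-weight gradient.

```python
def huber_decay_grad(w, delta=0.1):
    """Gradient of a Huber penalty: w inside [-delta, delta], +/-delta outside."""
    if abs(w) <= delta:
        return w
    return delta if w > 0 else -delta
```

Compared with quadratic decay, large weights are pulled back with bounded force, while small weights see ordinary L2-style shrinkage.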

Evaluation

n-gram tilt

parameters: {"token_only":true,"token_order":16,"token_threshold":0.8,"token_boost":2.625}
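A token-level tilt can be sketched as follows: when a causal n-gram lookup proposes a next token with confidence at or above the threshold, that token's probability is multiplied by exp(boost) and the distribution is renormalized. Treating `token_boost` as a log-space additive boost is an assumption, and the order-16 n-gram lookup that would supply `hinted_token` and `hint_confidence` is not shown; both argument names are hypothetical.

```python
import math


def tilt(probs, hinted_token, hint_confidence, threshold=0.8, boost=2.625):
    """Boost `hinted_token` in the distribution `probs` if the hint is confident."""
    if hint_confidence < threshold:
        return dict(probs)
    factor = math.exp(boost)
    # Closed-form normalizer: only one entry of the distribution is rescaled.
    z = 1.0 + probs.get(hinted_token, 0.0) * (factor - 1.0)
    return {
        tok: (p * factor if tok == hinted_token else p) / z
        for tok, p in probs.items()
    }
```

Because the hint is derived only from previously seen tokens, the tilted distribution is still a proper causal predictor, and the closed-form normalizer avoids a second pass over the vocabulary.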

Sequence Length

sequence_length

train_length: 2560

eval_length: 2560

Other

other

AsymLogit Rescale with trainable positive/negative logit softcaps adapted globally during TTT.

parameters: {"global_ttt":true}

Novel Contributions

Token-only causal n-gram tilt with within-word and word-start channels disabled
AsymLogit Rescale using separate positive and negative softcaps
Three hyperparameter levers from PR #2060
Single-phase TTT prefix pass (NUM_PHASES=1)
Score-first phased TTT where scoring precedes gradient updates