PR #2130
open
Record candidate: 1.05670 BPB - token-only n-gram tilt + AsymLogit + #2060 levers + NUM_PHASES=1
by TanishGudise
val_bpb
1.0567
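For reference, bits per byte (BPB) divides the summed token cross-entropy by the byte length of the evaluated text rather than by the token count. A minimal sketch, assuming per-token losses are given in nats; the function name and inputs are illustrative and not taken from the PR:

```python
import math

def bits_per_byte(token_nll_nats, num_bytes):
    """Convert summed per-token negative log-likelihoods (nats) to bits per byte."""
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    return total_bits / num_bytes
```

Because the denominator is bytes, the metric stays comparable across tokenizers with different vocabularies.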
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.95 MB
Training Techniques
Architecture
SmearGate
BOS-fixed smear gate and sparse attention gating used in the base stack.
parameters: {"gate_window":12,"scale":0.5}
XSA
XSA applied across all layers.
parameters: {"layers":11}
Partial RoPE
Partial rotary position embeddings used.
parameters: {"dimensions":16}
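Partial RoPE rotates only a slice of each attention head's dimensions and leaves the rest position-independent. The sketch below rotates the first 16 entries of a single head vector. The half-split pairing, the base of 10000, and the choice of which 16 dimensions are rotated are assumptions; the PR summary states only the dimension count.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of one head vector; pass the rest through."""
    if rotary_dims % 2 or rotary_dims > len(vec):
        raise ValueError("rotary_dims must be even and at most len(vec)")
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]  # half-split pairing (assumed convention)
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```

Each pair undergoes a plain 2D rotation, so the norm of the rotated slice is preserved and position 0 leaves the vector unchanged.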
depth recurrence
Layers 3-5 are run as a recurrent loop.
parameters: {"layers":[3,4,5],"frac":0.35}
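One way to picture depth recurrence is as an execution order in which a contiguous block of layers appears more than once while its weights are shared. The sketch below assumes 0-indexed layers, an 11-layer stack (the count listed under XSA), and a hypothetical `num_loops` argument; the summary does not state the loop count, and the meaning of `frac` is not modeled here.

```python
def layer_schedule(num_layers=11, loop_layers=(3, 4, 5), num_loops=2):
    """Return the order in which layer indices run when a contiguous block is repeated."""
    start, end = loop_layers[0], loop_layers[-1]
    order = list(range(start))                   # layers before the looped block
    for _ in range(num_loops):
        order.extend(range(start, end + 1))      # looped block, same weights each pass
    order.extend(range(end + 1, num_layers))     # layers after the looped block
    return order
```

With the defaults this yields `[0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]`: effective depth grows from 11 to 14 without adding parameters.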
Gated Attention
Sparse attention gate used in the model.
parameters: {"scale":0.5,"window":12}
Quantization
GPTQ
bits: 6
scope: matrices
GPTQ
bits: 7
scope: embeddings
mixed int6/int7
bits: null
scope: model weights
GPTQ
bits: null
scope: mixed quantization with LQER asymmetric rank-4
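To make the bit widths concrete, the sketch below quantizes one weight row to signed 6-bit or 7-bit integers. It uses plain symmetric round-to-nearest, which is not GPTQ (GPTQ additionally compensates rounding error using second-order statistics) and does not include the LQER low-rank correction. The per-row scale and the function names are assumptions for illustration.

```python
def quantize_row(row, bits):
    """Symmetric round-to-nearest quantization of one weight row to signed `bits`-bit codes."""
    qmax = 2 ** (bits - 1) - 1                 # 31 for int6, 63 for int7
    peak = max(abs(w) for w in row)
    scale = peak / qmax if peak > 0 else 1.0   # avoid dividing by zero on an all-zero row
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return codes, scale

def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]
```

The reconstruction error per weight is bounded by half a quantization step, so moving from 6 to 7 bits roughly halves it.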
Optimizer
Muon
weight_decay: 2
momentum: null
other_params: {"matrix_lr":0.028,"beta2":0.99}
Weight Averaging
EMA
parameters: {"decay":0.9965}
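The EMA entry corresponds to the standard exponential moving average of the weights, sketched here over a flat list of floats. The function name and the flat-list representation are illustrative; how often the PR applies the update is not stated.

```python
def ema_update(ema_weights, weights, decay=0.9965):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```

A decay of 0.9965 corresponds to an averaging horizon of roughly 1 / (1 - 0.9965), or about 286 updates.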
Test-Time Training
score-first TTT
parameters: {"rank":80,"learning_rate":0.00008,"num_phases":1,"prefix_docs":2500}
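The defining property of score-first TTT is ordering: each piece of evaluation data is scored with the current parameters before any update that uses it. The sketch below shows only that ordering with caller-supplied functions. `score_fn`, `update_fn`, and the chunking are hypothetical, and the rank, learning rate, and prefix-document settings above are not modeled.

```python
def score_first_ttt(chunks, params, score_fn, update_fn):
    """Score each chunk with the current params, then adapt on that same chunk."""
    losses = []
    for chunk in chunks:
        losses.append(score_fn(params, chunk))  # scoring happens first
        params = update_fn(params, chunk)       # adaptation only affects later chunks
    return losses, params
```

For example, with a scalar parameter, `score_fn = lambda p, x: (x - p) ** 2`, and `update_fn = lambda p, x: p + 0.5 * (x - p)`, the chunks `[1.0, 1.0, 1.0]` starting from `0.0` give losses `[1.0, 0.25, 0.0625]`: every reported loss comes from parameters that have not yet seen that chunk.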
Regularization
logit softcap
parameters: {"asymmetric":true,"softcap_pos":null,"softcap_neg":null}
weight decay
parameters: {"muon_huber":true,"delta":0.1}
Evaluation
n-gram tilt
parameters: {"token_only":true,"token_order":16,"token_threshold":0.8,"token_boost":2.625}
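One plausible reading of the tilt parameters is that a causal n-gram predictor proposes a next token from preceding tokens, and the model's logit for that token is raised by `token_boost` when the predictor's confidence reaches `token_threshold`. The sketch below implements that reading. The additive-logit form, the confidence input, and all function names are assumptions; the n-gram lookup itself (up to order 16) is not shown.

```python
import math

def tilt_logits(logits, ngram_token, ngram_confidence, threshold=0.8, boost=2.625):
    """Add `boost` to the n-gram's predicted token when its confidence clears `threshold`."""
    tilted = list(logits)  # copy so the caller's logits are not modified
    if ngram_token is not None and ngram_confidence >= threshold:
        tilted[ngram_token] += boost
    return tilted

def softmax(logits):
    """Numerically stable softmax, used to renormalize after tilting."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the boost is applied before the softmax, the result is still a normalized distribution, which a BPB evaluation requires.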
Sequence Length
sequence_length
train_length: 2560
eval_length: 2560
Other
AsymLogit Rescale
Trainable positive and negative logit softcaps, adapted globally during TTT.
parameters: {"global_ttt":true}
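An asymmetric softcap can be sketched as a tanh cap whose bound depends on the sign of the logit. The tanh form is an assumption based on the common logit-softcap formulation, and the cap values in the usage note are hypothetical, since the summary lists `softcap_pos` and `softcap_neg` as null. In the PR both caps are trainable; here they are plain arguments.

```python
import math

def asym_softcap(logit, cap_pos, cap_neg):
    """Tanh softcap with separate bounds for positive and negative logits."""
    cap = cap_pos if logit >= 0 else cap_neg
    return cap * math.tanh(logit / cap)
```

With `cap_pos=30.0` and `cap_neg=15.0`, large positive logits saturate near 30 and large negative logits near -15, while small logits pass through almost unchanged.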
Novel Contributions
- Token-only causal n-gram tilt with within-word and word-start channels disabled
- AsymLogit Rescale using separate positive and negative softcaps
- Three hyperparameter levers from PR #2060
- Single-phase TTT prefix pass (NUM_PHASES=1)
- Score-first phased TTT where scoring precedes gradient updates