PR #2136
open
Record candidate: PR #2130 base + GATE_WINDOW=8 + GPTQ_CALIBRATION_BATCHES=32
by codemath3000
val_bpb
1.0567
Architecture
Transformer
Optimizer
Muon
Artifact Size
—
Training Techniques
Architecture
SmearGate
Sparse attention gating that combines a fixed BOS smear gate with a sparse attention gate.
parameters: {"scale":0.5,"gate_window":8}
XSA
Cross/self-attention-style architectural component applied across all 11 layers.
parameters: {"layers":11}
Partial RoPE
Rotary position embeddings applied to only 16 dimensions instead of all of them.
parameters: {"dimensions":16}
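A minimal sketch of partial RoPE, assuming that "dimensions":16 means the first 16 features of each head vector are rotated and the remainder pass through unchanged. The function name `partial_rope`, the pairing of adjacent features, the base of 10000, and the head size of 64 are illustrative assumptions, not details taken from the PR.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` features of `vec`; leave the rest untouched."""
    assert rotary_dims % 2 == 0 and rotary_dims <= len(vec)
    out = list(vec)
    for i in range(0, rotary_dims, 2):
        # One frequency per feature pair, decaying geometrically with the pair index.
        theta = position * base ** (-i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

head = [1.0] * 64  # hypothetical head dimension
rotated = partial_rope(head, position=5)
```

Only the rotated slice carries position information; the other 48 features in this example behave like position-free channels.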
depth recurrence
Layers 3-5 are looped recurrently.
parameters: {"layers":[3,4,5],"frac":0.35}
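One way to picture the looped block is as an execution order over layer indices. The sketch below assumes an 11-layer stack and a single extra pass through layers 3-5 with shared weights; the helper `layer_schedule` and the `extra_passes` argument are hypothetical, and the "frac" parameter is not modelled because its meaning is not stated here.

```python
def layer_schedule(num_layers, loop_layers, extra_passes=1):
    """Return the order in which layer indices are executed."""
    first, last = loop_layers[0], loop_layers[-1]
    schedule = []
    for idx in range(num_layers):
        schedule.append(idx)
        if idx == last:
            # Re-run the looped block, reusing the same layer weights.
            for _ in range(extra_passes):
                schedule.extend(range(first, last + 1))
    return schedule

order = layer_schedule(num_layers=11, loop_layers=[3, 4, 5])
# order == [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
```

Recurrence of this kind adds effective depth without adding parameters, which matters when the artifact size is constrained.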
GQA
Grouped-query attention with fewer KV heads than query heads.
parameters: {"query_heads":8,"kv_heads":4}
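With 8 query heads and 4 KV heads, each KV head serves a group of two query heads. The sketch below shows that mapping, assuming contiguous grouping (a common convention, though the PR's exact layout is not confirmed here); `kv_head_for` is a hypothetical helper name.

```python
def kv_head_for(query_head, query_heads=8, kv_heads=4):
    """Return the index of the KV head shared by `query_head`."""
    assert query_heads % kv_heads == 0
    group_size = query_heads // kv_heads  # query heads per KV head
    return query_head // group_size

mapping = [kv_head_for(h) for h in range(8)]
# mapping == [0, 0, 1, 1, 2, 2, 3, 3]
```

Sharing keys and values this way halves the KV projection parameters and the KV cache relative to full multi-head attention with 8 heads.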
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"matrix_lr":0.028,"beta2":0.99,"adam_on_embeddings_and_scalars":true}
Weight Averaging
EMA
parameters: {"decay":0.9965}
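The EMA entry corresponds to keeping a slowly moving copy of the weights. A minimal sketch of one update step with the listed decay follows; the flat-list representation and the name `ema_update` are assumptions made for illustration.

```python
def ema_update(ema_weights, weights, decay=0.9965):
    """Blend the current weights into the running average."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

ema = [0.0, 0.0]
for _ in range(1000):
    # Constant targets make the convergence behaviour easy to inspect.
    ema = ema_update(ema, [1.0, -2.0])
```

With a decay of 0.9965, each step contributes 0.35% of the current weights, so the average has an effective horizon of roughly 1 / (1 - 0.9965) ≈ 286 steps.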
Quantization
GPTQ
bits: 6
scope: matrices
int7
bits: 7
scope: embeddings
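As an illustration of what 7-bit storage for embeddings can look like, the sketch below uses symmetric per-row round-to-nearest quantization. This scheme is an assumption; the PR's actual int7 procedure is not described here, and `quantize_row` and `dequantize_row` are hypothetical names.

```python
def quantize_row(row, bits=7):
    """Map floats to signed integer codes in [-qmax, qmax] plus one scale."""
    qmax = 2 ** (bits - 1) - 1  # 63 for 7 bits
    # Fall back to a scale of 1.0 for an all-zero row to avoid dividing by zero.
    scale = max(abs(v) for v in row) / qmax or 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return codes, scale

def dequantize_row(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]

row = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_row(row)
restored = dequantize_row(codes, scale)
```

The reconstruction error per element is bounded by half the scale, so rows with a small dynamic range are stored more accurately.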
LQER asymmetric rank-4
bits: null
scope: all
Test-Time Training
LoRA TTT
parameters: {"rank":80,"learning_rate":0.00008,"beta2":0.99,"weight_decay":2,"num_phases":1,"prefix_docs":2500,"score_first":true}
Regularization
logit softcap
parameters: {"method":"AsymLogit Rescale","init":30,"global_ttt":true}
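For orientation, the standard tanh form of logit softcapping is sketched below with a cap of 30, matching the listed "init" value. This is the generic symmetric variant only; the "AsymLogit Rescale" method named above is not specified here, so its asymmetric behaviour is not reproduced, and `softcap` is a hypothetical name.

```python
import math

def softcap(logit, cap=30.0):
    """Squash a logit smoothly into the interval [-cap, cap]."""
    return cap * math.tanh(logit / cap)

small = softcap(1.0)     # nearly unchanged for |logit| much smaller than cap
large = softcap(1000.0)  # saturates near the cap
```

Small logits pass through almost linearly, while extreme values are bounded, which limits how confident any single prediction can become.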
weight decay
parameters: {"value":2}
Other
other
Token-only n-gram tilt with strictly causal token channel enabled and within-word/word-start channels disabled.
parameters: {"token_order":16,"token_threshold":0.8,"token_boost":2.625}
Sequence Length
sequence_length
train_length: null
eval_length: 2560
Novel Contributions
- Reduced GATE_WINDOW from 12 to 8
- Increased GPTQ_CALIBRATION_BATCHES from 16 to 32
- Isolated two-knob ablation on top of the PR #2130 stack