PR #2135
RECORD
open
Record candidate: PR #2130 base + GPTQ_CALIBRATION_BATCHES=32
by codemath3000
val_bpb
1.0567
Architecture
Transformer
Optimizer
Muon
Artifact Size
—
Training Techniques
Architecture
Partial RoPE
Uses partial rotary positional embeddings.
parameters: {"dimensions":16}
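A minimal sketch of partial RoPE, assuming the listed `dimensions` value of 16 is the number of leading per-head dimensions that get rotated while the rest pass through unchanged. The adjacent-pair layout, the base of 10000, and the 64-wide head in the usage lines are illustrative assumptions, not details taken from the PR.

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` entries of one head vector; copy the rest."""
    out = list(vec)
    for i in range(rot_dims // 2):
        # Each adjacent pair (2i, 2i+1) gets its own rotation frequency.
        theta = pos * base ** (-2.0 * i / rot_dims)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * math.cos(theta) - b * math.sin(theta)
        out[2 * i + 1] = a * math.sin(theta) + b * math.cos(theta)
    return out

head = [1.0] * 64                  # hypothetical head width
rotated = partial_rope(head, pos=5)
```

Because each pair is only rotated, the vector norm is preserved, and entries from index 16 onward carry no positional signal.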
depth recurrence
Layers 3-5 are looped recurrently.
parameters: {"layers":[3,4,5],"frac":0.35}
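The looping can be pictured as a layer execution order. The sketch below assumes 0-indexed layers, an 11-layer stack (inferred from the XSA entry), and two passes through the looped block; the pass count is hypothetical, and the meaning of `frac` is not stated on this page, so it is left out.

```python
def layer_schedule(num_layers=11, loop_layers=(3, 4, 5), num_passes=2):
    """Return the order in which layer indices run when a block is looped."""
    first, last = loop_layers[0], loop_layers[-1]
    order = list(range(first))                 # layers before the loop
    for _ in range(num_passes):
        order.extend(loop_layers)              # same weights reused each pass
    order.extend(range(last + 1, num_layers))  # layers after the loop
    return order

schedule = layer_schedule()
```

The looped layers add compute depth without adding parameters, which matters when the artifact size is constrained.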
XSA
Applies XSA across all layers.
parameters: {"layers":11}
SmearGate
Uses BOS-fixed SmearGate gating.
parameters: null
Gated Attention
Uses SparseAttnGate gating in attention.
parameters: {"scale":0.5}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"lr":0.028,"scope":"matrix params"}
Adam
weight_decay: null
momentum: null
other_params: {"scope":"embedding/scalars","beta2":0.99}
Weight Averaging
EMA
parameters: {"decay":0.9965}
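As a sketch of what the decay value means, the standard EMA update keeps a shadow copy of the weights and moves it a fraction `1 - decay` toward the live weights each step. The flat-list representation is a simplification for illustration.

```python
def ema_update(ema, weights, decay=0.9965):
    """Return decay * ema + (1 - decay) * weights, element by element."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

ema = [0.0, 1.0]
for _ in range(3):
    ema = ema_update(ema, [1.0, 1.0])
```

With a decay of 0.9965, each step contributes 0.35% of the live weights, so the average spans roughly the last few hundred steps.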
Quantization
GPTQ
bits: 6
scope: matrices
int7
bits: 7
scope: embeddings
LQER
bits: 4
scope: asymmetric rank-4
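To make the bit widths concrete, here is a sketch of plain symmetric round-to-nearest quantization at a given bit width. This is an assumption-laden illustration only: GPTQ additionally uses Hessian-based error compensation, LQER adds a low-rank correction, and the actual int7 embedding scheme (per-row, symmetric, or otherwise) is not described on this page.

```python
def quantize_symmetric(row, bits):
    """Round-to-nearest onto a symmetric integer grid with 2**(bits-1)-1 levels per side."""
    qmax = 2 ** (bits - 1) - 1            # 31 for 6 bits, 63 for 7 bits
    max_abs = max(abs(v) for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to floats."""
    return [c * scale for c in codes]

row = [0.5, -1.0, 0.25, 0.9]
codes7, scale7 = quantize_symmetric(row, bits=7)
restored = dequantize(codes7, scale7)
```

Each extra bit halves the grid spacing, so the worst-case rounding error (half of `scale`) halves as well.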
Test-Time Training
LoRA TTT
parameters: {"rank":80,"learning_rate":0.00008,"beta2":0.99,"weight_decay":2,"phases":1,"prefix_docs":2500,"score_first":true}
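The `score_first` flag presumably means each chunk of evaluation text is scored before the adapter trains on it, so no token is predicted by weights that have already seen it. The loop below sketches that ordering under this assumption; `score` and `adapt` are hypothetical callables, not functions from the PR.

```python
def score_first_ttt(chunks, score, adapt, state):
    """Score every chunk with the current state, then adapt on it."""
    total_loss = 0.0
    for chunk in chunks:
        total_loss += score(state, chunk)   # loss recorded before any update
        state = adapt(state, chunk)         # only then train on the scored chunk
    return total_loss, state
```

In the PR's setting, `state` would stand for the rank-80 LoRA weights and `adapt` for an optimizer step with the listed learning rate and weight decay.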
Regularization
logit softcap
parameters: {"type":"AsymLogit Rescale","init":30}
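For orientation, the common symmetric form of a logit softcap squashes logits smoothly into the range (-cap, cap) with a tanh. The sketch below shows only that baseline with the listed initial value of 30; how "AsymLogit Rescale" makes the cap asymmetric or learnable is not specified here and is not reproduced.

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to [-cap, cap]; near-identity for small inputs."""
    return cap * math.tanh(logit / cap)

capped = [softcap(v) for v in (-100.0, 0.0, 1.0, 100.0)]
```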
Other
other
Token-only n-gram tilt: the strictly causal token channel is enabled, and the within-word and word-start channels are disabled.
parameters: {"token_order":16,"token_threshold":0.8,"token_boost":2.625}
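One plausible reading of these parameters, offered purely as an assumption, is that an n-gram predictor over up to 16 previous tokens proposes a next token, and when its confidence reaches `token_threshold` the model's logit for that token is raised by `token_boost`. The sketch below implements only that reading; the function names are hypothetical and the PR's actual tilt rule may differ.

```python
import math

def tilt_logits(logits, ngram_token, ngram_conf, threshold=0.8, boost=2.625):
    """Add `boost` to the n-gram's proposed token when it is confident enough."""
    tilted = list(logits)
    if ngram_conf >= threshold:
        tilted[ngram_token] += boost
    return tilted

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax(tilt_logits([0.0, 0.0, 0.0], ngram_token=1, ngram_conf=0.9))
```

Because the proposal depends only on earlier tokens, a tilt of this kind stays strictly causal.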
Sequence Length
sequence_length
train_length: null
eval_length: 2560
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85}
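A sketch of how a warmdown fraction of 0.85 could translate into a learning-rate multiplier, assuming the rate is held constant for the first 15% of steps and then decays linearly to zero over the final 85%. The linear shape and the zero endpoint are assumptions; only the fraction comes from this page.

```python
def lr_scale(step, total_steps, warmdown_frac=0.85):
    """Multiplier for the base learning rate at a given step."""
    warmdown_steps = warmdown_frac * total_steps
    remaining = total_steps - step
    # Before warmdown starts, remaining > warmdown_steps and the ratio clamps to 1.
    return max(0.0, min(1.0, remaining / warmdown_steps))

scales = [lr_scale(s, 1000) for s in (0, 150, 575, 1000)]
```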
Novel Contributions
- Increases GPTQ calibration batches from 16 to 32 while keeping the PR #2130 stack otherwise unchanged.
- Presents a clean ablation isolating the effect on validation BPB of calibrating the GPTQ Hessian on twice as many batches (32 instead of 16).
- Retains full validation coverage and the same training/evaluation pipeline as the baseline.
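To illustrate why the number of calibration batches matters, GPTQ estimates a per-layer Hessian from the layer's inputs, proportional to the average outer product of the input vectors. The sketch below accumulates that estimate over a list of batches; the list-of-lists layout and the function name are assumptions for illustration, not the PR's code.

```python
def accumulate_hessian(batches):
    """Estimate H = (2 / n) * sum of x x^T over all calibration inputs."""
    dim = len(batches[0][0])
    hessian = [[0.0] * dim for _ in range(dim)]
    n = 0
    for batch in batches:
        for x in batch:
            n += 1
            for i in range(dim):
                for j in range(dim):
                    hessian[i][j] += x[i] * x[j]
    return [[2.0 * v / n for v in row] for row in hessian]

H = accumulate_hessian([[[1.0, 0.0], [0.0, 1.0]]])
```

Going from 16 to 32 batches doubles `n`, which lowers the sampling noise of this average without changing the quantity being estimated.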