PR #2135

RECORD, open

Record candidate: PR #2130 base + GPTQ_CALIBRATION_BATCHES=32

by codemath3000

val_bpb

1.0567

Architecture

Transformer

Optimizer

Muon

Artifact Size

—

Training Techniques

Architecture

Partial RoPE

Uses partial rotary positional embeddings.

parameters: {"dimensions":16}
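As an illustration of what rotating only 16 dimensions means, here is a minimal sketch for a single head vector. Only the value 16 comes from the card; the function name, the frequency base of 10000, and the half-split pairing of coordinates are assumptions and may differ from the PR's code.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of one head vector; leave the rest untouched."""
    assert rotary_dims % 2 == 0 and rotary_dims <= len(vec)
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        # Usual RoPE geometric frequency schedule (base is an assumption).
        theta = position * base ** (-2.0 * i / rotary_dims)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]  # half-split pairing (an assumption)
        out[i] = a * cos_t - b * sin_t
        out[i + half] = a * sin_t + b * cos_t
    return out
```

Entries past index 15 carry no positional signal in this sketch, so those channels can only encode content.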

depth recurrence

Layers 3-5 are looped recurrently.

parameters: {"layers":[3,4,5],"frac":0.35}
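One way to picture the looped layers is as an execution order in which the block 3-5 is visited more than once. The sketch below is hypothetical: the card does not say how many extra passes are made or what `frac` controls. It assumes one extra pass and reads `frac` as the fraction of training after which looping switches on. The 11-layer depth is taken from the XSA entry.

```python
def layer_schedule(num_layers=11, looped=(3, 4, 5), extra_passes=1,
                   progress=1.0, enable_frac=0.35):
    """Return the order in which layer indices run for one forward pass.

    `extra_passes` and the meaning of `enable_frac` are assumptions, not
    values confirmed by the card.
    """
    order = []
    looping_active = progress >= enable_frac
    for i in range(num_layers):
        order.append(i)
        if looping_active and i == looped[-1]:
            # Re-run the looped block with the same weights.
            for _ in range(extra_passes):
                order.extend(looped)
    return order
```

Under these assumptions the late-training order is 0, 1, 2, 3, 4, 5, 3, 4, 5, 6, ..., 10, which adds compute depth without adding parameters.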

XSA

Applies XSA across all layers.

parameters: {"layers":11}

SmearGate

Uses BOS-fixed SmearGate gating.

parameters: null

Gated Attention

Uses SparseAttnGate gating in attention.

parameters: {"scale":0.5}

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"lr":0.028,"scope":"matrix params"}
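The core idea behind Muon is to replace a matrix gradient with an approximately orthogonal matrix before applying it. The sketch below swaps in the classical cubic Newton-Schulz iteration for clarity. Muon implementations commonly use a tuned quintic iteration plus momentum, neither of which is shown here. The helper names are hypothetical; only lr = 0.028 comes from the card.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def orthogonalize(grad, steps=10):
    """Approximate the orthogonal polar factor of `grad` (cubic Newton-Schulz)."""
    # Frobenius normalisation keeps all singular values in (0, 1].
    norm = sum(v * v for row in grad for v in row) ** 0.5 or 1.0
    x = [[v / norm for v in row] for row in grad]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        # X <- 1.5 X - 0.5 X X^T X pushes every singular value toward 1.
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxt_x)]
    return x

def muon_like_step(weight, grad, lr=0.028):
    """One momentum-free update using the orthogonalized gradient."""
    update = orthogonalize(grad)
    return [[w - lr * u for w, u in zip(wr, ur)] for wr, ur in zip(weight, update)]
```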

Adam

weight_decay: null

momentum: null

other_params: {"scope":"embedding/scalars","beta2":0.99}

Weight Averaging

EMA

parameters: {"decay":0.9965}
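For reference, an exponential moving average with this decay can be sketched as below. The flat list of floats stands in for the model's parameters, and the function name is hypothetical.

```python
def ema_update(ema_params, params, decay=0.9965):
    """Blend the running average toward the current parameters."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

With decay 0.9965 the average has an effective horizon of roughly 1 / (1 - 0.9965), or about 286 steps.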

Quantization

GPTQ

bits: 6

scope: matrices

int7

bits: 7

scope: embeddings
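A 7-bit format for embeddings could look like the symmetric round-to-nearest sketch below. The per-row scale, the [-63, 63] code range, and the function names are assumptions. The card states only the bit width and the scope.

```python
def quantize_row(row, bits=7):
    """Symmetric round-to-nearest quantization of one embedding row."""
    qmax = 2 ** (bits - 1) - 1  # 63 for 7 bits
    scale = max(abs(v) for v in row) / qmax or 1.0  # fall back to 1.0 for an all-zero row
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return codes, scale

def dequantize_row(codes, scale):
    return [c * scale for c in codes]
```

The reconstruction error per entry is bounded by half a quantization step, that is, `scale / 2`.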

LQER

bits: 4

scope: asymmetric rank-4

Test-Time Training

LoRA TTT

parameters: {"rank":80,"learning_rate":0.00008,"beta2":0.99,"weight_decay":2,"phases":1,"prefix_docs":2500,"score_first":true}
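The `score_first` flag suggests that each chunk of evaluation text is scored before the adapter is trained on it, so no token is predicted by weights that have already seen it. A minimal sketch of that ordering follows. `score_fn`, `adapt_fn`, and the chunking are hypothetical placeholders, not the PR's interfaces.

```python
def score_first_ttt(chunks, score_fn, adapt_fn, state):
    """Evaluate with test-time training, always scoring before adapting.

    score_fn(state, chunk) -> summed loss over the chunk's tokens
    adapt_fn(state, chunk) -> new adapter state after training on the chunk
    """
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        # Score with the adapter as it stood before this chunk was seen.
        total_loss += score_fn(state, chunk)
        total_tokens += len(chunk)
        # Only afterwards update the adapter on the chunk just scored.
        state = adapt_fn(state, chunk)
    return total_loss / max(total_tokens, 1), state
```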

Regularization

logit softcap

parameters: {"type":"AsymLogit Rescale","init":30}
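The card does not define "AsymLogit Rescale". As a reference point, the standard tanh softcap with cap 30 is sketched below, extended with separate positive and negative caps as one plausible reading of "asymmetric". Both cap parameters and the function name are assumptions.

```python
import math

def asym_softcap(logit, pos_cap=30.0, neg_cap=30.0):
    """Squash a logit into (-neg_cap, pos_cap) with a tanh softcap.

    With pos_cap == neg_cap this is the usual symmetric softcap; the split
    into two caps is a guess at the asymmetric variant.
    """
    cap = pos_cap if logit >= 0 else neg_cap
    return cap * math.tanh(logit / cap)
```

Small logits pass through almost unchanged, while large ones saturate at the cap.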

Other

other

Token-only n-gram tilt with strictly causal token channel enabled and within-word/word-start channels disabled.

parameters: {"token_order":16,"token_threshold":0.8,"token_boost":2.625}
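A token-level n-gram tilt can be pictured as a cache of previously seen contexts that nudges the model's logits when it is confident. The sketch below is hypothetical in every detail beyond the three parameter values. It matches only the exact preceding context (no backoff), treats `token_boost` as an additive logit bonus, and stays causal by updating counts only after a token has been scored.

```python
import math
from collections import Counter, defaultdict

class CausalNgramTilt:
    def __init__(self, order=16, threshold=0.8, boost=2.625):
        assert order >= 2
        self.order, self.threshold, self.boost = order, threshold, boost
        self.counts = defaultdict(Counter)  # context tuple -> next-token counts

    def _context(self, history):
        return tuple(history[-(self.order - 1):])

    def tilt(self, logits, history):
        """Return log-probabilities after optionally boosting one token."""
        out = dict(logits)
        seen = self.counts.get(self._context(history))
        if seen:
            token, n = seen.most_common(1)[0]
            if n / sum(seen.values()) >= self.threshold and token in out:
                out[token] += self.boost
        log_z = math.log(sum(math.exp(v) for v in out.values()))
        return {t: v - log_z for t, v in out.items()}

    def update(self, history, next_token):
        """Record the observed token; call only after it has been scored."""
        self.counts[self._context(history)][next_token] += 1
```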

Sequence Length

sequence_length

train_length: null

eval_length: 2560

LR Schedule

warmdown

parameters: {"warmdown_frac":0.85}
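A warmdown fraction of 0.85 suggests the learning rate decays over the final 85% of training. The sketch below assumes a constant rate followed by a linear decay to zero, with no warmup. The shape of the decay and the function name are assumptions, and the peak value reuses the Muon learning rate from the card purely for illustration.

```python
def lr_at(step, total_steps, peak_lr=0.028, warmdown_frac=0.85):
    """Constant learning rate, then linear warmdown to zero."""
    if warmdown_frac <= 0:
        return peak_lr
    warmdown_start = (1.0 - warmdown_frac) * total_steps
    if step < warmdown_start:
        return peak_lr
    # Fraction of the warmdown window still remaining.
    remaining = (total_steps - step) / (total_steps - warmdown_start)
    return peak_lr * max(remaining, 0.0)
```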

Novel Contributions

Increases GPTQ calibration batches from 16 to 32 while keeping the PR #2130 stack otherwise unchanged.
Presents a clean ablation isolating the effect of denser GPTQ Hessian calibration on validation BPB.
Retains full validation coverage and the same training/evaluation pipeline as the baseline.
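To make the calibration change concrete: GPTQ's per-layer Hessian is built from the second moment of the layer's inputs, averaged over the calibration data, so more batches mean more samples behind the same estimate. A minimal sketch of that accumulation follows. The function name and data layout are hypothetical, and constant factors are omitted.

```python
def accumulate_hessian(batches):
    """Running mean of x x^T over every input vector in every calibration batch."""
    dim = len(batches[0][0])
    hessian = [[0.0] * dim for _ in range(dim)]
    seen = 0
    for batch in batches:
        for x in batch:
            seen += 1
            for i in range(dim):
                for j in range(dim):
                    # Incremental mean update, so memory stays O(dim^2).
                    hessian[i][j] += (x[i] * x[j] - hessian[i][j]) / seen
    return hessian
```

Doubling the number of batches from 16 to 32 does not change the matrix shape or the quantizer itself, only how many samples the average is taken over.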