PR #2132

open

Record candidate: PR #2014 base + GPTQ_CALIBRATION_BATCHES=32

by codemath3000
val_bpb
1.0576
Architecture
Transformer
Optimizer
Muon
Artifact Size

Training Techniques

Quantization
GPTQ
bits: 6
scope: matrices
int7
bits: 7
scope: embeddings
mixed int6/int7/int8
bits: null
scope: model weights
Architecture
Partial RoPE
Uses rotary position embeddings on a subset of dimensions.
parameters: {"dimensions":16}
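
As a rough illustration of partial RoPE, the sketch below rotates only the first 16 entries of a head vector and passes the remaining entries through unchanged. The adjacent-pair layout, the base of 10000, and the function name `partial_rope` are assumptions made for illustration; the PR's own implementation may pair dimensions differently.

```python
import math

def partial_rope(vec, position, rot_dims=16, base=10000.0):
    """Rotate the first rot_dims entries of vec; leave the rest untouched.

    Assumptions (not from the PR): adjacent-pair layout and base 10000.
    """
    assert rot_dims % 2 == 0 and rot_dims <= len(vec)
    out = list(vec)
    for i in range(0, rot_dims, 2):
        # One rotation frequency per pair of dimensions.
        theta = position * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```

Because each pair undergoes a plain 2D rotation, the vector norm is preserved, and the unrotated tail carries position-independent content.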
depth recurrence
Loops layers 3-5 with recurrence enabled partway through training.
parameters: {"layers":[3,4,5],"frac":0.35}
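
One way to picture the recurrence is as a change in the order in which layers are applied. The sketch below is hypothetical: it assumes 11 layers (inferred from the XSA entry), reads `frac` as the fraction of training after which looping switches on, and uses a made-up `num_loops=2`; none of these are confirmed by the PR.

```python
def layer_schedule(progress, num_layers=11, loop_layers=(3, 4, 5),
                   frac=0.35, num_loops=2):
    """Return the order in which layer indices run in one forward pass.

    Assumptions: looping turns on once progress >= frac, and the looped
    block runs num_loops times (num_loops is a hypothetical parameter).
    """
    passes = num_loops if progress >= frac else 1
    first, last = loop_layers[0], loop_layers[-1]
    order = list(range(first))                   # layers before the loop
    for _ in range(passes):
        order.extend(loop_layers)                # looped block, weights shared
    order.extend(range(last + 1, num_layers))    # layers after the loop
    return order
```

Under these assumptions, `layer_schedule(0.0)` is a plain pass through layers 0 to 10, while `layer_schedule(0.5)` visits layers 3, 4, 5 twice without adding parameters.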
XSA
Applies XSA across all layers.
parameters: {"layers":11}
SmearGate
Uses BOS-fixed SmearGate in the attention stack.
parameters: null
Gated Attention
Uses sparse attention gating with a quantized gate.
parameters: {"scale":0.5}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"beta2":0.99,"matrix_lr":0.026,"min_lr":0.1}
Weight Averaging
EMA
parameters: {"decay":0.9965}
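
For reference, an EMA of the weights with this decay can be sketched as below. The flat list of floats stands in for the model parameters, and `ema_update` is an illustrative name rather than the PR's code.

```python
def ema_update(ema_weights, weights, decay=0.9965):
    """Blend the current weights into the running average, element by element."""
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, weights)]

# Example: the average drifts slowly toward the live weights.
ema = [0.0, 0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0, 2.0])
```

With a decay of 0.9965 the average has an effective horizon of roughly 1 / (1 - 0.9965), or about 286 steps.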
Evaluation
stride-based eval
parameters: {"stride":1536}
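
A minimal sketch of how stride-based evaluation might tile the validation stream, assuming windows of 3072 tokens that advance by 1536 and score only tokens not already scored by an earlier window. The helper name `eval_windows` and the exact scoring rule are assumptions, not taken from the PR.

```python
def eval_windows(num_tokens, window=3072, stride=1536):
    """Return (start, end, score_from) triples.

    Each window feeds tokens [start, end) to the model and scores only
    [score_from, end), so every token is scored exactly once.
    """
    spans = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + window, num_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans
```

With these settings, every window after the first scores its last 1536 tokens with 1536 tokens of preceding context.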
Test-Time Training
score-first TTT
parameters: {"rank":80,"learning_rate":0.0001,"local_lr_mult":0.75}
Sequence Length
sequence_length
train_length: 3072
eval_length: 3072
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85}
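
The warmdown can be sketched as a schedule that holds the peak rate and then decays linearly over the final 85% of steps. This is an assumption-laden illustration: it takes `matrix_lr` (0.026) as the peak, reads `min_lr` (0.1) as a fraction of the peak, and assumes a linear shape; the PR may define any of these differently.

```python
def lr_at(step, total_steps, peak_lr=0.026, warmdown_frac=0.85,
          min_lr_frac=0.1):
    """Learning rate at a given step under a linear warmdown.

    Assumptions: linear decay, and min_lr_frac is a fraction of peak_lr.
    """
    warmdown_start = (1.0 - warmdown_frac) * total_steps
    if step < warmdown_start:
        return peak_lr
    # Interpolate the multiplier from 1.0 down to min_lr_frac.
    t = (step - warmdown_start) / (total_steps - warmdown_start)
    return peak_lr * (1.0 - t * (1.0 - min_lr_frac))
```

Under these assumptions a 1000-step run stays at 0.026 for the first 150 steps and ends near 0.0026.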
Regularization
weight decay
parameters: {"value":0.5}
Compression
pergroup
level: null

Novel Contributions

  • Increased GPTQ calibration batches from 16 to 32 while keeping the PR #2014 stack otherwise unchanged.
  • Ablation-style comparison against PR #2014 to isolate the effect of denser GPTQ Hessian calibration.
  • Full validation coverage with val_tokens matching target_tokens across all seeds.
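
To make the calibration change concrete: GPTQ quantizes each weight matrix using a Hessian estimated from the layer's input activations, H = (2/N) Σ x xᵀ, and more calibration batches give that average more samples. The sketch below accumulates such a running average in plain Python. The function name and the nested-list data layout are illustrative assumptions, not the PR's code.

```python
def accumulate_hessian(batches):
    """Running average of 2 * X^T X over calibration batches.

    Each batch is a list of activation rows (lists of floats).
    """
    dim = len(batches[0][0])
    hessian = [[0.0] * dim for _ in range(dim)]
    seen = 0
    for batch in batches:
        total = seen + len(batch)
        for i in range(dim):
            for j in range(dim):
                gram = sum(row[i] * row[j] for row in batch)
                # Re-weight the old average, then fold in the new batch.
                hessian[i][j] = (hessian[i][j] * seen + 2.0 * gram) / total
        seen = total
    return hessian
```

Because the result is an average over all rows seen, going from 16 to 32 batches leaves the scale of H unchanged and only reduces the sampling noise in its entries.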