PR #2132

open

Record candidate: PR #2014 base + GPTQ_CALIBRATION_BATCHES=32

by codemath3000

val_bpb

1.0576

Architecture

Transformer

Optimizer

Muon

Artifact Size

—

Training Techniques

Quantization

GPTQ

bits: 6

scope: matrices

int7

bits: 7

scope: embeddings

mixed int6/int7/int8

bits: null

scope: model weights

Architecture

Partial RoPE

Uses rotary position embeddings on a subset of dimensions.

parameters: {"dimensions":16}
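The partial RoPE entry above can be illustrated with a minimal sketch. It assumes the 16 rotated dimensions are the first 16 entries of each head vector, paired adjacently, with the common base of 10000.0. The function name, pairing convention, and base are illustrative assumptions, not taken from the PR.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; pass the rest through.

    Assumptions (not from the PR): adjacent-pair layout and base 10000.0.
    """
    if rotary_dims % 2 or rotary_dims > len(vec):
        raise ValueError("rotary_dims must be even and no larger than len(vec)")
    out = list(vec)
    for i in range(rotary_dims // 2):
        # Per-pair rotation angle, following the usual RoPE frequency schedule.
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Because each pair undergoes a pure rotation, the vector norm is preserved and the untouched dimensions carry no positional signal.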

depth recurrence

Loops layers 3-5 with recurrence enabled partway through training.

parameters: {"layers":[3,4,5],"frac":0.35}
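One possible reading of the depth-recurrence parameters above is sketched here: the layer order is plain until a fraction `frac` of training has elapsed, after which layers 3-5 are run a second time. The total of 11 layers is inferred from the XSA entry, and zero-based indexing, the single extra pass, and the function name are assumptions for illustration.

```python
def layer_schedule(step, total_steps, num_layers=11,
                   loop_layers=(3, 4, 5), frac=0.35, extra_passes=1):
    """Return the order in which layer indices are applied at `step`.

    Assumptions (not from the PR): zero-based layer indices and one extra
    pass through the looped block once recurrence is switched on.
    """
    recurrence_on = step >= frac * total_steps
    order = []
    for layer in range(num_layers):
        order.append(layer)
        # After the last looped layer, replay the looped block.
        if recurrence_on and layer == loop_layers[-1]:
            order.extend(list(loop_layers) * extra_passes)
    return order
```

Early in training this yields the 11 layers in order; later the effective depth grows to 14 layer applications without adding parameters.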

XSA

Applies XSA across all layers.

parameters: {"layers":11}

SmearGate

Uses BOS-fixed SmearGate in the attention stack.

parameters: null

Gated Attention

Uses sparse attention gating with a quantized gate.

parameters: {"scale":0.5}

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"beta2":0.99,"matrix_lr":0.026,"min_lr":0.1}

Weight Averaging

EMA

parameters: {"decay":0.9965}
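For reference, an EMA of weights with the decay listed above can be sketched as follows. This is the standard update rule on a flat list of floats, not the PR's actual code, and the function name is hypothetical.

```python
def ema_update(shadow, params, decay=0.9965):
    """Blend current `params` into the running average `shadow`.

    Standard EMA rule; the flat-list representation is an illustrative
    simplification.
    """
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]
```

With decay 0.9965 the average has an effective horizon of roughly 1 / (1 - 0.9965), or about 286 optimizer steps.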

Evaluation

stride-based eval

parameters: {"stride":1536}
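A minimal sketch of stride-based evaluation with the numbers above, assuming a 3072-token window (the listed eval_length) advanced by 1536 tokens, where each window scores only the tokens no earlier window has scored. The function name and the tuple layout are assumptions for illustration.

```python
def eval_windows(num_tokens, window=3072, stride=1536):
    """Return (start, end, score_from) triples.

    The model reads tokens[start:end] as context and only positions
    score_from..end-1 contribute to the metric, so every token is scored
    exactly once. Layout is an assumption, not the PR's code.
    """
    windows = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + window, num_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        if end == num_tokens:
            break
        start += stride
    return windows
```

After the first window, each scored token sees at least 1536 tokens of preceding context, at the cost of processing most tokens twice.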

Test-Time Training

score-first TTT

parameters: {"rank":80,"learning_rate":0.0001,"local_lr_mult":0.75}
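The sketch below reads "score-first" as: each chunk is scored with the current weights before any adaptation on that chunk, so no token is evaluated by a model that has already trained on it. That reading, the function names, and the callable interfaces are assumptions; the rank-80 adapter and learning rates listed above are not modeled.

```python
def score_first_ttt(chunks, score_fn, update_fn, state):
    """Evaluate with test-time training, scoring each chunk before adapting.

    `score_fn(state, chunk)` returns a summed loss and `update_fn(state, chunk)`
    returns the adapted state. Both are hypothetical callables.
    """
    total_loss = 0.0
    total_tokens = 0
    for chunk in chunks:
        # 1) Score first: the state has not yet seen this chunk.
        total_loss += score_fn(state, chunk)
        total_tokens += len(chunk)
        # 2) Only then adapt on the chunk that was just scored.
        state = update_fn(state, chunk)
    return total_loss / max(total_tokens, 1), state
```

Keeping the score step strictly ahead of the update step is what separates this from training on the validation data before measuring it.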

Sequence Length

sequence_length

train_length: 3072

eval_length: 3072

LR Schedule

warmdown

parameters: {"warmdown_frac":0.85}
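One way to read the warmdown setting is sketched here: the learning-rate multiplier stays at 1.0 for the first 15% of training, then decays linearly over the final 85% down to a floor. Treating the optimizer's `min_lr` of 0.1 as that floor (a fraction of the peak rate) and using a linear shape are both assumptions, as is the function name.

```python
def lr_multiplier(step, total_steps, warmdown_frac=0.85, min_lr=0.1):
    """Multiplier applied to the peak learning rate at `step`.

    Assumptions (not from the PR): linear decay, with `min_lr` as a
    fraction of the peak rate.
    """
    warmdown_start = (1.0 - warmdown_frac) * total_steps
    if step <= warmdown_start:
        return 1.0
    progress = (step - warmdown_start) / (total_steps - warmdown_start)
    progress = min(progress, 1.0)
    return 1.0 - (1.0 - min_lr) * progress
```

Under these assumptions the optimizer's `matrix_lr` of 0.026 would end training near 0.0026.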

Regularization

weight decay

parameters: {"value":0.5}

Compression

pergroup

level: null

Novel Contributions

Increased GPTQ calibration batches from 16 to 32 while keeping the PR #2014 stack otherwise unchanged.
Ablation-style comparison against PR #2014 to isolate the effect of denser GPTQ Hessian calibration.
Full validation coverage with val_tokens matching target_tokens across all seeds.
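To make the calibration change concrete, here is a minimal sketch of how a GPTQ-style layer Hessian can be accumulated as a running average over calibration batches, so that raising the batch count from 16 to 32 doubles the number of activation samples behind the estimate. The function names, the pure-Python matrix layout, and the synthetic Gaussian activations are assumptions for illustration, not the PR's code.

```python
import random

def accumulate_hessian(batches, dim):
    """Running estimate of H = (2 / n) * sum_k x_k x_k^T over all samples.

    `batches` is a list of non-empty batches, each a list of activation
    vectors of length `dim` (hypothetical layout).
    """
    H = [[0.0] * dim for _ in range(dim)]
    n_seen = 0
    for batch in batches:
        b = len(batch)
        scale_old = n_seen / (n_seen + b)  # down-weight the previous average
        n_seen += b
        for i in range(dim):
            for j in range(dim):
                acc = sum(x[i] * x[j] for x in batch)
                H[i][j] = H[i][j] * scale_old + 2.0 * acc / n_seen
    return H

def make_batches(num_batches, batch_size, dim, seed=0):
    """Synthetic stand-in for real calibration activations."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(batch_size)]
            for _ in range(num_batches)]

# Same layer, two calibration budgets (16 vs 32 batches).
H16 = accumulate_hessian(make_batches(16, 8, 4), dim=4)
H32 = accumulate_hessian(make_batches(32, 8, 4), dim=4)
```

The streaming update gives the same matrix as averaging over all samples at once, so a larger batch count changes only how many samples inform the estimate, not the estimator itself.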