PR #2133

open

Record candidate: PR #2014 base + GATE_WINDOW=8 + GPTQ_CALIBRATION_BATCHES=32

by codemath3000

val_bpb

1.0576

Architecture

Transformer

Optimizer

Muon

Artifact Size

—

Training Techniques

Architecture

Partial RoPE

Uses partial rotary positional embeddings.

parameters: {"dimensions":16}
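
A minimal sketch of what rotating only 16 dimensions could look like for a single head vector. The choice of the first 16 dimensions, the half-split pairing, and the base of 10000 are assumptions for illustration; the card only states that 16 dimensions are rotary.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; pass the rest through.

    Assumed convention: entry i is paired with entry i + rotary_dims // 2.
    """
    assert rotary_dims % 2 == 0 and rotary_dims <= len(vec)
    half = rotary_dims // 2
    out = list(vec)
    for i in range(half):
        # Each pair gets its own frequency, lower for larger i.
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```

Dimensions past the first 16 carry no positional signal in this sketch, so they behave like position-free content channels.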

depth recurrence

Loops layers 3-5 with recurrence enabled partway through training.

parameters: {"layers":[3,5],"frac":0.35}
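
One way to read these parameters, sketched as a function that returns the order in which layers run at a given step. The layer count of 11 (taken from the XSA entry), the inclusive 3-5 range, two passes through the loop, and `frac` meaning "fraction of training after which recurrence switches on" are all assumptions, not confirmed by the card.

```python
def layer_schedule(step, total_steps, num_layers=11, loop=(3, 5),
                   frac=0.35, passes=2):
    """Return the list of layer indices executed at `step` (hypothetical helper)."""
    start, end = loop
    if step < frac * total_steps:
        # Recurrence not yet enabled: plain feed-forward order.
        return list(range(num_layers))
    block = list(range(start, end + 1))
    # Run the looped block `passes` times before continuing upward.
    return list(range(start)) + block * passes + list(range(end + 1, num_layers))
```

Under these assumptions the effective depth grows from 11 to 14 layer applications once recurrence is on, with no additional parameters.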

SmearGate

BOS-fixed SmearGate used in the attention stack.

parameters: null

Gated Attention

Sparse attention gating with a fixed window over recent tokens.

parameters: {"window":8,"scale":0.5}

XSA

Applies XSA across all layers.

parameters: {"layers":11}

Optimizer

Muon

weight_decay: 0.5

momentum: null

other_params: {"beta2":0.99,"adam_on":"embedding/scalars"}

Weight Averaging

EMA

parameters: {"decay":0.9965}
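
For reference, a sketch of the standard EMA update with this decay, written over plain lists of floats. The function name is hypothetical. A decay of 0.9965 corresponds to an averaging horizon of roughly 1 / (1 - 0.9965) ≈ 286 steps.

```python
def ema_update(ema, params, decay=0.9965):
    """Blend current parameters into the running average, element by element."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]
```

In a training loop this would be called once per optimizer step, with the averaged weights used for evaluation and export.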

Quantization

GPTQ

bits: 6

scope: matrices

GPTQ

bits: 7

scope: embeddings

mixed int6/int7

bits: null

scope: weights + embeddings

GPTQ-lite

bits: 8

scope: AWQ-lite group quant

QAT

bits: null

scope: TTT quantized phased LoRA

Test-Time Training

score-first TTT

parameters: {"rank":80,"learning_rate":0.0001,"local_lr_mult":0.75,"mask":"no_qv","batch_size":24,"chunk_size":48}

Sequence Length

sequence_length

train_length: 3072

eval_length: 3072

Evaluation

stride-based eval

parameters: {"stride":1536}
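
A sketch of how a stride of 1536 with an evaluation length of 3072 could partition a token stream. It assumes the usual sliding-window scheme in which each window scores only the tokens not already scored by an earlier window; the helper name and return format are hypothetical.

```python
def eval_windows(num_tokens, seq_len=3072, stride=1536):
    """Return (window_start, window_end, score_start) triples.

    Tokens in [score_start, window_end) are scored; tokens in
    [window_start, score_start) serve only as context.
    """
    windows = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + seq_len, num_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        if end == num_tokens:
            break
        start += stride
    return windows
```

With these values every token after the first window is scored with at least 1536 tokens of preceding context, at the cost of roughly twice as many forward passes as non-overlapping evaluation.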

LR Schedule

warmdown

parameters: {"warmdown_frac":0.85}
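
A sketch of a warmdown schedule with this fraction, assuming the learning rate holds at its base value and then decays linearly to zero over the final 85% of steps. The linear shape and the absence of warmup are assumptions; the card gives only `warmdown_frac`.

```python
def lr_scale(step, total_steps, warmdown_frac=0.85):
    """Multiplier applied to the base learning rate at `step`."""
    warmdown_steps = warmdown_frac * total_steps
    remaining = total_steps - step
    if remaining >= warmdown_steps:
        return 1.0  # still in the constant phase
    return max(remaining, 0) / warmdown_steps
```

With `warmdown_frac` at 0.85 the constant phase covers only the first 15% of training, so most of the run is spent with a decaying learning rate.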

Regularization

weight decay

parameters: {"value":0.5}

Compression

custom

level: null

Novel Contributions

Reduced SparseAttnGate window from 12 to 8
Increased GPTQ calibration batches from 16 to 32
Kept the PR #2014 stack otherwise unchanged for a clean ablation
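
The two changes can be pictured as configuration overrides. This sketch assumes `GATE_WINDOW` and `GPTQ_CALIBRATION_BATCHES` are read from environment variables, as the `NAME=value` form in the title suggests, and uses the PR #2014 values (12 and 16) as defaults; the helper itself is hypothetical.

```python
import os

def read_overrides(env=os.environ):
    """Read the two ablated settings, falling back to the PR #2014 values."""
    return {
        "GATE_WINDOW": int(env.get("GATE_WINDOW", "12")),
        "GPTQ_CALIBRATION_BATCHES": int(env.get("GPTQ_CALIBRATION_BATCHES", "16")),
    }

# Settings for this record candidate, passed explicitly for illustration.
candidate = read_overrides({"GATE_WINDOW": "8", "GPTQ_CALIBRATION_BATCHES": "32"})
```

Because everything else in the stack stays fixed, any difference in val_bpb relative to PR #2014 can be attributed to these two settings together.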