PR #2051

closed

Record: PR #1908 base + GPTQ module-damp + Asym Logit Rescale — val_bpb 1.06048 (3-seed mean)

by dexhunter
val_bpb
1.0605
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~15.87 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: module-specific
int8
bits: 8
scope: AWQ-lite salient channels
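The module-specific GPTQ damping named in the title can be sketched as follows: GPTQ adds a damping term proportional to the mean of the Hessian diagonal before its Cholesky step, and here each module class (embeddings, MLP, attention) gets its own damping factor. The factor values below are illustrative assumptions, not the PR's settings.

```python
import numpy as np

np.random.seed(0)

# Hypothetical per-module damping factors -- the record only says the
# factors are separate per module class, not what their values are.
MODULE_DAMP = {"embedding": 0.05, "mlp": 0.01, "attention": 0.02}

def damped_hessian(H, module_kind):
    """GPTQ-style damping: add damp * mean(diag(H)) to the diagonal,
    with the damping factor chosen per module type."""
    damp = MODULE_DAMP[module_kind]
    return H + damp * np.mean(np.diag(H)) * np.eye(H.shape[0])

# Toy Hessian from calibration activations: H = X^T X can be near-singular
X = np.random.randn(16, 8)
H = X.T @ X
Hd = damped_hessian(H, "mlp")
# Damping keeps the Cholesky factorization GPTQ relies on numerically stable
L = np.linalg.cholesky(Hd)
```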
Architecture
SparseAttnGate
Sparse per-head gate inside attention
parameters: null
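One plausible reading of the sparse per-head gate (the record gives no parameters): each attention head's output is scaled by a sigmoid gate, and heads whose gate logit falls below a threshold are zeroed entirely, dropping them from the residual stream. The gating rule below is an assumption, not the PR's implementation.

```python
import numpy as np

def gate_heads(head_outputs, gate_logits, threshold=0.0):
    """Sparse per-head attention gate (sketch): sigmoid-gate each head and
    hard-zero heads whose gate logit is at or below `threshold`."""
    gates = 1.0 / (1.0 + np.exp(-gate_logits))      # (H,) sigmoid gates
    gates = np.where(gate_logits > threshold, gates, 0.0)
    return head_outputs * gates[:, None, None]      # (H, T, Dh)

# Example: 4 heads, 3 positions, head dim 2; heads 0 and 2 are gated off
outputs = np.ones((4, 3, 2))
gated = gate_heads(outputs, np.array([-1.0, 2.0, -3.0, 1.0]))
```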
SmearGate
BOS-masked causal smear gate with windowed lookback
parameters: {"window":12}
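A minimal sketch of the SmearGate entry: each position blends in information from up to `window` preceding positions, with the lookback clamped so it never crosses the most recent BOS. Only `window: 12` comes from the record; the mean-blend rule and the externally supplied gate are assumptions standing in for the learned gate.

```python
import numpy as np

def smear_gate(x, bos_mask, gate, window=12):
    """BOS-masked causal smear with windowed lookback (sketch).
    x: (T, D) activations; bos_mask: (T,) True at sequence starts;
    gate: (T, 1) per-token blend weight in [0, 1] (learned in the real model)."""
    T, _ = x.shape
    out = x.copy()
    for t in range(T):
        lo = max(0, t - window)
        for s in range(t, lo - 1, -1):   # clamp lookback at the latest BOS
            if bos_mask[s]:
                lo = s
                break
        prev = x[lo:t]                   # strictly causal: excludes t itself
        if len(prev):
            out[t] = (1 - gate[t]) * x[t] + gate[t] * prev.mean(axis=0)
    return out

# Example: BOS at positions 0 and 5; BOS positions are never smeared,
# and position 6 can only look back as far as the BOS at 5
x = np.random.randn(8, 4)
bos = np.zeros(8, dtype=bool)
bos[[0, 5]] = True
y = smear_gate(x, bos, np.full((8, 1), 0.5))
```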
depth recurrence
Triple-loop recurrence on encoder slice
parameters: {"layers":[3,5],"loops":2}
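The depth-recurrence entry can be sketched as re-running a contiguous slice of blocks with shared weights. Reading `"layers": [3, 5]` as an inclusive slice and `"loops": 2` as the repeat count is an assumption (the entry's "triple-loop" wording may count passes differently).

```python
def forward_with_recurrence(blocks, x, start=3, end=5, loops=2):
    """Depth recurrence sketch: run blocks[start..end] `loops` times with
    shared weights, sandwiched by the remaining blocks run once each."""
    for block in blocks[:start]:
        x = block(x)
    for _ in range(loops):
        for block in blocks[start:end + 1]:
            x = block(x)
    for block in blocks[end + 1:]:
        x = block(x)
    return x

# Toy check: with 8 increment "blocks", blocks 3..5 run twice,
# so 0 passes through 3 + 2*3 + 2 = 11 applications
inc = lambda v: v + 1
result = forward_with_recurrence([inc] * 8, 0)
```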
Gated Attention
Attention gating used in the model stack
parameters: null
Regularization
logit softcap
parameters: {"asymmetric":true}
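The asymmetric logit softcap (the "Asym Logit Rescale" of the title) plausibly means a tanh softcap whose bound differs by sign. A minimal sketch under that assumption, with illustrative cap values rather than the PR's:

```python
import numpy as np

def asymmetric_softcap(logits, cap_pos=30.0, cap_neg=15.0):
    """Asymmetric softcap (sketch): the usual cap * tanh(logits / cap),
    but with a sign-dependent cap, bounding outputs to (-cap_neg, cap_pos).
    Cap values here are illustrative assumptions."""
    pos = cap_pos * np.tanh(np.maximum(logits, 0.0) / cap_pos)
    neg = cap_neg * np.tanh(np.minimum(logits, 0.0) / cap_neg)
    return pos + neg

z = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
y = asymmetric_softcap(z)
```

Near zero the map is close to the identity; large positive logits saturate toward `cap_pos` and large negative ones toward `-cap_neg`.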
Test-Time Training
score-first TTT
parameters: {"phases":3,"prefix_docs":2500,"lora_rank":80}
Evaluation
sliding window eval
parameters: {"stride":64}
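The sliding-window eval can be sketched as a span layout where each step scores `stride` new tokens with up to `window` tokens of left context, so every token is scored exactly once. `window=2048` (the eval_length below) and `stride=64` come from the record; the PR's actual eval loop may differ in detail.

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Yield (ctx_start, score_start, end) spans for sliding-window eval:
    tokens in [score_start, end) are scored, conditioned on context
    starting at ctx_start."""
    spans, score_start = [], 0
    while score_start < n_tokens:
        end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, score_start, end))
        score_start = end
    return spans

# Example with a small window to show the context sliding forward
spans = sliding_window_spans(200, window=128, stride=64)
```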
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Optimizer
AdamW
weight_decay: 0.5
momentum: null
other_params: {"beta2":0.99,"grad_clip_norm":0.3,"min_lr":0.1,"matrix_lr":0.026,"global_ttt_momentum":0.9}
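Reading `min_lr: 0.1` alongside `matrix_lr: 0.026` as a decay floor expressed as a fraction of the peak rate is one natural interpretation; under that assumption a cosine schedule would look like the sketch below. The record does not state the schedule shape, so this is illustrative only.

```python
import math

def lr_at(step, total_steps, peak_lr=0.026, min_lr_frac=0.1):
    """Cosine decay from peak_lr down to min_lr_frac * peak_lr. Treating
    matrix_lr (0.026) as the peak and min_lr (0.1) as a fraction of it is
    an assumption about the record's parameters."""
    floor = min_lr_frac * peak_lr
    frac = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return floor + (peak_lr - floor) * frac
```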
Compression
lrzip
level: null

Novel Contributions

  • GPTQ per-module damping with separate damping factors for embeddings, MLP, and attention
  • Composition with asymmetric logit rescale
  • Extension of the PR #1908 base quantization stack with module-specific GPTQ regularization
  • Record-setting 3-seed mean validation BPB of 1.06048