PR #2051

closed

Record: PR #1908 base + GPTQ module-damp + Asym Logit Rescale — val_bpb 1.06048 (3-seed mean)

by dexhunter
val_bpb
1.0605
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~15.87 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: module-specific
int8
bits: 8
scope: AWQ-lite salient channels
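The module-specific GPTQ damping named in the title can be sketched as follows: GPTQ adds a damping term proportional to the mean of the Hessian diagonal before its Cholesky step, and here each module class (embeddings, MLP, attention) gets its own damping factor. The factor values below are illustrative assumptions, not the PR's settings.

```python
import numpy as np

np.random.seed(0)

# Hypothetical per-module damping factors -- the record only says the
# factors are separate per module class, not what their values are.
MODULE_DAMP = {"embedding": 0.05, "mlp": 0.01, "attention": 0.02}

def damped_hessian(H, module_kind):
    """GPTQ-style damping: add damp * mean(diag(H)) to the diagonal,
    with the damping factor chosen per module type."""
    damp = MODULE_DAMP[module_kind]
    return H + damp * np.mean(np.diag(H)) * np.eye(H.shape[0])

# Toy Hessian from calibration activations: H = X^T X can be near-singular
X = np.random.randn(16, 8)
H = X.T @ X
Hd = damped_hessian(H, "mlp")
# Damping keeps the Cholesky factorization GPTQ relies on numerically stable
L = np.linalg.cholesky(Hd)
```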
Architecture
SparseAttnGate
Sparse per-head gate inside attention
parameters: null
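One plausible reading of the sparse per-head gate (the record gives no parameters): each attention head's output is scaled by a sigmoid gate, and heads whose gate logit falls below a threshold are zeroed entirely, dropping them from the residual stream. The gating rule below is an assumption, not the PR's implementation.

```python
import numpy as np

def gate_heads(head_outputs, gate_logits, threshold=0.0):
    """Sparse per-head attention gate (sketch): sigmoid-gate each head and
    hard-zero heads whose gate logit is at or below `threshold`."""
    gates = 1.0 / (1.0 + np.exp(-gate_logits))      # (H,) sigmoid gates
    gates = np.where(gate_logits > threshold, gates, 0.0)
    return head_outputs * gates[:, None, None]      # (H, T, Dh)

# Example: 4 heads, 3 positions, head dim 2; heads 0 and 2 are gated off
outputs = np.ones((4, 3, 2))
gated = gate_heads(outputs, np.array([-1.0, 2.0, -3.0, 1.0]))
```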
SmearGate
BOS-masked causal smear gate with windowed lookback
parameters: {"window":12}
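A minimal sketch of the SmearGate entry: each position blends in information from up to `window` preceding positions, with the lookback clamped so it never crosses the most recent BOS. Only `window: 12` comes from the record; the mean-blend rule and the externally supplied gate are assumptions standing in for the learned gate.

```python
import numpy as np

def smear_gate(x, bos_mask, gate, window=12):
    """BOS-masked causal smear with windowed lookback (sketch).
    x: (T, D) activations; bos_mask: (T,) True at sequence starts;
    gate: (T, 1) per-token blend weight in [0, 1] (learned in the real model)."""
    T, _ = x.shape
    out = x.copy()
    for t in range(T):
        lo = max(0, t - window)
        for s in range(t, lo - 1, -1):   # clamp lookback at the latest BOS
            if bos_mask[s]:
                lo = s
                break
        prev = x[lo:t]                   # strictly causal: excludes t itself
        if len(prev):
            out[t] = (1 - gate[t]) * x[t] + gate[t] * prev.mean(axis=0)
    return out

# Example: BOS at positions 0 and 5; BOS positions are never smeared,
# and position 6 can only look back as far as the BOS at 5
x = np.random.randn(8, 4)
bos = np.zeros(8, dtype=bool)
bos[[0, 5]] = True
y = smear_gate(x, bos, np.full((8, 1), 0.5))
```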
depth recurrence
Triple-loop recurrence on encoder slice
parameters: {"layers":[3,5],"loops":2}
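The depth-recurrence entry can be sketched as re-running a contiguous slice of blocks with shared weights. Reading `"layers": [3, 5]` as an inclusive slice and `"loops": 2` as the repeat count is an assumption (the entry's "triple-loop" wording may count passes differently).

```python
def forward_with_recurrence(blocks, x, start=3, end=5, loops=2):
    """Depth recurrence sketch: run blocks[start..end] `loops` times with
    shared weights, sandwiched by the remaining blocks run once each."""
    for block in blocks[:start]:
        x = block(x)
    for _ in range(loops):
        for block in blocks[start:end + 1]:
            x = block(x)
    for block in blocks[end + 1:]:
        x = block(x)
    return x

# Toy check: with 8 increment "blocks", blocks 3..5 run twice,
# so 0 passes through 3 + 2*3 + 2 = 11 applications
inc = lambda v: v + 1
result = forward_with_recurrence([inc] * 8, 0)
```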
Gated Attention
Attention gating used in the model stack
parameters: null
Regularization
logit softcap
parameters: {"asymmetric":true}
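The asymmetric logit softcap (the "Asym Logit Rescale" of the title) plausibly means a tanh softcap whose bound differs by sign. A minimal sketch under that assumption, with illustrative cap values rather than the PR's:

```python
import numpy as np

def asymmetric_softcap(logits, cap_pos=30.0, cap_neg=15.0):
    """Asymmetric softcap (sketch): the usual cap * tanh(logits / cap),
    but with a sign-dependent cap, bounding outputs to (-cap_neg, cap_pos).
    Cap values here are illustrative assumptions."""
    pos = cap_pos * np.tanh(np.maximum(logits, 0.0) / cap_pos)
    neg = cap_neg * np.tanh(np.minimum(logits, 0.0) / cap_neg)
    return pos + neg

z = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
y = asymmetric_softcap(z)
```

Near zero the map is close to the identity; large positive logits saturate toward `cap_pos` and large negative ones toward `-cap_neg`.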
Test-Time Training
score-first TTT
parameters: {"phases":3,"prefix_docs":2500,"lora_rank":80}
Evaluation
sliding window eval
parameters: {"stride":64}
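The sliding-window eval can be sketched as a span layout where each step scores `stride` new tokens with up to `window` tokens of left context, so every token is scored exactly once. `window=2048` (the eval_length below) and `stride=64` come from the record; the PR's actual eval loop may differ in detail.

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Yield (ctx_start, score_start, end) spans for sliding-window eval:
    tokens in [score_start, end) are scored, conditioned on context
    starting at ctx_start."""
    spans, score_start = [], 0
    while score_start < n_tokens:
        end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, score_start, end))
        score_start = end
    return spans

# Example with a small window to show the context sliding forward
spans = sliding_window_spans(200, window=128, stride=64)
```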
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Optimizer
AdamW
weight_decay: 0.5
momentum: null
other_params: {"beta2":0.99,"grad_clip_norm":0.3,"min_lr":0.1,"matrix_lr":0.026,"global_ttt_momentum":0.9}
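Reading `min_lr: 0.1` alongside `matrix_lr: 0.026` as a decay floor expressed as a fraction of the peak rate is one natural interpretation; under that assumption a cosine schedule would look like the sketch below. The record does not state the schedule shape, so this is illustrative only.

```python
import math

def lr_at(step, total_steps, peak_lr=0.026, min_lr_frac=0.1):
    """Cosine decay from peak_lr down to min_lr_frac * peak_lr. Treating
    matrix_lr (0.026) as the peak and min_lr (0.1) as a fraction of it is
    an assumption about the record's parameters."""
    floor = min_lr_frac * peak_lr
    frac = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return floor + (peak_lr - floor) * frac
```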
Compression
lrzip
level: null

Novel Contributions

  • GPTQ per-module damping with separate damping factors for embeddings, MLP, and attention
  • Composition with asymmetric logit rescale
  • Extension of the PR #1908 base quantization stack with module-specific GPTQ regularization
  • Record-setting 3-seed mean validation BPB of 1.06048