PR #1229

open

Record: Scored-Position SLOT + Per-Sample Delta + GPTQ (val_bpb: 0.9300)

by resouer
val_bpb
0.9300
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~15.6 MB

Training Techniques

Test-Time Training
score-first TTT
parameters: {"learning_rate":0.008,"steps":16}
Evaluation
sliding window eval
parameters: {"stride":64}
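
A minimal sketch of how a stride-64 sliding-window evaluation can assign each token to exactly one scoring window. The window length (128 here) and the function name `sliding_windows` are assumptions for illustration; the PR summary states only the stride.

```python
def sliding_windows(n_tokens, window=128, stride=64):
    """Return (start, end, score_from) triples for each evaluation window.

    Tokens in [score_from, end) are scored in that window; tokens in
    [start, score_from) are context only. `window` is a hypothetical value.
    """
    spans = []
    scored_upto = 0  # first token index that has not been scored yet
    start = 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans


spans = sliding_windows(300)
# Every token is scored once, and after the first window each window
# scores only its trailing positions.
```

The `score_from` value is what a scored-position mask would be built from: positions before it contribute context but no loss.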
Architecture
U-Net skip connections
Sigmoid-gated skip connections using lerp blending between skip and residual paths.
parameters: {"skip_connections":5,"dimensions":512}
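
The lerp blending described above could look like the following sketch. It assumes one learnable gate logit per channel and the convention that a gate of 0 keeps the residual path and a gate of 1 keeps the skip path; the PR summary does not specify either detail, and `gated_skip` is a hypothetical name.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_skip(residual, skip, gate_logits):
    """Blend two vectors channel by channel: lerp(residual, skip, sigmoid(gate)).

    residual, skip, gate_logits are equal-length lists of floats.
    """
    out = []
    for r, s, g in zip(residual, skip, gate_logits):
        w = sigmoid(g)             # gate weight in (0, 1)
        out.append(r + w * (s - r))  # linear interpolation
    return out


# A gate logit of 0 gives an even blend of the two paths.
blended = gated_skip([0.0, 2.0], [2.0, 4.0], [0.0, 0.0])
```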
logit bias
Per-sample logit-space bias optimized alongside hidden-state delta.
parameters: {"shape":"[bsz,1,vocab]"}
per-sample delta
Per-sample delta optimized in hidden space instead of a shared delta.
parameters: {"shape":"[bsz,1,512]"}
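
The `[bsz,1,512]` and `[bsz,1,vocab]` shapes imply one offset vector per sample, broadcast over the sequence axis. The sketch below illustrates that broadcast with nested lists and toy sizes; where the offsets are applied (for example, the delta to the final hidden state and the bias to the output logits) is an assumption, and `add_per_sample` is a hypothetical helper.

```python
def add_per_sample(x, offset):
    """Add offset (shape [bsz, 1, d]) to x (shape [bsz, seq, d]).

    Each sample b gets its own vector offset[b][0], shared by all of
    that sample's sequence positions.
    """
    return [
        [[v + o for v, o in zip(row, offset[b][0])] for row in x[b]]
        for b in range(len(x))
    ]


# Toy sizes: bsz=2, seq=2, d=2 (the PR uses d=512 for the hidden delta).
hidden = [[[0, 0], [1, 1]], [[2, 2], [3, 3]]]
delta = [[[10, 20]], [[100, 200]]]
shifted = add_per_sample(hidden, delta)
```

A shared delta would instead have shape `[1,1,d]`, so every sample in the batch would receive the same offset.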
Quantization
GPTQ
bits: 6
scope: all
Compression
Brotli
level: 11
LR Schedule
cosine decay
parameters: {"start_lr":0.008,"end_lr":0.0008,"steps":16}
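
A sketch of a cosine decay with the listed endpoints. It assumes the schedule starts at `start_lr` on step 0 and reaches `end_lr` on the last of the 16 steps; the exact step indexing used in the PR is not stated.

```python
import math


def cosine_lr(step, total_steps=16, start_lr=0.008, end_lr=0.0008):
    """Learning rate for a 0-indexed step under half-cosine decay."""
    # Fraction of the schedule completed, 0.0 at the first step and
    # 1.0 at the last step (an assumed convention).
    t = step / max(total_steps - 1, 1)
    return end_lr + 0.5 * (start_lr - end_lr) * (1.0 + math.cos(math.pi * t))


schedule = [cosine_lr(i) for i in range(16)]
```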
Regularization
logit softcap
parameters: null
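
No softcap parameters are listed. One common formulation, shown here as an assumption rather than as this PR's implementation, squashes logits with a scaled tanh so their magnitude never exceeds a cap. The cap value of 30.0 is hypothetical.

```python
import math


def softcap(logits, cap=30.0):
    """Bound each logit to [-cap, cap] via cap * tanh(logit / cap).

    Small logits pass through almost unchanged; large ones saturate.
    """
    return [cap * math.tanh(z / cap) for z in logits]


capped = softcap([-1000.0, 0.0, 0.1, 1000.0])
```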

Novel Contributions

  • Scored-position SLOT mask that aligns delta training with the positions scored at eval
  • Per-sample delta instead of a shared delta
  • Per-sample logit bias for direct logit-space adaptation
  • Training-data GPTQ calibration using real batches instead of autoregressive self-generated data
  • Cosine learning-rate schedule for SLOT optimization
  • Sigmoid-gated skip connections
  • Brotli-11 compression with byte-shuffle
  • Lower GPTQ block size of 64
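
To illustrate the byte-shuffle step named in the list above: the sketch below groups byte 0 of every element together, then byte 1, and so on, which tends to make the stream more compressible. The element width and function names are assumptions; the PR summary does not describe the layout it shuffles.

```python
def byte_shuffle(data: bytes, width: int) -> bytes:
    """Reorder so that the k-th byte of every element is contiguous."""
    if len(data) % width:
        raise ValueError("data length must be a multiple of width")
    return b"".join(data[k::width] for k in range(width))


def byte_unshuffle(data: bytes, width: int) -> bytes:
    """Invert byte_shuffle for the same element width."""
    if len(data) % width:
        raise ValueError("data length must be a multiple of width")
    n = len(data) // width  # number of elements
    out = bytearray(len(data))
    for k in range(width):
        out[k::width] = data[k * n:(k + 1) * n]
    return bytes(out)


# Three little-endian 16-bit values: low bytes end up first, high bytes last.
raw = b"\x01\x00\x02\x00\x03\x00"
shuffled = byte_shuffle(raw, 2)
restored = byte_unshuffle(shuffled, 2)
```

With the third-party `brotli` package installed, the shuffled bytes could then be passed to `brotli.compress(shuffled, quality=11)`; that pairing is an assumption about how the two steps are combined.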