PR #1229
open
Record: Scored-Position SLOT + Per-Sample Delta + GPTQ (val_bpb: 0.9300)
by resouer
val_bpb
0.9300
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~15.6 MB
Training Techniques
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.008,"steps":16}
Evaluation
sliding window eval
parameters: {"stride":64}
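A minimal sketch of how sliding-window evaluation with a stride of 64 might lay out its windows, so that every token is scored exactly once while later tokens see as much left context as possible. The function name, the tuple layout, and the context length of 128 in the usage example are assumptions for illustration; the PR summary only states the stride.

```python
def sliding_windows(n_tokens, seq_len, stride):
    """Return (start, end, score_from) triples.

    Each window covers tokens [start, end); only tokens in
    [score_from, end) contribute to the metric, so no token is
    counted twice. Assumes seq_len >= stride.
    """
    windows = []
    prev_end = 0   # first token that has not been scored yet
    start = 0
    while prev_end < n_tokens:
        end = min(start + seq_len, n_tokens)
        windows.append((start, end, prev_end))
        prev_end = end
        start += stride
    return windows


# Hypothetical sizes: 200 tokens, context of 128, stride of 64.
for start, end, score_from in sliding_windows(200, 128, 64):
    print(f"window [{start}, {end}) scores [{score_from}, {end})")
```

After the first window, each step scores only the newest `stride` tokens, which is why a smaller stride costs more compute but gives each scored token a longer context.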
Architecture
U-Net skip connections
Sigmoid-gated skip connections using lerp blending between skip and residual paths.
parameters: {"skip_connections":5,"dimensions":512}
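The gated blend described above can be sketched as a linear interpolation whose weight is the sigmoid of a learned gate. This is an illustrative assumption about the form of the gate: whether the PR uses one scalar per connection or one gate per channel is not stated, and the names below are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_skip(residual, skip, gate_logit):
    """Blend two equally sized vectors: lerp(residual, skip, sigmoid(gate_logit)).

    gate -> 0 keeps the residual path, gate -> 1 takes the skip path.
    """
    g = sigmoid(gate_logit)
    return [r + g * (s - r) for r, s in zip(residual, skip)]


# A gate logit of 0 gives an even mix of the two paths.
print(gated_skip([0.0, 2.0], [2.0, 4.0], 0.0))
```

Because the gate is squashed into (0, 1), the output always stays between the two inputs, unlike an unconstrained additive skip.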
logit bias
Per-sample logit-space bias optimized alongside hidden-state delta.
parameters: {"shape":"[bsz,1,vocab]"}
per-sample delta
Per-sample delta optimized in hidden space instead of a shared delta.
parameters: {"shape":"[bsz,1,512]"}
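The shapes `[bsz,1,512]` and `[bsz,1,vocab]` imply that each sample carries its own offset, broadcast across every position in its sequence. Here is a small sketch of that broadcast using plain nested lists with toy sizes; a tensor library would do the same through ordinary broadcasting. The helper name is hypothetical, and the optimization of the offsets themselves is not shown.

```python
def add_per_sample(x, per_sample):
    """Add a per-sample offset to every position of that sample.

    x:          nested list shaped [bsz][seq][dim]
    per_sample: nested list shaped [bsz][1][dim]
    """
    return [
        [[v + o for v, o in zip(row, offset[0])] for row in sample]
        for sample, offset in zip(x, per_sample)
    ]


# Toy sizes: bsz=2, seq=2, dim=2 (the PR lists dim=512 for the hidden delta).
hidden = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
delta = [[[10, 20]], [[100, 200]]]
print(add_per_sample(hidden, delta))
```

The same helper applies to the logit bias, with the last axis being the vocabulary instead of the hidden dimension.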
Quantization
GPTQ
bits: 6
scope: all
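To make the storage format concrete, the sketch below quantizes weights to signed 6-bit codes with one scale per block of 64 values (the block size named in the contributions list). Note that this is plain round-to-nearest, not GPTQ: GPTQ additionally uses calibration data to compensate rounding error, which is omitted here. The function names and the symmetric scaling scheme are assumptions.

```python
def quantize_rtn(weights, bits=6, block_size=64):
    """Symmetric round-to-nearest quantization with one scale per block."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    codes, scales = [], []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        scale = max(abs(w) for w in block) / qmax
        if scale == 0.0:
            scale = 1.0  # all-zero block: any scale works
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in block)
    return codes, scales


def dequantize(codes, scales, block_size=64):
    return [q * scales[i // block_size] for i, q in enumerate(codes)]


# Synthetic weights in [-1, 1], two blocks of 64.
weights = [((i * 37) % 101 - 50) / 50 for i in range(128)]
codes, scales = quantize_rtn(weights)
restored = dequantize(codes, scales)
print(max(abs(w - r) for w, r in zip(weights, restored)))
```

Smaller blocks mean more scales to store but a tighter fit per block, which is the trade-off behind choosing a block size.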
Compression
Brotli
level: 11
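The contributions list pairs Brotli with a byte-shuffle step. One common form of byte-shuffle, sketched here as an assumption about what the PR does, regroups the bytes of fixed-width elements so that all first bytes come first, then all second bytes, and so on; this tends to make the stream more repetitive before compression. The element width and function names are hypothetical, and the Brotli call itself is left out to keep the example standard-library only.

```python
def byte_shuffle(data: bytes, itemsize: int) -> bytes:
    """Group byte 0 of every element, then byte 1, and so on."""
    if len(data) % itemsize:
        raise ValueError("data length must be a multiple of itemsize")
    return b"".join(data[i::itemsize] for i in range(itemsize))


def byte_unshuffle(data: bytes, itemsize: int) -> bytes:
    """Invert byte_shuffle."""
    if len(data) % itemsize:
        raise ValueError("data length must be a multiple of itemsize")
    n = len(data) // itemsize
    out = bytearray(len(data))
    for i in range(itemsize):
        out[i::itemsize] = data[i * n:(i + 1) * n]
    return bytes(out)


# Three 2-byte elements: "a1", "b2", "c3".
shuffled = byte_shuffle(b"a1b2c3", 2)
print(shuffled, byte_unshuffle(shuffled, 2))
```

The shuffle is lossless and cheap, so it only needs to be undone once after decompression when the artifact is loaded.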
LR Schedule
cosine decay
parameters: {"start_lr":0.008,"end_lr":0.0008,"steps":16}
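The schedule parameters above can be read as a half-cosine from 0.008 down to 0.0008 over 16 steps. A minimal sketch follows; it assumes the final step lands exactly on `end_lr`, which the summary does not specify, and the function name is hypothetical.

```python
import math


def cosine_lr(step, total_steps=16, start_lr=0.008, end_lr=0.0008):
    """Cosine decay from start_lr at step 0 to end_lr at the last step."""
    progress = step / max(total_steps - 1, 1)
    return end_lr + 0.5 * (start_lr - end_lr) * (1.0 + math.cos(math.pi * progress))


for step in range(16):
    print(step, round(cosine_lr(step), 6))
```

The curve is flat near both ends and steepest in the middle, so the first few steps take near-full-size updates before the rate tapers off.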
Regularization
logit softcap
parameters: null
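Logit softcapping is usually implemented by squashing logits through a scaled tanh so they cannot exceed a fixed magnitude. The sketch below shows that common form; the cap value of 30.0 is a placeholder, since the summary lists no parameters for this entry.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap); near-identity for small values."""
    return cap * math.tanh(logit / cap)


for value in (0.0, 1.0, 30.0, 1000.0):
    print(value, softcap(value))
```

Unlike hard clipping, the tanh form keeps a nonzero gradient everywhere, which matters when logits are being adapted at test time.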
Novel Contributions
- Scored-position SLOT mask that aligns delta training with the positions scored at evaluation
- Per-sample delta instead of a shared delta
- Per-sample logit bias for direct logit-space adaptation
- Training-data GPTQ calibration using real batches instead of autoregressive self-generated data
- Cosine learning-rate schedule for SLOT optimization
- Sigmoid-gated skip connections
- Brotli-11 compression with byte-shuffle
- Lower GPTQ block size of 64
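As a rough illustration of the scored-position mask in the first item, the loss used to fit the delta can be restricted to the positions that evaluation will actually score. The sketch assumes those are marked by a 0/1 mask over per-token losses (for example, the last 64 tokens of a window under the stride-64 evaluation); the function name and the mask construction are hypothetical.

```python
def masked_mean_loss(token_losses, scored_mask):
    """Average per-token losses over positions where scored_mask is 1."""
    if len(token_losses) != len(scored_mask):
        raise ValueError("token_losses and scored_mask must have equal length")
    total = sum(loss for loss, m in zip(token_losses, scored_mask) if m)
    count = sum(1 for m in scored_mask if m)
    if count == 0:
        raise ValueError("mask selects no positions")
    return total / count


# Only the last two positions are scored in this toy window.
print(masked_mean_loss([1.0, 2.0, 3.0, 5.0], [0, 0, 1, 1]))
```

Without the mask, the optimization would also spend effort on context-only positions that never enter the reported metric.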