PR #1318
openRecord: TTT-AdamW + SLOT L-BFGS25 LogitDelta + GPTQ DAMP=0.005 — val_bpb 1.00955
by renqianluo
val_bpb
1.00955
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~15.71 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: all
QAT
bits: null
scope: all
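The record's title advertises GPTQ with DAMP=0.005, i.e. a smaller diagonal-damping fraction than GPTQ's common default of 0.01. A minimal numpy sketch of what that damping does (the function name is illustrative; real GPTQ applies this to the layer-input Hessian before its Cholesky-based column sweep):

```python
import numpy as np

def damp_hessian(H, percdamp=0.005):
    """Add percdamp * mean(diag(H)) to the diagonal so the Cholesky
    factorization used by GPTQ stays numerically stable. Smaller
    percdamp perturbs the Hessian less, at some stability risk."""
    damp = percdamp * np.mean(np.diag(H))
    return H + damp * np.eye(H.shape[0])

# A rank-deficient Hessian, as arises when calibration samples < columns.
X = np.random.default_rng(0).normal(size=(4, 8))  # 4 samples, 8 columns
H = X.T @ X                                        # rank <= 4, singular
H_damped = damp_hessian(H, percdamp=0.005)
L = np.linalg.cholesky(H_damped)                   # succeeds after damping
```

The trade-off implied by the record: less damping keeps the quantization error closer to the true second-order objective, which evidently helps at int6.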
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"learning_rate":0.001}
Test-Time Training
full TTT
parameters: {"learning_rate":0.001,"epochs":1,"frozen_blocks":"0-9"}
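The TTT entry above amounts to one AdamW epoch at lr 1e-3 over the test sequence with blocks 0-9 frozen. A toy sketch of the loop shape, with a scalar quadratic standing in for the LM loss (the decoupled-weight-decay update itself is standard AdamW; everything around it is illustrative):

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.0):
    """One AdamW update with decoupled weight decay."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
    return p, m, v

# Toy "test-time training": one epoch of AdamW on a stand-in loss
# (p - target)^2. In the record, only blocks 10+ would receive these
# updates; blocks 0-9 stay frozen.
target = 3.0
p, m, v = 0.0, 0.0, 0.0
for t in range(1, 101):        # ~one pass over the test sequence
    g = 2.0 * (p - target)     # gradient of the stand-in loss
    p, m, v = adamw_step(p, g, m, v, t)
```

Freezing the early blocks limits how far one epoch on a single sequence can drift the model away from its pretrained solution.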
Other
other
Sliding-window logit-space delta optimization with L-BFGS warm-start and focal loss on the last 128 tokens per window
parameters: {"method":"L-BFGS25","history":20,"warm_start":true,"delta_clip":5,"logit_space":true,"focal_tokens":128}
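A sketch of the per-window logit-delta idea, using scipy's stock L-BFGS-B as a stand-in for the record's "L-BFGS25" routine (history size 20 maps to `maxcor`; the ±5 bound implements `delta_clip`; the window's last tokens play the focal-token role). The simplification here is a single vocab-wide delta with plain cross-entropy rather than a focal loss, and the warm start is just reusing the previous window's delta as `delta0`:

```python
import numpy as np
from scipy.optimize import minimize

def optimize_delta(logits, targets, delta0, clip=5.0, history=20):
    """Find a logit-space offset minimizing cross-entropy on the
    window's focal tokens, via bounded L-BFGS with warm start."""
    T, V = logits.shape

    def nll(delta):
        z = logits + delta                          # broadcast over time
        z = z - z.max(axis=1, keepdims=True)        # stable softmax
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[np.arange(T), targets].mean()

    res = minimize(nll, delta0, method="L-BFGS-B",
                   bounds=[(-clip, clip)] * V,      # delta_clip
                   options={"maxcor": history})     # L-BFGS history
    return res.x, res.fun

rng = np.random.default_rng(0)
V, T = 16, 32                       # toy vocab size / focal-token count
logits = rng.normal(size=(T, V))
targets = rng.integers(0, V, size=T)
delta0 = np.zeros(V)                # next window warm-starts from delta
delta, loss = optimize_delta(logits, targets, delta0)
```

Warm-starting across overlapping windows is what makes the repeated L-BFGS solves cheap: consecutive windows share most of their context, so the previous delta is already close to optimal.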
Regularization
logit softcap
parameters: {"delta_clip":5}
Architecture
BigramHash
Bigram hash embedding/vocabulary component
parameters: {"shape":"3072x112"}
GQA
Grouped query attention
parameters: {"heads":"8/4"}
Partial RoPE
Rotary position embeddings applied to a subset of dimensions
parameters: {"dimensions":16}
LeakyReLU
LeakyReLU squared MLP activation
parameters: {"negative_slope":0.5}
SmearGate
SmearGate gating mechanism
parameters: null
U-Net skip connections
U-Net style skip connections across layers
parameters: null
XSA
XSA applied to all layers
parameters: {"layers":11}
Weight Averaging
EMA + SWA
parameters: null
Compression
lzma
level: 9
LR Schedule
warmdown
parameters: {"warmdown_iters":4000}
Evaluation
sliding window eval
parameters: {"window_size":2048,"focal_tokens":128}
Novel Contributions
- Test-time training with AdamW on the test sequence before scoring
- Sliding-window logit-space delta optimization using L-BFGS with warm-start
- Using GPTQ with reduced damping (0.005) to improve int6 quantization quality
- Combining TTT, SLOT, and GPTQ into a single high-performing evaluation pipeline