PR #1236
open
Non-record: SLOT eval-time delta optimization + QK-Gain (val_bpb=1.1179)
by ibarrajo
val_bpb
1.1179
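Assuming val_bpb denotes validation bits per byte, the figure can be related to a summed negative log-likelihood as sketched here. The function name and arguments are hypothetical and not taken from the PR.

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    # Dividing by ln(2) converts nats to bits; dividing by the byte count
    # makes the number independent of the tokenizer.
    return total_nll_nats / (math.log(2) * total_bytes)
```

Normalizing by bytes rather than tokens lets models with different tokenizers be compared on the same scale.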
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15.2 MB
Training Techniques
Evaluation
sliding window eval
parameters: {"stride":64}
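A minimal sketch of how sliding-window evaluation with a stride of 64 might partition a token stream. Only the stride comes from the listing; context_len and the function name are assumptions for illustration.

```python
def sliding_windows(n_tokens, context_len, stride):
    """Return (start, end, score_from) triples, one per evaluation window.

    Tokens in [score_from, end) are scored in that window; tokens in
    [start, score_from) are only fed in as context.
    """
    assert 0 < stride <= context_len
    windows = []
    scored_upto = 0
    start = 0
    while scored_upto < n_tokens:
        end = min(start + context_len, n_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end   # every token is scored exactly once
        start += stride     # advance the window by the stride
    return windows
```

For example, sliding_windows(300, 128, 64) scores tokens 0 to 127 in the first window and then 64 new tokens per window, each preceded by at least 64 tokens of context.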
Other
other
SLOT (Stochastic Logit Offset Tuning): optimizes an additive delta on the per-token logit biases at evaluation time; the delta is reset for each batch and the model weights are never updated
parameters: {"steps":8,"delta_shape":"[1, 1, 512]"}
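The idea of an eval-time additive delta can be sketched as follows, under several assumptions that the listing does not confirm: the delta is added directly to the logits, it is shared across all positions in the batch (mirroring the broadcast over the leading axes of a [1, 1, 512] tensor), it has one entry per logit of a toy vocabulary, and plain gradient descent stands in for the actual optimizer. Which tokens supply the loss is also not stated in the source. All names are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def mean_nll(logits, targets, delta):
    """Mean cross-entropy of (logits + delta) against the targets."""
    total = 0.0
    for row, t in zip(logits, targets):
        probs = softmax([x + d for x, d in zip(row, delta)])
        total -= math.log(probs[t])
    return total / len(targets)

def fit_delta(logits, targets, steps=8, lr=0.5):
    """Fit one shared additive delta for a batch; the logits stay frozen."""
    vocab = len(logits[0])
    delta = [0.0] * vocab  # reset to zero at the start of every batch
    for _ in range(steps):
        grad = [0.0] * vocab
        for row, t in zip(logits, targets):
            probs = softmax([x + d for x, d in zip(row, delta)])
            for j in range(vocab):
                # d(cross-entropy)/d(delta_j) = p_j - 1[j == target]
                grad[j] += (probs[j] - (1.0 if j == t else 0.0)) / len(targets)
        delta = [d - lr * g for d, g in zip(delta, grad)]
    return delta
```

Because only delta receives gradient updates, discarding it after the batch leaves the model exactly as it was.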
Architecture
QK-Gain
Per-head learnable scalar on queries after QK-norm, initialized to 4.0
parameters: {"init":4}
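A rough sketch of a per-head gain applied to queries after QK-norm, for a single query/key pair. The RMS-style normalization and the 1/sqrt(d) scaling are assumptions; only the gain placement and the 4.0 initial value come from the listing.

```python
import math

def rms_norm(vec, eps=1e-6):
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]

def attention_logit(q, k, gain=4.0):
    """Pre-softmax score for one query/key pair in one head."""
    qn = [gain * x for x in rms_norm(q)]  # gain multiplies the normalized query
    kn = rms_norm(k)
    return sum(a * b for a, b in zip(qn, kn)) / math.sqrt(len(q))
```

After normalization the dot product is bounded, so in this sketch the gain sets the largest logit magnitude the head can produce, and with it how peaked the softmax can become.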
weight tying
Tied embeddings
parameters: null
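Tied embeddings generally mean that one table serves both as the input lookup and as the output projection. A toy illustration with hypothetical function names:

```python
def embed(table, token):
    """Input side: look up the row for a token id."""
    return table[token]

def output_logits(table, hidden):
    """Output side: reuse the same table, one dot product per vocabulary row."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in table]
```

Sharing the table removes a separate vocabulary-sized output matrix from the parameter count.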
SmearGate
SmearGate module included in the architecture
parameters: null
BigramHash
BigramHash embedding component
parameters: {"dimensions":"6144x128"}
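One way to read the 6144x128 dimensions is a table of 6144 hashed buckets with 128-wide rows. Under that assumption, a bucket index for a token pair might be computed as below; the hash function and its constant are purely illustrative and not taken from the PR.

```python
NUM_BUCKETS = 6144  # assumed to be the number of table rows
EMBED_DIM = 128     # assumed to be the width of each row

def bigram_bucket(prev_token, token, num_buckets=NUM_BUCKETS):
    """Map a (previous, current) token pair to a row index in the table."""
    # Illustrative mixing constant; distinct pairs may collide by design.
    return (prev_token * 1000003 + token) % num_buckets
```

Hashing keeps the table size fixed no matter how many distinct bigrams occur, at the cost of collisions between unrelated pairs.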
XSA
XSA-all attention component
parameters: null
VE128
Value Embedding component
parameters: {"dimensions":128}
U-Net skip connections
U-Net style skip connections in the model
parameters: null
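A bare-bones sketch of U-Net style skips across a layer stack, assuming additive skips with no learned weighting (the listing does not say how the skips are combined). Layers are represented as plain callables.

```python
def unet_forward(x, encoder_layers, decoder_layers):
    """Stash encoder outputs, then add them back in reverse order in the decoder."""
    assert len(decoder_layers) <= len(encoder_layers)
    skips = []
    for layer in encoder_layers:
        x = layer(x)
        skips.append(x)
    for layer in decoder_layers:
        # The last encoder output pairs with the first decoder layer.
        x = layer(x + skips.pop())
    return x
```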
Partial RoPE
RoPE applied to a partial subset of dimensions
parameters: {"dimensions":16}
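Partial RoPE can be sketched as rotating only the first 16 entries of each head vector and passing the rest through unchanged. The interleaved pairing, the base of 10000, and the choice of the leading dimensions are assumptions here.

```python
import math

def partial_rope(vec, position, rope_dims=16, base=10000.0):
    """Rotate the first rope_dims entries by position-dependent angles."""
    assert rope_dims % 2 == 0 and rope_dims <= len(vec)
    out = list(vec)  # entries past rope_dims are copied through untouched
    for i in range(rope_dims // 2):
        theta = position * base ** (-2.0 * i / rope_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

The untouched dimensions carry no positional signal, so they can be used for purely content-based matching.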
Quantization
GPTQ
bits: 5
scope: all
late QAT
bits: null
scope: all
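To give a feel for what a 5-bit grid looks like, here is plain symmetric round-to-nearest quantization of one weight row. This is not GPTQ, which additionally adjusts the not-yet-quantized weights to compensate for rounding error; the per-row scale is also an assumption.

```python
def quantize_int5(weights):
    """Round one row of weights to signed 5-bit codes in [-16, 15]."""
    qmin, qmax = -16, 15
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]
```

With 32 levels per weight, the reconstruction error of each entry in this scheme is at most half the scale.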
Regularization
LN scale
parameters: null
magnitude pruning
parameters: {"sparsity":"10%"}
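Magnitude pruning at 10% sparsity can be illustrated on a flat list of weights. Whether the PR prunes per tensor or globally is not stated, so the scope here is an assumption.

```python
def magnitude_prune(weights, sparsity=0.10):
    """Zero the smallest-magnitude fraction of a flat weight list."""
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]
```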
Test-Time Training
score-first TTT
parameters: {"epochs":3}
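The ordering implied by "score-first" can be sketched with a toy unigram model standing in for the network: each chunk is scored with the current parameters, and only afterwards used for 3 epochs of adaptation. The chunking, the learning rate, and the toy model are assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def chunk_nll(logits, chunk):
    """Summed negative log-likelihood of a chunk under the toy model."""
    probs = softmax(logits)
    return -sum(math.log(probs[t]) for t in chunk)

def score_first_ttt(chunks, vocab, epochs=3, lr=0.5):
    """Score each chunk before adapting on it; return the total NLL."""
    logits = [0.0] * vocab  # toy unigram parameters
    total = 0.0
    for chunk in chunks:
        total += chunk_nll(logits, chunk)  # 1) score with current parameters
        for _ in range(epochs):            # 2) adapt only after scoring
            probs = softmax(logits)
            grad = [len(chunk) * p for p in probs]
            for t in chunk:
                grad[t] -= 1.0
            logits = [x - lr * g / len(chunk) for x, g in zip(logits, grad)]
    return total
```

Scoring before updating means no token contributes to the reported loss after the model has already trained on it.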
Optimizer
AdamW
weight_decay: 1e-8
momentum: null
other_params: {"lr":0.005,"eps":0.00001}
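For reference, a single AdamW update on a scalar with the listed lr, eps, and weight_decay looks roughly like this. The beta values are not given in the listing; the common defaults of 0.9 and 0.999 are assumed.

```python
import math

def adamw_step(param, grad, m, v, t, lr=0.005, eps=1e-5, weight_decay=1e-8,
               beta1=0.9, beta2=0.999):
    """One AdamW update for a scalar parameter; t is the 1-based step count."""
    param -= lr * weight_decay * param         # decoupled weight decay
    m = beta1 * m + (1 - beta1) * grad         # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

On the first step the update has magnitude close to lr regardless of the gradient scale, and a weight decay of 1e-8 is small enough to be nearly inert.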
Novel Contributions
- SLOT eval-time delta optimization
- Per-batch additive logit delta optimization without modifying model weights
- QK-Gain initialization raised to 4.0
- Int5 GPTQ compressed submission with score-first evaluation pipeline