val_bpb: 0.7406
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.75-15.82 MB
Training Techniques
Evaluation
Sliding-window evaluation with a 96-token stride.
parameters: {"stride":96}
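The stride-96 evaluation can be sketched as the usual sliding-window bookkeeping: each window re-reads some overlapping context, but only positions not yet scored by an earlier window contribute to val_bpb. The window length and the "score only new positions" scheme are assumptions for illustration; the record only specifies the stride.

```python
def sliding_windows(n_tokens, window, stride):
    """Yield (start, end, score_from) spans covering n_tokens.

    Each window covers [start, end); only positions [score_from, end)
    are scored, so every token is scored exactly once while later
    windows still see up to (window - stride) tokens of context.
    """
    spans = []
    scored_to = 0  # first position not yet scored
    start = 0
    while scored_to < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_to))
        scored_to = end
        start += stride
    return spans

# Example: 300 tokens, hypothetical window of 128, stride 96.
spans = sliding_windows(300, window=128, stride=96)
```

With `window=128` and `stride=96`, every window after the first carries 32 tokens of overlapping context that are masked out of the score.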
Architecture
SLOT
Per-sample test-time optimization of a hidden-state delta and a logit bias while the model weights stay frozen.
parameters: {"hidden_delta_shape":"[bsz, 1, 512]","logit_bias_shape":"[bsz, 1, 1024]"}
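A minimal sketch of the logit-bias half of SLOT: the frozen model's logits for one sample are fixed, and a per-sample bias vector is optimized for a few steps to reduce that sample's loss. Plain gradient descent stands in for the AdamW steps used in the submission, and the tiny shapes are illustrative only.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def slot_logit_bias(logits_per_pos, targets, steps=48, lr=0.1):
    """Fit one throwaway logit-bias vector for a single sample.

    logits_per_pos: frozen-model logit vectors, one per position.
    The gradient of cross-entropy w.r.t. the bias at each position is
    softmax(logits + bias) - onehot(target), averaged over positions.
    """
    vocab = len(logits_per_pos[0])
    bias = [0.0] * vocab
    for _ in range(steps):
        grad = [0.0] * vocab
        for logits, t in zip(logits_per_pos, targets):
            p = softmax([l + b for l, b in zip(logits, bias)])
            for v in range(vocab):
                grad[v] += p[v] - (1.0 if v == t else 0.0)
        bias = [b - lr * g / len(targets) for b, g in zip(bias, grad)]
    return bias
```

The hidden-state delta is optimized the same way, but its gradient flows through the frozen transformer rather than just the softmax.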
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
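The 8-head / 4-KV-head split means each KV head is shared by a group of query heads. A small sketch, assuming the standard consecutive-grouping layout (query heads 0-1 share KV head 0, and so on):

```python
import math

def kv_head_index(q_head, n_heads=8, n_kv_heads=4):
    """Map a query head to its shared KV head (consecutive grouping)."""
    group = n_heads // n_kv_heads  # 2 query heads per KV head here
    return q_head // group

def gqa_single_token(queries, keys, values):
    """Attention for one query token.

    queries: n_heads query vectors; keys/values: per-KV-head lists of
    key/value vectors for the context. Each query head attends using
    the keys and values of its assigned KV head.
    """
    n_heads, n_kv = len(queries), len(keys)
    outs = []
    for h, q in enumerate(queries):
        kvh = kv_head_index(h, n_heads, n_kv)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
                  for k in keys[kvh]]
        m = max(scores)
        ws = [math.exp(s - m) for s in scores]
        z = sum(ws)
        ws = [w / z for w in ws]
        dim = len(values[kvh][0])
        outs.append([sum(w * v[d] for w, v in zip(ws, values[kvh]))
                     for d in range(dim)])
    return outs
```

Halving the KV heads halves the KV-cache and the K/V projection parameters, which matters for a ~16 MB artifact budget.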
LeakyReLU
Leaky ReLU activation used in the MLP.
parameters: {"slope":0.5}
MLP3x
Three-layer MLP.
parameters: {"layers":3}
VE128
Value-residual / value-embedding enhancement using 128 dimensions.
parameters: {"dimensions":128}
BigramHash
Bigram hash embedding with 1024 buckets.
parameters: {"buckets":1024}
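A sketch of the bigram-hash lookup: the (previous token, current token) pair is hashed into one of 1024 buckets, whose embedding can then augment the ordinary token embedding. The specific mixing constant and BOS handling below are assumptions; only the bucket count comes from the record.

```python
def bigram_bucket(prev_id, cur_id, n_buckets=1024):
    # Simple multiplicative mix of the id pair; the real hash function
    # used in the submission is not specified, so this is illustrative.
    h = (prev_id * 0x9E3779B1 + cur_id) & 0xFFFFFFFF
    return h % n_buckets

def bigram_buckets(token_ids, n_buckets=1024, bos_id=0):
    """Bucket index for every position of a sequence (assumed BOS pad)."""
    prev = [bos_id] + token_ids[:-1]
    return [bigram_bucket(p, c, n_buckets) for p, c in zip(prev, token_ids)]
```

Hashing keeps the table at 1024 rows regardless of vocabulary size, trading occasional bucket collisions for a fixed parameter cost.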
XSA
XSA applied across all layers.
parameters: {"layers":11}
Partial RoPE
Partial rotary positional embedding, applied to 16 of the 64 head dimensions.
parameters: {"partial":"16/64"}
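Partial RoPE rotates only a prefix of each head vector and passes the rest through unchanged. A sketch for the 16-of-64 setting, assuming the standard paired-dimension rotation and base 10000:

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate the first rot_dims dims of head vector x at position pos.

    Dimensions are rotated in (even, odd) pairs at frequencies that
    decay geometrically; dims beyond rot_dims are left untouched.
    """
    out = list(x)
    half = rot_dims // 2
    for i in range(half):
        theta = pos / (base ** (2 * i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Leaving 48 of 64 dimensions unrotated gives the head position-free channels to work with while still encoding relative position in the rotated prefix.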
SmearGate
SmearGate gating mechanism.
parameters: null
U-Net skip connections
U-Net style skip connections.
parameters: null
Optimizer
AdamW
weight_decay: 1e-8
momentum: null
other_params: {"steps":48}
Compression
lzma
level: null
Quantization
late QAT
bits: 6
scope: all
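Late QAT typically means switching to fake-quantized forward passes near the end of training: weights are rounded to the 6-bit grid in the forward pass while the float master weights keep receiving updates. A sketch of the quantize-dequantize step, assuming symmetric per-tensor scaling (the record specifies only the bit width and scope):

```python
def fake_quant(weights, bits=6):
    """Round weights to a signed bits-bit grid and map back to float.

    Returns (dequantized_weights, scale). With 6 bits the integer grid
    is [-32, 31]; symmetric per-tensor scaling is an assumption.
    """
    levels = 2 ** (bits - 1) - 1  # 31 positive levels for 6 bits
    scale = max(abs(w) for w in weights) / levels or 1.0
    q = [max(-levels - 1, min(levels, round(w / scale))) for w in weights]
    return [v * scale for v in q], scale
```

Because the grid is what gets serialized, the 6-bit rounding is also what keeps the compressed artifact inside the size budget.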
Weight Averaging
EMA
parameters: {"decay":0.997}
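The EMA update with decay 0.997 is a one-liner: after each optimizer step, a shadow copy of the weights moves a fraction (1 - decay) toward the live weights, and the shadow copy is what gets evaluated or exported.

```python
def ema_update(shadow, live, decay=0.997):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * live."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, live)]

# Usage: call once per training step.
shadow = [0.0]
for _ in range(1000):
    shadow = ema_update(shadow, live=[1.0], decay=0.997)
```

With decay 0.997 the effective averaging window is on the order of 1 / (1 - 0.997) ≈ 333 steps.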
Regularization
LN scale
parameters: null
weight decay
parameters: {"value":0.04}
Sequence Length
sequence_length
train_length: null
eval_length: null
LR Schedule
cosine decay
parameters: {"start_lr":0.012,"end_lr":0.001}
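The cosine schedule above can be sketched directly from its two endpoints; the total step count is not given in the record, so it is a free parameter here.

```python
import math

def cosine_lr(step, total_steps, start_lr=0.012, end_lr=0.001):
    """Cosine decay from start_lr at step 0 to end_lr at total_steps."""
    t = min(step, total_steps) / total_steps
    return end_lr + 0.5 * (start_lr - end_lr) * (1.0 + math.cos(math.pi * t))
```

The schedule starts at 0.012, passes through the midpoint of the two rates halfway through, and flattens out at 0.001.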
Novel Contributions
- SLOT-48 test-time optimization with 48 AdamW steps
- Improved val_bpb to 0.7406 using the same model and training as PR #1313
- Scaling SLOT from 24 to 48 steps produced a large BPB gain
- Frozen-model evaluation with per-window throwaway hidden delta and logit bias
- Scored-position masking during SLOT evaluation