PR #1647
openSP8192 + SLOT-4 + TTT + 3-Layer Recurrence + Parallel Residuals (1.0616 BPB)
by powerpratik
val_bpb
1.0616
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~16.0MB
Training Techniques
Quantization
mixed int6/int8
bits: 6
scope: model weights
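A minimal sketch of symmetric per-tensor quantization for the int6/int8 weight storage. The PR does not specify the mixing policy or scale scheme, so the per-tensor absmax scaling below is an assumption:

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Symmetric per-tensor quantization of weights to signed `bits`-bit ints."""
    qmax = 2 ** (bits - 1) - 1            # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax if w.size else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 16)).astype(np.float32)
q6, s6 = quantize_symmetric(w, bits=6)
q8, s8 = quantize_symmetric(w, bits=8)
err6 = np.abs(dequantize(q6, s6) - w).max()
err8 = np.abs(dequantize(q8, s8) - w).max()
assert err8 < err6  # int8 round-trips more precisely than int6
```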
Compression
Brotli
level: null
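The quantized weights are then entropy-coded; the PR uses Brotli at its default level. Brotli needs a third-party package in Python, so zlib stands in below to show the same serialize-then-compress flow:

```python
import io
import zlib
import numpy as np

# Serialize the quantized tensor, then compress. The PR uses Brotli
# (e.g. via the third-party `brotli` package); zlib stands in here so
# the sketch needs only the standard library.
rng = np.random.default_rng(0)
q = rng.integers(-32, 32, size=(256, 256)).astype(np.int8)  # toy int6-range weights

buf = io.BytesIO()
np.save(buf, q)
raw = buf.getvalue()
packed = zlib.compress(raw, level=9)
assert len(packed) < len(raw)  # entropy coding shrinks the low-bit weights
```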
Architecture
depth recurrence
3-layer depth recurrence activated during training
parameters: {"layers":3}
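A sketch of depth recurrence: a small stack of layers is applied repeatedly, so effective depth grows without new parameters. `{"layers": 3}` is read here as a shared 3-layer block; the recurrence count and the toy residual layer are assumptions:

```python
import numpy as np

def layer(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """One toy residual layer (stand-in for a transformer block)."""
    return x + np.tanh(x @ w)

rng = np.random.default_rng(0)
block = [0.1 * rng.normal(size=(8, 8)) for _ in range(3)]  # 3 shared layers

def forward(x, recurrences=2):
    for _ in range(recurrences):      # reuse the same 3 layers each pass
        for w in block:
            x = layer(x, w)
    return x

x = rng.normal(size=(4, 8))
y = forward(x)
assert y.shape == x.shape
```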
parallel residuals
GPT-J style parallel residual connections
parameters: null
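In GPT-J style parallel residuals, the attention and MLP branches both read the same normalized input and their outputs are summed into one residual update, instead of the MLP seeing the attention output. A minimal sketch with linear stand-ins for both sublayers:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
w_attn = 0.1 * rng.normal(size=(d, d))   # stand-in for the attention sublayer
w_mlp = 0.1 * rng.normal(size=(d, d))    # stand-in for the MLP sublayer

def layernorm(x):
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True) + 1e-5
    return (x - mu) / sd

def sequential_block(x):
    # Standard pre-norm block: the MLP sees the attention output.
    x = x + layernorm(x) @ w_attn
    x = x + layernorm(x) @ w_mlp
    return x

def parallel_block(x):
    # GPT-J style: both branches read the same input; one residual add.
    h = layernorm(x)
    return x + h @ w_attn + h @ w_mlp

x = rng.normal(size=(4, d))
assert parallel_block(x).shape == x.shape
```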
LeakyReLU
LeakyReLU activation used in the MLP
parameters: {"slope":0.5}
MLP3x
4x MLP expansion in the base stack
parameters: {"multiplier":4}
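The MLP with the listed parameters, sketched directly: a 4x hidden expansion and LeakyReLU with the unusually high negative slope of 0.5 (weight init here is an assumption):

```python
import numpy as np

def leaky_relu(x, slope=0.5):
    # Negative slope 0.5, per the PR's parameters.
    return np.where(x >= 0, x, slope * x)

d = 8
rng = np.random.default_rng(0)
w_in = 0.1 * rng.normal(size=(d, 4 * d))   # 4x expansion
w_out = 0.1 * rng.normal(size=(4 * d, d))

def mlp(x):
    return leaky_relu(x @ w_in) @ w_out

x = rng.normal(size=(4, d))
assert mlp(x).shape == (4, d)
```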
Test-Time Training
score-first TTT
parameters: {"epochs_per_chunk":3}
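"Score-first" TTT means each evaluation chunk is scored with the current weights before the model trains on it, so the reported loss is never measured on data the model has already fit. A toy sketch with a scalar predictor; `epochs_per_chunk=3` matches the PR's parameter, everything else is illustrative:

```python
def evaluate_with_ttt(chunks, epochs_per_chunk=3, lr=0.1):
    w = 0.0                       # toy parameter: predict each value as w
    total, count = 0.0, 0
    for chunk in chunks:
        for x in chunk:           # 1) score first, with current w
            total += (w - x) ** 2
            count += 1
        for _ in range(epochs_per_chunk):   # 2) then adapt on the chunk
            for x in chunk:
                w -= lr * 2 * (w - x)       # gradient step on squared error
    return total / count

drifting = [[1.0] * 8, [2.0] * 8, [3.0] * 8]   # distribution shifts per chunk
static = evaluate_with_ttt(drifting, epochs_per_chunk=0)
adapted = evaluate_with_ttt(drifting, epochs_per_chunk=3)
assert adapted < static   # adapting between chunks lowers the score
```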
Evaluation
sliding window eval
parameters: null
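Sliding-window evaluation scores every token with (up to) a full window of preceding context, sliding one position at a time rather than chopping the text into disjoint blocks. The scoring function below is a toy stand-in for the model's per-token loss:

```python
def score(token, context_tokens):
    # Toy per-token "loss": distance from the mean of the visible context.
    if not context_tokens:
        return float(abs(token))
    mean = sum(context_tokens) / len(context_tokens)
    return abs(token - mean)

def sliding_window_eval(tokens, context=4):
    losses = []
    for i, tok in enumerate(tokens):
        window = tokens[max(0, i - context):i]   # full context for every token
        losses.append(score(tok, window))
    return sum(losses) / len(losses)

seq = [1, 1, 1, 5, 5, 5, 5, 5]
val = sliding_window_eval(seq, context=4)
assert val >= 0
```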
Optimizer
AdamW
weight_decay: 0.01
momentum: null
other_params: {"lr":0.01}
Regularization
weight decay
parameters: {"value":0.01}
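One AdamW update with the listed settings (lr 0.01, decoupled weight decay 0.01), sketched in numpy. The betas and eps are not given in the PR, so the common defaults below are assumptions:

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=0.01, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW step: Adam moments plus *decoupled* weight decay."""
    m = betas[0] * m + (1 - betas[0]) * g
    v = betas[1] * v + (1 - betas[1]) * g * g
    m_hat = m / (1 - betas[0] ** t)      # bias-corrected first moment
    v_hat = v / (1 - betas[1] ** t)      # bias-corrected second moment
    # Decay is applied directly to w, not folded into the gradient.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

w = np.ones(3)
m = np.zeros(3)
v = np.zeros(3)
g = np.array([0.1, -0.2, 0.3])
w, m, v = adamw_step(w, g, m, v, t=1)
assert w.shape == (3,)
```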
Weight Averaging
EMA
parameters: {"decay":0.9965}
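EMA weight averaging with the listed decay 0.9965: the evaluation weights track an exponential moving average of the training weights. A minimal sketch:

```python
def ema_update(ema, w, decay=0.9965):
    # New EMA = decay * old EMA + (1 - decay) * current weights.
    return [decay * e + (1 - decay) * x for e, x in zip(ema, w)]

weights = [0.0, 0.0]
ema = list(weights)
for step in range(1000):
    weights = [1.0, -1.0]          # pretend training converged here
    ema = ema_update(ema, weights)
assert abs(ema[0] - 1.0) < 0.05    # EMA tracks the trained weights
```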
Novel Contributions
- SLOT (Sample-Level Optimization at Test-time) with per-window logit bias optimization
- 4-step AdamW optimization of a zero-initialized delta tensor at evaluation time
- Combining SLOT with the existing PR #1493 stack to improve validation BPB
- 3-seed evaluation showing improved mean BPB to 1.0616
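The SLOT idea above can be sketched as optimizing a zero-initialized logit-bias tensor against the current window for a few steps. Plain gradient steps stand in for the PR's 4-step AdamW, and the frozen logits and vocabulary size are toy assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def window_nll(logits, delta, tokens):
    """Mean NLL of the window's tokens under biased logits."""
    p = softmax(logits + delta)
    return -np.mean(np.log(p[tokens]))

vocab, lr = 16, 0.5
rng = np.random.default_rng(0)
logits = rng.normal(size=vocab)          # model's frozen logits (toy)
tokens = np.array([3, 3, 3, 7, 3])       # tokens seen in this window

delta = np.zeros(vocab)                  # zero-initialized delta tensor
before = window_nll(logits, delta, tokens)
for _ in range(4):                       # 4 optimization steps, as in the PR
    p = softmax(logits + delta)
    counts = np.bincount(tokens, minlength=vocab) / len(tokens)
    grad = p - counts                    # d(mean NLL)/d(delta)
    delta -= lr * grad
after = window_nll(logits, delta, tokens)
assert after < before                    # the bias improves window fit
```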