PR #1333
open
Record: SP4096 + Depth Recurrence + Parallel Residuals + Causal SLOT-16 — val_bpb 1.0766 (3-seed mean)
by aryanbhosale
val_bpb
1.0766
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~16.00 MB
Training Techniques
Architecture
depth recurrence
Recurrence applied to selected layers during training.
parameters: {"layers":[4,5]}
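A minimal sketch of what re-running layers 4 and 5 could look like as an execution schedule. The total layer count (`num_layers=11`), the number of passes (`repeats=2`), 0-based indexing, and the choice to loop the two layers as one contiguous block are all assumptions for illustration; the card only states which layers are recurrent.

```python
def layer_schedule(num_layers, recurrent_layers, repeats=2):
    """Return the order in which layer indices are executed.

    Assumes the recurrent layers form one contiguous block that is
    looped as a group (hypothetical; the card does not say how).
    """
    block = sorted(recurrent_layers)
    first, last = block[0], block[-1]
    schedule = list(range(first))                  # layers before the block
    schedule += block * repeats                    # shared weights, run repeatedly
    schedule += list(range(last + 1, num_layers))  # layers after the block
    return schedule


# Hypothetical 11-layer model with layers [4, 5] run twice.
order = layer_schedule(11, [4, 5])
```

The repeated indices reuse the same weights, so effective depth grows without adding parameters to the artifact.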
parallel residuals
Parallel residual pathway introduced starting from a specified layer.
parameters: {"start_layer":7}
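The difference between a sequential and a parallel residual block can be sketched as follows. This is a toy under stated assumptions: `attn` and `mlp` are placeholder callables, normalization is omitted, layer indices are taken as 0-based, and `start_layer` is treated as inclusive.

```python
def block_forward(x, attn, mlp, layer_idx, start_layer=7):
    """Apply one transformer block to x (toy scalar version)."""
    if layer_idx >= start_layer:
        # Parallel residual: both branches read the same block input.
        return x + attn(x) + mlp(x)
    # Sequential residual: the MLP reads the attention output.
    h = x + attn(x)
    return h + mlp(h)


# Placeholder branches, chosen only to make the two paths differ.
attn = lambda v: 2.0 * v
mlp = lambda v: 3.0 * v
parallel_out = block_forward(1.0, attn, mlp, layer_idx=7)
sequential_out = block_forward(1.0, attn, mlp, layer_idx=6)
```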
Quantization
GPTQ
bits: 6
scope: all
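To illustrate what a 6-bit weight code means, here is a plain round-to-nearest symmetric quantizer. It is not GPTQ: GPTQ additionally compensates quantization error using second-order information, which this sketch leaves out. The per-tensor scale and the function names are assumptions.

```python
def quantize_int6(weights):
    """Round-to-nearest symmetric quantization to signed 6-bit codes."""
    qmax = 31  # signed 6-bit integers span [-32, 31]
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-32, min(31, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


weights = [0.0, 0.5, -1.0, 0.25]
codes, scale = quantize_int6(weights)
restored = dequantize(codes, scale)
```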
Compression
lzma
level: null
brotli
level: null
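A small round-trip sketch of the lzma side of the artifact wrapper, using Python's standard `lzma` module with its default preset because no level is recorded here. Brotli is omitted because it needs a third-party package; the `pack`/`unpack` names and the payload are hypothetical.

```python
import lzma


def pack(raw: bytes) -> bytes:
    """Compress a serialized artifact with lzma's default preset."""
    return lzma.compress(raw)


def unpack(blob: bytes) -> bytes:
    """Recover the original bytes."""
    return lzma.decompress(blob)


# Stand-in payload; real quantized weights compress far less readily.
payload = bytes(range(64)) * 256
blob = pack(payload)
```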
Optimizer
AdamW
weight_decay: 0.09
momentum: null
other_params: {"slot_lr":0.008,"slot_steps":16}
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":4096}
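One common way to lay out sliding-window evaluation with these parameters is sketched below: each window holds up to `context_length` tokens, windows advance by `stride`, and each window scores only the tokens no earlier window has scored. Whether the PR uses exactly this layout is an assumption, and the one-position shift between inputs and next-token targets is ignored.

```python
def sliding_windows(num_tokens, context_length=4096, stride=64):
    """Return (start, end, score_from) triples.

    The model sees tokens[start:end]; only positions
    score_from..end-1 contribute to the reported loss.
    """
    windows = []
    prev_end = 0
    for start in range(0, num_tokens, stride):
        end = min(start + context_length, num_tokens)
        windows.append((start, end, prev_end))
        prev_end = end
        if end == num_tokens:
            break
    return windows


windows = sliding_windows(5000)
```

After the first window, every scored token has close to the full 4096 tokens of preceding context, at the cost of one forward pass per 64 scored tokens.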
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.008,"steps":16,"delta_dim":512}
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
Regularization
weight decay
parameters: {"value":0.09}
Novel Contributions
- Causal SLOT evaluation with per-batch additive delta optimized only on already-scored context tokens
- Depth recurrence on layers 4 and 5
- Parallel residuals starting from layer 7
- 4096-vocabulary setup with MLP 4x and weight decay 0.090
- GPTQ int6 quantization with compressed artifact wrapper
- Provably causal sliding-window evaluation using context-only delta optimization
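The causal, score-first idea in the first and last items can be sketched with a toy: fit an additive delta using only positions that were already scored, then score the new positions with that delta. This is a heavily simplified stand-in, not the PR's code. It uses a scalar delta instead of a 512-dimensional one, squared error instead of cross-entropy, and plain gradient descent instead of AdamW; the function name and inputs are hypothetical.

```python
def causal_slot_score(base_preds, targets, score_from, lr=0.008, steps=16):
    """Fit delta on positions < score_from, then score positions >= score_from."""
    context = list(zip(base_preds[:score_from], targets[:score_from]))
    delta = 0.0
    for _ in range(steps):
        if not context:
            break  # nothing scored yet, so nothing to adapt on
        # Gradient of mean squared error with respect to delta, context only.
        grad = sum(2.0 * (p + delta - t) for p, t in context) / len(context)
        delta -= lr * grad
    new = list(zip(base_preds[score_from:], targets[score_from:]))
    loss = sum((p + delta - t) ** 2 for p, t in new) / len(new) if new else 0.0
    return delta, loss


# Changing the not-yet-scored targets must leave delta unchanged.
delta_a, _ = causal_slot_score([0.0] * 4, [1.0, 1.0, 5.0, 5.0], score_from=2)
delta_b, _ = causal_slot_score([0.0] * 4, [1.0, 1.0, -9.0, 7.0], score_from=2)
```

Because the optimization loop never reads targets at or beyond `score_from`, the delta cannot leak information from the tokens being scored, which is the property the causality claim relies on.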