PR #1084

open

Non-Record: SLOT Eval-Time Augmentation on PR #549 SOTA Stack | val_bpb = 1.1185 (3-seed mean, std 0.0003) | ~15.9 MB | 8×H100 SXM

by AnubhavBharadwaaj
val_bpb: 1.1185
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.9 MB

Training Techniques

Architecture
BigramHash
Bigram hash embedding component used in the base model.
parameters: {"vocab_size":1536}
XSA
XSA attention modification using the last N tokens.
parameters: {"last_n":4}
Partial RoPE
Partial rotary positional embedding applied to a subset of dimensions.
parameters: {"dimensions":16}
VE128
Value residual / VE module with 128-dimensional value enhancement.
parameters: {"dimension":128,"layers":[9,10]}
LeakyReLU
LeakyReLU squared MLP activation used in the feed-forward blocks.
parameters: {"squared":true}
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997,"swa_every":50}
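
The two averaging schemes named above can be sketched on a toy weight vector. This is a minimal illustration, not the PR's code: it assumes "EMA" is a standard exponential moving average with the listed decay and that SWA takes a uniform mean of snapshots every `swa_every` steps. What "Tight" restricts, and how the two averages are combined, is not stated in this summary, so neither is modelled here. The names `ema_update` and `SWAAverager` are hypothetical.

```python
def ema_update(ema, weights, decay=0.997):
    """Blend the current weights into the running EMA, in place."""
    for i, w in enumerate(weights):
        ema[i] = decay * ema[i] + (1.0 - decay) * w
    return ema


class SWAAverager:
    """Uniform running mean of snapshots taken every `every` steps."""

    def __init__(self, every=50):
        self.every = every
        self.count = 0
        self.mean = None

    def maybe_add(self, step, weights):
        if step % self.every != 0:
            return False
        self.count += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            # Incremental mean: m += (w - m) / n
            self.mean = [m + (w - m) / self.count
                         for m, w in zip(self.mean, weights)]
        return True


# Toy run: a 2-parameter "model" whose weights drift linearly with the step.
ema = [0.0, 0.0]
swa = SWAAverager(every=50)
for step in range(1, 201):
    weights = [step * 0.01, -step * 0.01]
    ema_update(ema, weights)
    swa.maybe_add(step, weights)
```

In this toy run the SWA mean only sees the snapshots at steps 50, 100, 150 and 200, while the EMA is updated on every step.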
Quantization
GPTQ-lite
bits: 6
scope: model
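
To make the 6-bit budget concrete, the sketch below quantizes one weight row to signed 6-bit integers with a per-row scale. It is plain round-to-nearest, not GPTQ's error-compensated procedure, and what "GPTQ-lite" does beyond this is not described in the summary. The per-row symmetric scale and the function names are assumptions for illustration.

```python
def quantize_row(row, bits=6):
    """Symmetric round-to-nearest quantization of one weight row.

    Assumption: one scale per row, signed integer range [-32, 31] for 6 bits.
    """
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6 bits
    scale = max(abs(w) for w in row) / qmax or 1.0  # avoid a zero scale
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Map the integer codes back to approximate float weights."""
    return [qi * scale for qi in q]


row = [0.5, -0.25, 0.1, 0.0]
codes, scale = quantize_row(row)
approx = dequantize_row(codes, scale)
max_err = max(abs(a - w) for a, w in zip(approx, row))
```

With round-to-nearest, the reconstruction error of each weight is at most half the scale.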
Compression
lzma
level: 6
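
The compression step can be reproduced with Python's standard `lzma` module, assuming "level: 6" corresponds to the `preset` argument. The payload below is a stand-in for the serialized quantized weights; the actual serialization format is not given in this summary.

```python
import lzma


def compress_artifact(raw: bytes, level: int = 6) -> bytes:
    """Compress a serialized model blob with LZMA at the given preset."""
    return lzma.compress(raw, preset=level)


# Stand-in payload; a real artifact would be the packed 6-bit weights.
payload = bytes(range(256)) * 64
blob = compress_artifact(payload)
restored = lzma.decompress(blob)
print(len(payload), "->", len(blob), "bytes")
```

The size reported for the artifact would be measured on the compressed blob, so the round trip through `lzma.decompress` is the check that nothing was lost.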
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768}
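
"Score-first" is read here as: each chunk is scored with the weights as they stand, and only afterwards used for adaptation, so no token is scored by a model that has already trained on it. The sketch below shows only that ordering. The toy unigram model and its `nll`/`train` methods are hypothetical stand-ins; per the Optimizer entries, the PR itself adapts with SGD (learning rate 0.002, momentum 0.9), which is not modelled here.

```python
import math
from collections import Counter


class UnigramModel:
    """Toy stand-in for the language model: add-one smoothed unigram counts."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.counts = Counter()
        self.total = 0

    def nll(self, tokens):
        """Total negative log-likelihood of `tokens`, in nats."""
        return -sum(
            math.log((self.counts[t] + 1) / (self.total + self.vocab_size))
            for t in tokens
        )

    def train(self, tokens):
        self.counts.update(tokens)
        self.total += len(tokens)


def score_first_ttt(model, tokens, chunk_tokens=32768, epochs=3):
    """Return mean NLL per token, scoring each chunk before adapting on it."""
    total_nll = 0.0
    for start in range(0, len(tokens), chunk_tokens):
        chunk = tokens[start:start + chunk_tokens]
        total_nll += model.nll(chunk)   # 1) score with the current weights
        for _ in range(epochs):         # 2) only then adapt on the chunk
            model.train(chunk)
    return total_nll / len(tokens)


tokens = [0, 1, 0, 0, 1, 0, 0, 0] * 4
adaptive = score_first_ttt(UnigramModel(2), tokens, chunk_tokens=8)
static = UnigramModel(2).nll(tokens) / len(tokens)
```

On this skewed toy stream the adaptive score is lower than the static one, because every chunk after the first is scored by a model that has seen the earlier chunks.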
Evaluation
sliding window eval
parameters: {"stride":64}
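
A sliding-window evaluation with stride 64 can be sketched as a window planner: the first window scores all of its tokens, and each later window advances by the stride and scores only the tokens not yet scored, using the rest as context. The context length `seq_len` is an assumption (it is not listed in this summary), and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(n_tokens, seq_len, stride=64):
    """Return (start, end, score_from) triples covering n_tokens.

    Tokens in [score_from, end) are scored; tokens in [start, score_from)
    only provide context. Every token is scored exactly once.
    """
    end = min(seq_len, n_tokens)
    windows = [(0, end, 0)]
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        start = max(0, new_end - seq_len)
        windows.append((start, new_end, end))
        end = new_end
    return windows


# Toy plan: 300 tokens, context of 128, stride of 64.
plan = sliding_windows(300, seq_len=128, stride=64)
counts = [0] * 300
for start, end, score_from in plan:
    for t in range(score_from, end):
        counts[t] += 1
```

A smaller stride gives each scored token more left context at the cost of more forward passes.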
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"ttt_gradient_clip":1,"ttt_batch_seqs":32}
AdamW
weight_decay: 1e-8
momentum: null
other_params: {"slot_learning_rate":0.001,"slot_steps":3}
Regularization
LN scale
parameters: {"enabled":true}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
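
A warmdown schedule holds the learning rate constant and then decays it over the final steps of training. The sketch below assumes a linear decay to zero over the last 3500 steps; the decay shape, the total step count and the base learning rate are not given in this summary and are illustrative only.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant learning rate, then linear decay to zero at total_steps."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return base_lr
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / warmdown_steps


# Hypothetical run length of 10000 steps with a base learning rate of 1.0.
schedule = [warmdown_lr(s, 10000, 1.0) for s in (0, 6500, 8250, 10000)]
```

For this hypothetical run the rate stays at its base value until step 6500 and reaches half of it at step 8250.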
Other
SLOT
SLOT eval-time augmentation that optimizes a single additive delta vector at the last hidden layer during evaluation.
parameters: {"enabled":true,"learning_rate":0.001,"steps":3}
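
The description above can be sketched as follows: the model stays frozen, its final hidden states are computed once, and a single vector `delta` (one value per hidden dimension, shared across positions) is added to them before the output projection and tuned for a few AdamW steps. This is a pure-Python toy under stated assumptions, not the PR's implementation: the tiny matrices, the function names, the AdamW betas and epsilon (common defaults), and the choice of which tokens the delta is fitted on are all illustrative. Only the step count, learning rate and weight decay come from the listed parameters.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def loss_and_grad(hidden, targets, W, delta):
    """Mean cross-entropy (nats) and its gradient with respect to delta.

    hidden: T x d final-layer states, W: V x d output projection,
    delta: length-d vector added to every position before the projection.
    """
    d = len(delta)
    grad = [0.0] * d
    loss = 0.0
    for h, y in zip(hidden, targets):
        x = [hi + di for hi, di in zip(h, delta)]
        logits = [sum(wj * xj for wj, xj in zip(row, x)) for row in W]
        p = softmax(logits)
        loss -= math.log(p[y])
        for v, row in enumerate(W):
            coeff = p[v] - (1.0 if v == y else 0.0)
            for j in range(d):
                grad[j] += coeff * row[j]
    n = len(hidden)
    return loss / n, [g / n for g in grad]


def slot_delta(hidden, targets, W, steps=3, lr=1e-3, weight_decay=1e-8,
               betas=(0.9, 0.999), eps=1e-8):
    """Fit the additive delta with a hand-written AdamW update."""
    d = len(hidden[0])
    delta, m, v = [0.0] * d, [0.0] * d, [0.0] * d
    for t in range(1, steps + 1):
        _, g = loss_and_grad(hidden, targets, W, delta)
        for j in range(d):
            delta[j] *= 1.0 - lr * weight_decay          # decoupled decay
            m[j] = betas[0] * m[j] + (1 - betas[0]) * g[j]
            v[j] = betas[1] * v[j] + (1 - betas[1]) * g[j] * g[j]
            m_hat = m[j] / (1 - betas[0] ** t)
            v_hat = v[j] / (1 - betas[1] ** t)
            delta[j] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return delta


# Toy problem: 4 positions, hidden size 3, vocabulary of 4.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, -1.0, -1.0]]
hidden = [[0.1, 0.2, -0.1], [0.0, -0.3, 0.2], [0.3, 0.1, 0.0], [-0.2, 0.0, 0.1]]
targets = [0, 0, 1, 0]
loss_before, _ = loss_and_grad(hidden, targets, W, [0.0, 0.0, 0.0])
delta = slot_delta(hidden, targets, W)
loss_after, _ = loss_and_grad(hidden, targets, W, delta)
```

Because only `delta` is trained and the hidden states are reused, each step costs one output projection and its gradient rather than a full backward pass through the model.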
CTW
CTW eval-time augmentation was also tested; it did not improve BPB and is reported as a negative result.
parameters: {"weight":0.1,"depth":4}

Novel Contributions

  • First SLOT-based entry in Parameter Golf
  • Eval-time augmentation using SLOT integrated inside the TTT scoring loop
  • Reported consistent BPB improvement across three seeds with minimal overhead
  • Negative-result analysis of CTW as an eval-time augmentation
  • Demonstration that SLOT stacks on top of score-first TTT