PR #1313

open

Record: SLOT-24 Aggressive — val_bpb 0.8637 (3-seed mean)

by anthony-maio
val_bpb: 0.8637
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.7-15.8 MB

Training Techniques

Quantization
late QAT
bits: 6
scope: all
Architecture
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
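For reference, grouped-query attention with these head counts (8 query heads sharing 4 KV heads) can be sketched as below. This is a minimal NumPy illustration; the projection shapes and causal mask are assumptions, not the record's actual code.

```python
import numpy as np

def gqa(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: 8 query heads share 4 KV heads (2 per group).

    x: (T, d_model); wq: (d_model, n_heads*hd); wk, wv: (d_model, n_kv_heads*hd).
    """
    T = x.shape[0]
    hd = wq.shape[1] // n_heads
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    group = n_heads // n_kv_heads           # 2 query heads per KV head
    k = np.repeat(k, group, axis=1)         # broadcast 4 KV heads to 8
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(hd)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # causal mask
    scores = np.where(mask[None], -1e9, scores)
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    out = np.einsum('hqk,khd->qhd', attn, v)
    return out.reshape(T, n_heads * hd)
```

The KV cache for such a layer is half the size of full multi-head attention with 8 KV heads, which matters for the artifact-size budget above.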
LeakyReLU
LeakyReLU activation used in the MLP.
parameters: {"slope":0.5}
MLP3x
Three-layer MLP.
parameters: {"layers":3}
VE128
Value residual with 128-dimensional value embeddings (VE128).
parameters: {"dimensions":128}
BigramHash
Bigram hash embedding with 1024 buckets.
parameters: {"buckets":1024}
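A hashed bigram embedding of this kind maps each (previous, current) token pair into one of 1024 buckets of a small embedding table. The sketch below is an assumption about the mechanism; the mixing constants are illustrative and not the record's actual hash.

```python
N_BUCKETS = 1024  # from the record's parameters

def bigram_bucket(prev_id: int, cur_id: int, n_buckets: int = N_BUCKETS) -> int:
    """Hash the (previous, current) token pair into one of 1024 buckets."""
    h = (prev_id * 1000003 + cur_id) & 0xFFFFFFFF  # illustrative mixing
    h ^= h >> 16
    return h % n_buckets

def bigram_embed(tokens, table):
    """Look up a hashed-bigram embedding per position (position 0 pairs with id 0)."""
    prev = [0] + list(tokens[:-1])
    return [table[bigram_bucket(p, t)] for p, t in zip(prev, tokens)]
```

With 1024 buckets the table stays tiny regardless of vocabulary size, at the cost of hash collisions between rare bigrams.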
XSA
XSA applied across all 11 layers.
parameters: {"layers":11}
SmearGate
SmearGate component used in the architecture.
parameters: null
U-Net skip connections
U-Net style skip connections.
parameters: null
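One common wiring for U-Net skips in a block stack is to save first-half activations and add each back to the mirrored second-half block. The pairing scheme below is an illustrative assumption, not the record's exact wiring.

```python
def unet_stack(x, blocks):
    """Run a block stack with U-Net style skips: outputs of the first half
    are pushed on a stack and popped into the mirrored second-half inputs.
    `blocks` is a list of callables (the transformer blocks)."""
    half = len(blocks) // 2
    saved = []
    for i, block in enumerate(blocks):
        if i < half:
            x = block(x)
            saved.append(x)            # push encoder-side activation
        else:
            if saved:
                x = x + saved.pop()    # pop mirrored skip connection
            x = block(x)
    return x
```

With an odd depth like the 11 layers listed above, the middle block simply runs without a skip.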
Partial RoPE
Partial rotary positional embeddings.
parameters: {"train_eval_ratio":"16/64"}
LN Scale
Layer norm scale modification.
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
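The EMA update itself is one line per parameter; the decay value is from the record, the dict-of-arrays layout is an illustrative assumption.

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * current weights.
    Evaluation then uses ema_params instead of the raw training weights."""
    for k in params:
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * params[k]
    return ema_params
```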
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: {"stride":96}
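Sliding-window evaluation with stride 96 scores each token exactly once while giving it as much left context as the window allows. The span computation might look like the sketch below; the context length of 512 is an assumed placeholder, only the stride comes from the record.

```python
def eval_spans(n_tokens, context=512, stride=96):
    """Yield (win_start, win_end, score_from): each window scores only the
    positions in [score_from, win_end), at most `stride` of them, so every
    token is scored exactly once with up to `context` tokens of history."""
    spans = []
    scored_to = 0
    while scored_to < n_tokens:
        win_end = min(scored_to + stride, n_tokens)
        win_start = max(0, win_end - context)
        spans.append((win_start, win_end, scored_to))
        scored_to = win_end
    return spans
```

A smaller stride gives later-in-window tokens more context per scored position but costs proportionally more forward passes.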
Test-Time Training
score-first TTT
parameters: {"steps":24,"learning_rate":0.012}
Regularization
weight decay
parameters: {"weight_decay":1e-8}
LR Schedule
cosine decay
parameters: {"start_lr":0.012,"end_lr":0.001}
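With these endpoints the schedule is fully determined by the standard cosine-decay formula:

```python
import math

def cosine_lr(step, total_steps, start_lr=0.012, end_lr=0.001):
    """Cosine decay from start_lr to end_lr over total_steps
    (endpoint values taken from the record's LR-schedule parameters)."""
    t = min(step, total_steps) / total_steps
    return end_lr + 0.5 * (start_lr - end_lr) * (1.0 + math.cos(math.pi * t))
```

Note the start LR matches the TTT learning rate listed above (0.012), though the record does not say whether that is deliberate.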

Novel Contributions

  • SLOT-24 eval-time hyperparameter tuning
  • 24-step score-first SLOT adaptation
  • Per-sample hidden delta plus logit bias optimization
  • Scored-position masking with stride 96
  • 3-seed mean validation improvement to 0.8637 bpb