PR #1414 (open)
Non-record: Discriminative TTT + SLOT-24, 3-seed verified (8xH100 SXM)
by Abhishek8108
val_bpb: 0.7093
Architecture
Transformer
Optimizer
AdamW
Artifact Size
16.15 MB
Training Techniques
Test-Time Training
TTT
parameters: {"variant":"Discriminative TTT","per_block_adaptive_lr":true,"pre_quantization":true}
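The TTT entry above (per-block adaptive LR) can be sketched generically as a test-time inner loop that takes gradient steps on each block's fast parameters with its own learning rate. This is a minimal numpy sketch under assumed interfaces; `ttt_adapt` and `grad_fn` are hypothetical names, and the actual discriminative objective and block structure are not specified by the PR.

```python
import numpy as np

def ttt_adapt(blocks, lrs, x, grad_fn, steps=1):
    """Generic test-time training inner loop (sketch): update each block's
    fast parameters by gradient descent on a test-time loss, with one
    (adaptive) learning rate per block, as suggested by
    per_block_adaptive_lr. grad_fn(params, x) returns the gradient of the
    test-time loss w.r.t. that block's parameters."""
    for params, lr in zip(blocks, lrs):
        for _ in range(steps):
            params -= lr * grad_fn(params, x)  # in-place fast-weight update
    return blocks
```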
Other
other
SLOT-24 per-sample evaluation-time delta optimization with hidden-state delta plus logit bias
parameters: {"steps":24}
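The logit-bias half of the SLOT-24 entry can be illustrated as a per-sample optimization: with the model frozen, take 24 gradient steps on an additive bias over the vocabulary, minimizing the sample's own next-token cross-entropy. A minimal numpy sketch; `slot_logit_bias` is a hypothetical name, the learning rate is an assumption, and the hidden-state delta half is omitted.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def slot_logit_bias(logits, targets, steps=24, lr=0.1):
    """Per-sample evaluation-time delta optimization (logit-bias part only):
    optimize one additive bias over the vocab for `steps` gradient steps,
    minimizing mean cross-entropy on this sample's targets.
    logits: (T, V) frozen model outputs; targets: (T,) token ids."""
    V = logits.shape[-1]
    bias = np.zeros(V)
    for _ in range(steps):
        p = softmax(logits + bias)           # (T, V) adapted distribution
        onehot = np.eye(V)[targets]          # (T, V)
        grad = (p - onehot).mean(axis=0)     # exact d(mean CE)/d(bias)
        bias -= lr * grad
    return bias
```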
Quantization
late QAT
bits: 6
scope: all
int6
bits: 6
scope: all
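The int6 entries above (bits: 6, scope: all) amount to a quantize-dequantize round trip over all weights. A minimal sketch assuming symmetric per-tensor scaling; the actual scaling granularity (per-tensor vs per-channel) is not specified by the PR.

```python
import numpy as np

def quant_dequant_int6(w, bits=6):
    """Symmetric fake quantization: map weights onto signed int6 levels
    (here per-tensor; the PR's granularity is unspecified), then scale
    back to float -- the quantize-dequantize round trip used in late QAT."""
    qmax = 2 ** (bits - 1) - 1                 # 31 for 6 bits
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w.copy()
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale
```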
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"start_frac":0.2,"every":50}
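The two weight-averaging entries can be sketched directly from their listed parameters: an EMA with decay 0.997, and an SWA schedule that starts averaging after 20% of training and snapshots every 50 steps. A minimal sketch; the function names are hypothetical.

```python
def ema_update(ema, params, decay=0.997):
    """Exponential moving average of parameters (decay=0.997 per the PR).
    Both arguments are dicts of name -> float/array."""
    for k in params:
        ema[k] = decay * ema[k] + (1.0 - decay) * params[k]
    return ema

def swa_should_snapshot(step, total_steps, start_frac=0.2, every=50):
    """SWA schedule per the PR: begin after start_frac of training,
    then take a snapshot for the running average every `every` steps."""
    return step >= int(start_frac * total_steps) and step % every == 0
```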
Architecture
GQA
Grouped query attention with fewer KV heads than query heads
parameters: {"num_heads":8,"num_kv_heads":4}
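The GQA entry (8 query heads, 4 KV heads) means each pair of query heads attends over one shared KV head. A minimal numpy sketch of the forward pass, assuming the common implementation that repeats KV heads across each group:

```python
import numpy as np

def gqa_attention(q, k, v, num_heads=8, num_kv_heads=4):
    """Grouped-query attention: groups of num_heads // num_kv_heads query
    heads share one KV head. q: (num_heads, T, d); k, v: (num_kv_heads, T, d)."""
    group = num_heads // num_kv_heads
    k = np.repeat(k, group, axis=0)   # broadcast each KV head to its group
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)     # softmax over keys
    return w @ v
```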
LeakyReLU
LeakyReLU squared activation
parameters: {"squared":true,"alpha":0.5}
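The squared LeakyReLU entry (alpha=0.5) can be sketched in a few lines. Note the squaring convention is an assumption: this sketch squares the LeakyReLU output while preserving its sign, so negative inputs stay negative.

```python
import numpy as np

def leaky_relu_squared(x, alpha=0.5):
    """LeakyReLU (slope alpha on the negative side) followed by an
    elementwise square. Sign-preserving squaring is an assumption here;
    the PR does not specify how negatives are squared."""
    y = np.where(x >= 0, x, alpha * x)
    return np.sign(y) * y ** 2
```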
XSA
Last 4 layers use XSA
parameters: {"layers":4}
Partial RoPE
Partial rotary position embeddings
parameters: {"dimensions":16,"base":64}
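Partial RoPE with dimensions=16 and base=64 rotates only the first 16 channels of each head and passes the rest through unchanged. A minimal sketch assuming the usual split-halves rotation layout; the exact channel layout in the PR is an assumption.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=64):
    """Apply rotary position embeddings to only the first rot_dims channels;
    remaining channels are position-independent. x: (T, d), rot_dims even."""
    T, d = x.shape
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)          # per-pair frequencies
    angles = np.arange(T)[:, None] * freqs[None, :]    # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate(
        [x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```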
LN Scale
LayerNorm scale modification
parameters: null
BigramHash
Bigram hash embedding
parameters: {"vocab_size":2048,"dim":128}
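The BigramHash entry (vocab_size=2048, dim=128) hashes each (previous, current) token pair into a small auxiliary embedding table. A minimal sketch; the hash mixing constant and the padding at position 0 are assumptions, not taken from the PR.

```python
import numpy as np

def bigram_hash_embed(tokens, table, vocab_size=2048):
    """Look up an extra embedding per position keyed by the (prev, cur)
    token bigram, hashed into a 2048-slot table. The multiplicative hash
    and the 0-padding for the first position are illustrative choices.
    tokens: (T,) int ids; table: (vocab_size, dim)."""
    prev = np.concatenate([[0], tokens[:-1]])          # shift right by one
    h = (prev * 1000003 + tokens) % vocab_size         # cheap bigram hash
    return table[h]                                    # (T, dim)
```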
SmearGate
SmearGate component
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
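Sliding-window evaluation with stride 64 scores the text in overlapping windows so that most tokens are evaluated with long left context while each token is scored exactly once. A minimal sketch of the window/score bookkeeping, assuming the common convention that only the last `stride` positions of each non-initial window are scored:

```python
def sliding_windows(tokens, window=2048, stride=64):
    """Return (chunk, score_from) pairs: the first window scores every
    position; each later window overlaps the previous by window - stride
    tokens and scores only its last `stride` positions."""
    out = []
    for start in range(0, max(len(tokens) - window, 0) + 1, stride):
        chunk = tokens[start:start + window]
        score_from = 0 if start == 0 else window - stride
        out.append((chunk, score_from))
    return out
```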
LR Schedule
warmdown
parameters: {"warmdown_steps":null}
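A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final steps. Since `warmdown_steps` is null in the entry above, it is left as a parameter in this minimal sketch:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps):
    """Constant LR, then linear warmdown to zero over the last
    warmdown_steps of training (the count is unspecified in the PR)."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```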
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Regularization
LN scale
parameters: null
Novel Contributions
- Combining discriminative TTT with SLOT-24 and verifying the result across 3 seeds
- Demonstrating that SLOT dominates and that pre-SLOT model improvements have diminishing returns
- Introducing quantization noise annealing (QNA) to train for quantization robustness
- Introducing stochastic quantized weight averaging (SQWA) to average quantize-dequantize snapshots in the quantization-friendly subspace
- Showing that QNA and SQWA reduce quantization gap but do not improve leaderboard BPB
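The QNA and SQWA contributions above can be sketched from their one-line descriptions. Both sketches are assumptions about mechanism, not the PR's implementation: QNA is rendered as quantization-step-sized noise scaled by an anneal factor, and SQWA as the mean of quantize-dequantized weight snapshots.

```python
import numpy as np

def quant_dequant(w, bits=6):
    """Symmetric per-tensor quantize-dequantize round trip."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w.copy()
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def qna_perturb(w, anneal, bits=6, rng=np.random):
    """QNA sketch: inject uniform noise of +/- half a quantization step,
    scaled by anneal in [0, 1]. The anneal schedule itself is not
    specified by the PR and is left to the caller."""
    qmax = 2 ** (bits - 1) - 1
    step_size = np.abs(w).max() / qmax
    return w + anneal * rng.uniform(-0.5, 0.5, size=w.shape) * step_size

def sqwa_average(snapshots, bits=6):
    """SQWA sketch: quantize-dequantize each weight snapshot first, then
    average, so the mean lies in the quantization-friendly subspace."""
    return np.mean([quant_dequant(w, bits) for w in snapshots], axis=0)
```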