PR #1240
Status: open
Non-record: Does SLOT violate causal dependence? (empirical test + question)
by andrewbaggio1
val_bpb
1.1064
Architecture
Transformer
Optimizer
AdamW
Artifact Size
—
Training Techniques
Quantization
GPTQ
bits: 6
scope: all
Test-Time Training
score-first TTT
parameters: {"epochs":3,"learning_rate":0.002,"chunk_tokens":65536}
Evaluation
sliding window eval
parameters: {"stride":64}
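A minimal sketch of strided sliding-window evaluation as configured above (stride 64). The window size and the uniform-model stand-in are illustrative assumptions, not values from this submission; the point is only that each position is scored once, with up to a full window of left context.

```python
import numpy as np

def sliding_window_nll(logprob_fn, tokens, window=256, stride=64):
    """Average NLL over `tokens` using a sliding window.

    The window advances `stride` tokens per step; only the newly entered
    positions are scored, so no position is double-counted.
    `logprob_fn(ctx)` returns a per-position log-probability array for `ctx`.
    """
    n = len(tokens)
    total_nll, scored, pos = 0.0, 0, 0
    while pos < n:
        end = min(pos + stride, n)
        start = max(0, end - window)     # keep at most `window` tokens of context
        ctx = tokens[start:end]
        lp = logprob_fn(ctx)             # log-probs aligned with ctx positions
        new = end - pos                  # score only the freshly added tokens
        total_nll += -lp[-new:].sum()
        scored += new
        pos = end
    return total_nll / scored
```

With a uniform model over a 256-token vocabulary, the average NLL comes out to log 256, which is a quick sanity check on the bookkeeping.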
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: null
AdamW
weight_decay: null
momentum: null
other_params: {"steps":8,"learning_rate":0.005}
Architecture
LeakyReLU
Uses LeakyReLU activation in the model architecture.
parameters: {"slope":0.5}
GQA
Grouped query attention with reduced KV heads.
parameters: {"kv_heads":4}
BigramHash
Bigram hash embedding component.
parameters: {"hash_size":2816,"embedding_dim":112}
SmearGate
SmearGate gating mechanism.
parameters: null
XSA
XSA attention/sequence component.
parameters: null
Partial RoPE
Rotary positional embeddings applied only to part of the head dimension.
parameters: {"dimensions":16}
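A sketch of the partial-RoPE idea listed above: rotate only the first 16 dimensions of each head and pass the rest through unchanged. The pairing convention (first half with second half of the rotated slice) and the frequency base are common-default assumptions, not taken from this submission.

```python
import numpy as np

def partial_rope(x, positions, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to only the first `rot_dims`
    dims of each head; the remaining dims are left unrotated.

    x: (seq, head_dim) array; positions: (seq,) integer positions.
    """
    positions = np.asarray(positions)
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)        # per-plane frequencies
    angles = positions[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]        # paired coordinates
    rotated = np.concatenate([x1 * cos - x2 * sin,   # 2D rotation per plane
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```

Since each plane undergoes a pure rotation, the per-position norm is preserved, and position 0 is the identity map.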
MLP3x
MLP widened to 3x.
parameters: null
VE128
VE128 architectural component.
parameters: null
KV head count
Uses 4 KV heads in grouped query attention.
parameters: {"heads":4}
Weight Averaging
SWA
parameters: null
EMA
parameters: null
Initialization
OrthoInit
Orthogonal initialization.
Regularization
LN scale
parameters: null
logit softcap
parameters: null
Compression
lzma
level: null
Novel Contributions
- Empirical test claiming SLOT violates causal dependence by changing NLL at other scored positions when a target token is flipped.
- Self-prediction comparison showing higher P(x_{t+1}) when the token is included among optimization targets.
- Combination of full Hessian GPTQ, score-first chunked TTT, and per-batch delta SLOT optimization.
- Per-batch delta optimization for SLOT with delta reset each batch and optimization over frozen hidden states.
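The causal-dependence claim above can be illustrated with a toy model (this is not the submission's code; a single shared delta optimized over frozen hidden states with a linear output head is assumed from the contribution list, and all sizes are arbitrary). Because the delta is fit jointly to every target in the chunk, flipping the last target token changes the optimized delta, which in turn changes the NLL at earlier scored positions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, T = 16, 8, 6                      # toy vocab, hidden dim, chunk length

W = rng.normal(size=(D, V))             # frozen output head
H = rng.normal(size=(T, D))             # frozen hidden states for one chunk
tokens = rng.integers(0, V, size=T)

def nll_per_pos(delta, toks):
    """Per-position NLL after adding a shared delta to the frozen hiddens."""
    logits = (H + delta) @ W
    logits -= logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(T), toks]

def optimize_delta(toks, steps=200, lr=0.1):
    """Gradient descent on the mean cross-entropy over ALL targets jointly."""
    delta = np.zeros(D)
    onehot = np.eye(V)[toks]
    for _ in range(steps):
        logits = (H + delta) @ W
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits); p /= p.sum(axis=1, keepdims=True)
        delta -= lr * ((p - onehot) @ W.T).mean(axis=0)  # dCE/ddelta
    return delta

# Flip only the LAST token; re-optimize; compare NLL at EARLIER positions.
flipped = tokens.copy(); flipped[-1] = (flipped[-1] + 1) % V
nll_a = nll_per_pos(optimize_delta(tokens), tokens)
nll_b = nll_per_pos(optimize_delta(flipped), flipped)
# A nonzero difference at positions < T-1 means a future token influenced
# the score of past positions, i.e. the scoring is not causally factorized.
print(np.abs(nll_a[:-1] - nll_b[:-1]).max())
```

This mirrors the test described in the first bullet: under ordinary autoregressive scoring the NLL at position t is a function of x_{<=t} only, so any change at earlier positions from flipping a later token is evidence of the violation being claimed.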