PR #1240
Status: open
Non-record: Does SLOT violate causal dependence? (empirical test + question)
by andrewbaggio1
val_bpb
1.1064
Architecture
Transformer
Optimizer
AdamW
Artifact Size
—
Training Techniques
Quantization
GPTQ
bits: 6
scope: all
Test-Time Training
score-first TTT
parameters: {"epochs":3,"learning_rate":0.002,"chunk_tokens":65536}
Evaluation
sliding window eval
parameters: {"stride":64}
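A minimal sketch of strided sliding-window evaluation as configured above (stride 64). The window size and the uniform-model stand-in are illustrative assumptions, not values from this submission; the point is only that each position is scored once, with up to a full window of left context.

```python
import numpy as np

def sliding_window_nll(logprob_fn, tokens, window=256, stride=64):
    """Average NLL over `tokens` using a sliding window.

    The window advances `stride` tokens per step; only the newly entered
    positions are scored, so no position is double-counted.
    `logprob_fn(ctx)` returns a per-position log-probability array for `ctx`.
    """
    n = len(tokens)
    total_nll, scored, pos = 0.0, 0, 0
    while pos < n:
        end = min(pos + stride, n)
        start = max(0, end - window)     # keep at most `window` tokens of context
        ctx = tokens[start:end]
        lp = logprob_fn(ctx)             # log-probs aligned with ctx positions
        new = end - pos                  # score only the freshly added tokens
        total_nll += -lp[-new:].sum()
        scored += new
        pos = end
    return total_nll / scored
```

With a uniform model over a 256-token vocabulary, the average NLL comes out to log 256, which is a quick sanity check on the bookkeeping.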
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: null
AdamW
weight_decay: null
momentum: null
other_params: {"steps":8,"learning_rate":0.005}
Architecture
LeakyReLU
Uses LeakyReLU activation in the model architecture.
parameters: {"slope":0.5}
GQA
Grouped query attention with reduced KV heads.
parameters: {"kv_heads":4}
BigramHash
Bigram hash embedding component.
parameters: {"hash_size":2816,"embedding_dim":112}
SmearGate
SmearGate gating mechanism.
parameters: null
XSA
XSA attention/sequence component.
parameters: null
Partial RoPE
Rotary positional embeddings applied only to part of the head dimension.
parameters: {"dimensions":16}
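A sketch of the partial-RoPE idea listed above: rotate only the first 16 dimensions of each head and pass the rest through unchanged. The pairing convention (first half with second half of the rotated slice) and the frequency base are common-default assumptions, not taken from this submission.

```python
import numpy as np

def partial_rope(x, positions, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to only the first `rot_dims`
    dims of each head; the remaining dims are left unrotated.

    x: (seq, head_dim) array; positions: (seq,) integer positions.
    """
    positions = np.asarray(positions)
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)        # per-plane frequencies
    angles = positions[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]        # paired coordinates
    rotated = np.concatenate([x1 * cos - x2 * sin,   # 2D rotation per plane
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```

Since each plane undergoes a pure rotation, the per-position norm is preserved, and position 0 is the identity map.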
MLP3x
MLP widened to 3x.
parameters: null
VE128
VE128 architectural component.
parameters: null
KV head count
Uses 4 KV heads in grouped query attention.
parameters: {"heads":4}
Weight Averaging
SWA
parameters: null
EMA
parameters: null
Initialization
OrthoInit
Orthogonal initialization.
Regularization
LN scale
parameters: null
logit softcap
parameters: null
Compression
lzma
level: null
Novel Contributions
- Empirical test claiming SLOT violates causal dependence by changing NLL at other scored positions when a target token is flipped.
- Self-prediction comparison showing higher P(x_{t+1}) when the token is included among optimization targets.
- Combination of full Hessian GPTQ, score-first chunked TTT, and per-batch delta SLOT optimization.
- Per-batch delta optimization for SLOT with delta reset each batch and optimization over frozen hidden states.
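The causal-dependence claim above can be illustrated with a toy model (this is not the submission's code; a single shared delta optimized over frozen hidden states with a linear output head is assumed from the contribution list, and all sizes are arbitrary). Because the delta is fit jointly to every target in the chunk, flipping the last target token changes the optimized delta, which in turn changes the NLL at earlier scored positions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, T = 16, 8, 6                      # toy vocab, hidden dim, chunk length

W = rng.normal(size=(D, V))             # frozen output head
H = rng.normal(size=(T, D))             # frozen hidden states for one chunk
tokens = rng.integers(0, V, size=T)

def nll_per_pos(delta, toks):
    """Per-position NLL after adding a shared delta to the frozen hiddens."""
    logits = (H + delta) @ W
    logits -= logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(T), toks]

def optimize_delta(toks, steps=200, lr=0.1):
    """Gradient descent on the mean cross-entropy over ALL targets jointly."""
    delta = np.zeros(D)
    onehot = np.eye(V)[toks]
    for _ in range(steps):
        logits = (H + delta) @ W
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits); p /= p.sum(axis=1, keepdims=True)
        delta -= lr * ((p - onehot) @ W.T).mean(axis=0)  # dCE/ddelta
    return delta

# Flip only the LAST token; re-optimize; compare NLL at EARLIER positions.
flipped = tokens.copy(); flipped[-1] = (flipped[-1] + 1) % V
nll_a = nll_per_pos(optimize_delta(tokens), tokens)
nll_b = nll_per_pos(optimize_delta(flipped), flipped)
# A nonzero difference at positions < T-1 means a future token influenced
# the score of past positions, i.e. the scoring is not causally factorized.
print(np.abs(nll_a[:-1] - nll_b[:-1]).max())
```

This mirrors the test described in the first bullet: under ordinary autoregressive scoring the NLL at position t is a function of x_{<=t} only, so any change at earlier positions from flipping a later token is evidence of the violation being claimed.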