PR #1209
open
Record: Full GPTQ + Score-First TTT + SLOT — val_bpb 1.1064 (3-seed mean)
by andrewbaggio1
val_bpb
1.1064
Architecture
Transformer
Optimizer
AdamW
Artifact Size
—
Training Techniques
Quantization
GPTQ
bits: 6
scope: all
late QAT
bits: null
scope: all
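To give a rough sense of what a 6-bit weight grid means, the sketch below applies symmetric round-to-nearest quantization to a single row of weights. It is not GPTQ: full GPTQ additionally uses Hessian information to compensate for rounding error, and that step is omitted here. The function names and the per-row scale are assumptions made for illustration and are not taken from the PR.

```python
def quantize_row(weights, bits=6):
    """Symmetric round-to-nearest quantization of one weight row (not GPTQ)."""
    qmax = 2 ** (bits - 1) - 1                  # 31 for 6 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest grid point and clamp to [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]
```

With this scheme the reconstruction error of each weight is at most half a grid step, which is the error GPTQ then tries to redistribute more cleverly.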
Test-Time Training
score-first TTT
parameters: {"epochs":3,"chunk_tokens":65536,"learning_rate":0.002}
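The score-first ordering can be sketched with a toy model. The example below assumes a single scalar parameter and a squared-error loss purely for illustration; the real submission adapts a transformer on chunks of 65536 tokens. Only the ordering (score a chunk, then train on it) and the default values for epochs, learning rate, and SGD momentum come from this card. Everything else, including the function name, is hypothetical.

```python
def score_first_ttt(chunks, theta=0.0, lr=0.002, momentum=0.9, epochs=3):
    """Toy score-first test-time training on a scalar model (illustrative only)."""
    velocity = 0.0
    scores = []
    for chunk in chunks:
        # 1) Score the chunk with the parameters as they are now,
        #    so the reported loss never reflects training on this chunk.
        loss = sum((x - theta) ** 2 for x in chunk) / len(chunk)
        scores.append(loss)
        # 2) Only afterwards adapt on the same chunk with SGD + momentum.
        for _ in range(epochs):
            grad = sum(2.0 * (theta - x) for x in chunk) / len(chunk)
            velocity = momentum * velocity + grad
            theta -= lr * velocity
    return scores, theta
```

Because each score is recorded before the update, the first chunk is always evaluated with the unadapted model, and later chunks benefit only from earlier ones.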
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"used_for":"TTT"}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"SLOT","steps":8,"learning_rate":0.005}
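A minimal sketch of the SLOT idea, assuming a toy setting: a frozen linear head reads a hidden vector, and only an additive delta vector is optimized with AdamW for 8 steps at learning rate 0.005 (the two values listed above). The squared-error loss, the single hidden vector, and all names are assumptions for illustration; the PR optimizes a per-batch delta inside a transformer.

```python
import math


def squared_error(hidden, head, target, delta):
    """Loss of the frozen head on the shifted hidden state."""
    pred = sum(w * (h + d) for w, h, d in zip(head, hidden, delta))
    return (pred - target) ** 2


def slot_delta(hidden, head, target, steps=8, lr=0.005,
               betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0):
    """Optimize only an additive delta in hidden space with AdamW."""
    delta = [0.0] * len(hidden)
    m = [0.0] * len(hidden)
    v = [0.0] * len(hidden)
    for t in range(1, steps + 1):
        pred = sum(w * (h + d) for w, h, d in zip(head, hidden, delta))
        err = pred - target
        grads = [2.0 * err * w for w in head]   # d(loss)/d(delta_i)
        for i, g in enumerate(grads):
            delta[i] -= lr * weight_decay * delta[i]      # decoupled decay
            m[i] = betas[0] * m[i] + (1 - betas[0]) * g
            v[i] = betas[1] * v[i] + (1 - betas[1]) * g * g
            m_hat = m[i] / (1 - betas[0] ** t)            # bias correction
            v_hat = v[i] / (1 - betas[1] ** t)
            delta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return delta
```

The model weights (`head` here) are never modified, so the delta can be discarded and re-fitted for every batch.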
Architecture
LeakyReLU
Uses a squared LeakyReLU activation in the model stack.
parameters: {"negative_slope":0.5}
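One plausible reading of "LeakyReLU squared" is LeakyReLU followed by an elementwise square. The scalar sketch below shows that reading with the listed negative slope of 0.5; the order of operations and the function name are assumptions.

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """LeakyReLU followed by squaring (assumed order of operations)."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```

Under this reading a negative input still yields a positive output, scaled by the square of the slope.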
GQA
Grouped query attention with reduced KV heads.
parameters: {"kv_heads":4,"query_groups":8}
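In grouped query attention, several query heads share one key/value head. The sketch below shows the usual index mapping, assuming 8 query heads sharing 4 KV heads. Reading "query_groups": 8 as the number of query heads is an assumption, and the function name is hypothetical.

```python
def kv_head_for_query(query_head, num_query_heads=8, num_kv_heads=4):
    """Return the KV head index that a given query head attends with."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly across KV heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size
```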
MLP3x
Expanded MLP width to 3x.
parameters: null
BigramHash
Bigram hash embedding component.
parameters: {"vocab_size":2816,"dim":112}
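A bigram hash embedding can be sketched as follows: each (previous token, current token) pair is hashed into one of 2816 buckets, and the bucket selects a 112-dimensional vector. The bucket count and dimension come from the parameters above. The hash function, the initialization scale, and all names are assumptions for illustration only.

```python
import random

BUCKETS = 2816   # "vocab_size" from the card
DIM = 112        # "dim" from the card


def bigram_bucket(prev_token, token, buckets=BUCKETS):
    # Hypothetical hash: the multiplier is an arbitrary large prime.
    return (prev_token * 1000003 + token) % buckets


def make_table(buckets=BUCKETS, dim=DIM, seed=0):
    """Randomly initialized embedding table, one row per hash bucket."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(buckets)]


def bigram_embedding(table, prev_token, token):
    """Look up the embedding row for a token pair."""
    return table[bigram_bucket(prev_token, token, len(table))]
```

Distinct pairs may collide in the same bucket; that is the trade-off that keeps the table far smaller than a full bigram vocabulary.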
SmearGate
SmearGate gating mechanism.
parameters: null
XSA
XSA attention variant.
parameters: null
Partial RoPE
Partial rotary positional embeddings.
parameters: {"dimensions":16}
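Partial RoPE rotates only a subset of each head's dimensions and leaves the rest position-independent. The sketch below rotates the first 16 dimensions of one head vector, matching the listed value. The half-split pairing convention, the frequency base of 10000, and the function name are assumptions.

```python
import math


def partial_rope(vec, position, rot_dims=16, base=10000.0):
    """Apply rotary embedding to the first rot_dims entries of vec only."""
    out = list(vec)
    half = rot_dims // 2
    for i in range(half):
        theta = position * base ** (-2.0 * i / rot_dims)
        a, b = vec[i], vec[i + half]       # pair dimension i with i + half
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out
```

Each rotation preserves the norm of the rotated slice, and dimensions from index 16 onward pass through unchanged.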
VE128
VE128 architectural component.
parameters: null
Regularization
LN scale
parameters: null
logit softcap
parameters: null
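Logit softcapping is commonly implemented as a scaled tanh, which keeps logits within a fixed range while staying close to the identity near zero. The card does not list a cap value, so the 30.0 below is a placeholder assumption, as is the function name.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly bound a logit to the interval [-cap, cap]."""
    return cap * math.tanh(logit / cap)
```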
Weight Averaging
EMA
parameters: null
SWA
parameters: null
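The two averaging schemes differ in how they weight past weights: EMA decays older values exponentially, while SWA takes a uniform mean over saved checkpoints. The sketch below shows both on flat weight lists. The decay of 0.999 and the function names are assumptions, since the card lists no parameters.

```python
def ema_update(average, weights, decay=0.999):
    """One exponential-moving-average step over a flat list of weights."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(average, weights)]


def swa_average(checkpoints):
    """Uniform average of several checkpoints (stochastic weight averaging)."""
    n = len(checkpoints)
    return [sum(vals) / n for vals in zip(*checkpoints)]
```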
Initialization
OrthoInit
Orthogonal initialization.
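As an illustration of orthogonal initialization, the sketch below builds a matrix with orthonormal rows by applying Gram-Schmidt to random Gaussian vectors. Deep learning frameworks typically use a QR decomposition instead; Gram-Schmidt is swapped in here to keep the example dependency-free. The function name and seed are hypothetical.

```python
import random


def orthogonal_rows(n_rows, n_cols, seed=0):
    """Return n_rows orthonormal vectors of length n_cols (n_rows <= n_cols)."""
    if n_rows > n_cols:
        raise ValueError("need n_rows <= n_cols for orthonormal rows")
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
        # Remove the components along every previously accepted row.
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        rows.append([a / norm for a in v])
    return rows
```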
Compression
lzma
level: null
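A minimal sketch of compressing quantized weights with Python's standard `lzma` module. The card lists no compression level, so `preset=9` is an assumption, and storing one byte per code (rather than bit-packing 6-bit values) is a simplification. The helper names are hypothetical.

```python
import lzma


def pack_codes(codes):
    """Store each signed integer code in one byte (no bit packing)."""
    return bytes((c + 256) % 256 for c in codes)


def compress_codes(codes, preset=9):
    """Compress packed codes with LZMA."""
    return lzma.compress(pack_codes(codes), preset=preset)


def decompress_codes(blob):
    """Invert compress_codes and recover the signed integer codes."""
    raw = lzma.decompress(blob)
    return [b - 256 if b > 127 else b for b in raw]
```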
Evaluation
sliding window eval
parameters: {"stride":64}
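Sliding window evaluation with a stride of 64 can be sketched as a window schedule: every window after the first advances by 64 tokens and scores only its newest tokens, so each scored token sees close to a full context. The context length of 256 and the function name below are assumptions; only the stride comes from the card.

```python
def sliding_windows(n_tokens, context=256, stride=64):
    """Return (start, score_from, end) triples covering every token once."""
    windows = []
    scored = 0
    while scored < n_tokens:
        # The first window scores a full context; later ones score `stride` tokens.
        end = min(n_tokens, context if scored == 0 else scored + stride)
        start = max(0, end - context)
        windows.append((start, scored, end))
        scored = end
    return windows
```

For each triple, the model reads tokens `start` to `end` and only the losses for positions `score_from` to `end` are counted.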
Novel Contributions
- Combines full Hessian GPTQ with score-first chunked TTT and SLOT
- Uses single-pass evaluation that complies with the score-before-update rule: each chunk is scored before the model adapts to it
- Applies per-batch delta vector optimization in hidden space (SLOT)
- Reports a 3-seed mean val_bpb of 1.1064