PR #303

open

[Non-record] XSA + EMA + TTT: Negative interaction study (val_bpb=1.1436)

by sseanliu
val_bpb
1.1436
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.3MB

Training Techniques

Quantization
int6
bits: 6
scope: all
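The int6 scheme above (6 bits, applied to all weights) can be sketched as symmetric per-tensor quantization. This is a minimal illustration, not the PR's actual code; the function names and the [-31, 31] integer range are assumptions (a signed 6-bit word holds [-32, 31]):

```python
def quantize_int6(weights):
    # Symmetric per-tensor quantization to 6 bits; the [-31, 31]
    # range is an assumption, chosen so the grid is symmetric.
    scale = max(abs(w) for w in weights) / 31 or 1.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from the int6 codes.
    return [x * scale for x in q]
```

Round-tripping a weight through quantize/dequantize incurs at most half a quantization step of error, which is what makes 6-bit storage viable for a 15.3MB artifact.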
Architecture
XSA
Exclusive Self-Attention on the last layers to remove self-information from attention outputs.
parameters: {"last_n_layers":4}
SmearGate
Gating mechanism used in the base model.
parameters: null
BigramHash
Hashes adjacent-token bigrams into an auxiliary vocabulary.
parameters: {"vocab_size":2048}
MLP3x
Transformer MLP with 3x expansion.
parameters: null
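A minimal sketch of the XSA idea, assuming "removing self-information" means masking the diagonal of the causal attention matrix so each position attends only to strictly earlier ones. The function is hypothetical and works on raw logits; the PR's implementation may differ:

```python
import math

def exclusive_causal_attention(scores):
    # scores[i][j] is the raw logit for query i attending to key j.
    # Causal mask plus diagonal exclusion: position i attends only to
    # strictly earlier positions j < i, removing self-information.
    n = len(scores)
    probs = []
    for i in range(n):
        logits = [scores[i][j] if j < i else float("-inf") for j in range(n)]
        m = max(logits)
        if m == float("-inf"):  # position 0 has no valid keys
            probs.append([0.0] * n)
            continue
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        probs.append([e / z for e in exps])
    return probs
```

Under this reading, the redundancy hypothesis with TTT is plausible: both techniques change how much a position's own content dominates its output.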
Weight Averaging
EMA
parameters: {"decay":0.997}
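The EMA update with the listed decay of 0.997 amounts to one line per parameter. A pure-Python sketch with a hypothetical helper name and flat weight lists for illustration:

```python
def ema_update(ema_weights, model_weights, decay=0.997):
    # ema <- decay * ema + (1 - decay) * current, per parameter,
    # using the PR's decay of 0.997.
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, model_weights)]
```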
Initialization
OrthoInit
Orthogonal initialization.
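Orthogonal initialization can be sketched via Gram-Schmidt on a random Gaussian matrix (libraries typically use a QR decomposition, which yields the same kind of orthonormal rows); the helper below is illustrative, not the PR's code:

```python
import random

def orthogonal_init(n, seed=0):
    # Build an n x n matrix with orthonormal rows by running
    # Gram-Schmidt over random Gaussian row vectors.
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    basis = []
    for v in rows:
        for b in basis:  # subtract projections onto earlier rows
            d = sum(x * y for x, y in zip(v, b))
            v = [x - d * y for x, y in zip(v, b)]
        norm = sum(x * x for x in v) ** 0.5
        basis.append([x / norm for x in v])
    return basis
```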
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"freeze_blocks":2,"momentum":0.9,"gradient_clipping":1}
Evaluation
sliding window eval
parameters: {"stride":64}
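Sliding-window evaluation with stride 64 scores each token exactly once while giving it up to `window` tokens of left context. A sketch of the span bookkeeping; only the stride comes from the PR, the window size is an assumption:

```python
def sliding_window_spans(n_tokens, window=256, stride=64):
    # Each span (context_start, score_start, end): the model conditions
    # on tokens[context_start:end] but only tokens[score_start:end] are
    # scored, so every token contributes to BPB exactly once.
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        spans.append((max(0, end - window), pos, end))
        pos = end
    return spans
```

A smaller stride gives later tokens more context per forward pass at the cost of more passes, which is why stride is reported alongside val_bpb.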
Compression
zstd
level: null

Novel Contributions

  • Tests whether TTT improves an XSA + EMA base model.
  • Finds that TTT hurts performance on the XSA + EMA model by 0.016 BPB.
  • Provides a negative interaction study suggesting XSA and TTT are mechanistically redundant.
  • Uses FlashAttention-2 (FA2) instead of FA3 due to environment constraints.
  • Reports reproducibility across two seeds.