PR #303

open

[Non-record] XSA + EMA + TTT: Negative interaction study (val_bpb=1.1436)

by sseanliu
val_bpb
1.1436
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.3MB

Training Techniques

Quantization
int6
bits: 6
scope: all
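The int6 scheme above (6 bits, applied to all weights) can be sketched as symmetric per-tensor quantization. This is a minimal illustration, not the PR's actual code; the function names and the [-31, 31] integer range are assumptions (a signed 6-bit word holds [-32, 31]):

```python
def quantize_int6(weights):
    # Symmetric per-tensor quantization to 6 bits; the [-31, 31]
    # range is an assumption, chosen so the grid is symmetric.
    scale = max(abs(w) for w in weights) / 31 or 1.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from the int6 codes.
    return [x * scale for x in q]
```

Round-tripping a weight through quantize/dequantize incurs at most half a quantization step of error, which is what makes 6-bit storage viable for a 15.3MB artifact.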
Architecture
XSA
Exclusive Self-Attention on the last layers to remove self-information from attention outputs.
parameters: {"last_n_layers":4}
SmearGate
Gating mechanism used in the base model.
parameters: null
BigramHash
Hashes adjacent-token bigrams into an auxiliary vocabulary.
parameters: {"vocab_size":2048}
MLP3x
Transformer MLP with 3x expansion.
parameters: null
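A minimal sketch of the XSA idea, assuming "removing self-information" means masking the diagonal of the causal attention matrix so each position attends only to strictly earlier ones. The function is hypothetical and works on raw logits; the PR's implementation may differ:

```python
import math

def exclusive_causal_attention(scores):
    # scores[i][j] is the raw logit for query i attending to key j.
    # Causal mask plus diagonal exclusion: position i attends only to
    # strictly earlier positions j < i, removing self-information.
    n = len(scores)
    probs = []
    for i in range(n):
        logits = [scores[i][j] if j < i else float("-inf") for j in range(n)]
        m = max(logits)
        if m == float("-inf"):  # position 0 has no valid keys
            probs.append([0.0] * n)
            continue
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        probs.append([e / z for e in exps])
    return probs
```

Under this reading, the redundancy hypothesis with TTT is plausible: both techniques change how much a position's own content dominates its output.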
Weight Averaging
EMA
parameters: {"decay":0.997}
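The EMA update with the listed decay of 0.997 amounts to one line per parameter. A pure-Python sketch with a hypothetical helper name and flat weight lists for illustration:

```python
def ema_update(ema_weights, model_weights, decay=0.997):
    # ema <- decay * ema + (1 - decay) * current, per parameter,
    # using the PR's decay of 0.997.
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, model_weights)]
```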
Initialization
OrthoInit
Orthogonal initialization.
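Orthogonal initialization can be sketched via Gram-Schmidt on a random Gaussian matrix (libraries typically use a QR decomposition, which yields the same kind of orthonormal rows); the helper below is illustrative, not the PR's code:

```python
import random

def orthogonal_init(n, seed=0):
    # Build an n x n matrix with orthonormal rows by running
    # Gram-Schmidt over random Gaussian row vectors.
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    basis = []
    for v in rows:
        for b in basis:  # subtract projections onto earlier rows
            d = sum(x * y for x, y in zip(v, b))
            v = [x - d * y for x, y in zip(v, b)]
        norm = sum(x * x for x in v) ** 0.5
        basis.append([x / norm for x in v])
    return basis
```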
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"freeze_blocks":2,"momentum":0.9,"gradient_clipping":1}
Evaluation
sliding window eval
parameters: {"stride":64}
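Sliding-window evaluation with stride 64 scores each token exactly once while giving it up to `window` tokens of left context. A sketch of the span bookkeeping; only the stride comes from the PR, the window size is an assumption:

```python
def sliding_window_spans(n_tokens, window=256, stride=64):
    # Each span (context_start, score_start, end): the model conditions
    # on tokens[context_start:end] but only tokens[score_start:end] are
    # scored, so every token contributes to BPB exactly once.
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        spans.append((max(0, end - window), pos, end))
        pos = end
    return spans
```

A smaller stride gives later tokens more context per forward pass at the cost of more passes, which is why stride is reported alongside val_bpb.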
Compression
zstd
level: null

Novel Contributions

  • Tests whether TTT improves an XSA + EMA base model.
  • Finds that TTT hurts performance on the XSA + EMA model by 0.016 BPB.
  • Provides a negative interaction study suggesting XSA and TTT are mechanistically redundant.
  • Uses FlashAttention-2 (FA2) instead of FA3 due to environment constraints.
  • Reports reproducibility across two seeds.