PR #807
Open
Non-record: Sequential Momentum TTT (val_bpb=1.0116, 3-seed mean, 4xA10G)
by connectwithprakash
val_bpb: 1.0116
Architecture: 10-layer GQA Transformer
Optimizer: Muon
Artifact Size: 10.85 MB
Training Techniques
Architecture
XSA4
Attention/sequence architecture modification used in the model.
parameters: null
SmearGate
Gating mechanism added to the model.
parameters: null
BigramHash
Bigram hashing component used to enrich token interactions.
parameters: {"dimensions":4096}
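The PR only specifies the table size (4096 slots). A minimal sketch of hashed-bigram features, where the multiplicative hash constant and the lookup interface are illustrative assumptions:

```python
import numpy as np

def bigram_hash_features(token_ids, table):
    """Look up a hashed-bigram embedding for each position.

    `table` has shape (num_slots, dim); the PR specifies 4096 slots.
    The multiplicative hash below is an illustrative assumption.
    """
    num_slots = table.shape[0]
    prev = np.roll(token_ids, 1)
    prev[0] = token_ids[0]                        # first token pairs with itself
    h = (prev * 1000003 + token_ids) % num_slots  # hash the (prev, cur) bigram
    return table[h]                               # (seq, dim)

rng = np.random.default_rng(0)
table = rng.standard_normal((4096, 64))
ids = rng.integers(0, 50000, size=16)
feats = bigram_hash_features(ids, table)
print(feats.shape)  # (16, 64)
```

The resulting features would typically be added to the token embeddings so each position also sees a cheap signal about its immediate predecessor.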
MLP3x
Expanded MLP width to 3x.
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
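With 8 query heads sharing 4 KV heads, each KV head serves two query heads. A generic single-layer sketch (the head dimension and toy inputs are illustrative; only the 8/4 head split comes from the PR):

```python
import numpy as np

def grouped_query_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Causal GQA: repeat each KV head so every query head has a match.

    q: (seq, n_heads, d); k, v: (seq, n_kv_heads, d).
    """
    seq, _, d = q.shape
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)               # (seq, n_heads, d)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d)
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)         # causal mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', w, v)

rng = np.random.default_rng(0)
q = rng.standard_normal((10, 8, 16))
k = rng.standard_normal((10, 4, 16))
v = rng.standard_normal((10, 4, 16))
out = grouped_query_attention(q, k, v)
print(out.shape)  # (10, 8, 16)
```

Halving the KV heads halves the KV cache while keeping the full query head count, which is the usual GQA trade-off.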
Weight Averaging
EMA
parameters: {"decay":0.997}
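One EMA step with the listed decay of 0.997 is just a per-parameter blend; a minimal sketch:

```python
import numpy as np

def ema_update(ema_params, params, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * w (PR decay 0.997)."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k]
            for k in params}

w = {"layer0": np.array([1.0, 2.0])}
ema = {"layer0": np.zeros(2)}
for _ in range(3):            # a few steps with the weights held fixed
    ema = ema_update(ema, w)
print(ema["layer0"])          # approaches w geometrically, factor 0.997/step
```

At decay 0.997 the average has an effective horizon of roughly 1/(1-0.997) ≈ 333 steps.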
Test-Time Training
LoRA TTT
parameters: {"momentum":0.3,"sequential":true,"cross_document":true}
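Only the sequential order, the cross-document EMA, and momentum=0.3 come from the PR; the toy adaptation objective, learning rate, and step count below are illustrative. A sketch of documents processed in order, each warm-started from an EMA of earlier documents' adapted states:

```python
import numpy as np

def sequential_momentum_ttt(docs, momentum=0.3, lr=0.25, steps=2):
    """Sequential TTT sketch: adapt a per-document delta, warm-started from
    a running EMA of prior documents' deltas (cross-document state).

    Toy objective: pull delta toward the document's feature vector.
    """
    ema = np.zeros_like(docs[0])
    deltas = []
    for doc in docs:
        delta = ema.copy()                    # warm start from the EMA
        for _ in range(steps):                # inner test-time steps
            delta -= lr * 2 * (delta - doc)   # grad of ||delta - doc||^2
        ema = momentum * delta + (1 - momentum) * ema
        deltas.append(delta)
    return deltas

docs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
out = sequential_momentum_ttt(docs)
print(out[0])
```

In the real run the adapted state would be LoRA adapter matrices rather than a raw delta, and the inner gradients would come from the language-model loss on the current document.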
Initialization
asymmetric LoRA initialization
A is initialized with Kaiming noise plus the EMA weights, while B is initialized from the EMA weights only.
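A minimal sketch of that asymmetric init; the Kaiming scale (std = sqrt(2/fan_in)) is the standard formula, and reading "Kaiming plus EMA" as elementwise addition is our assumption:

```python
import numpy as np

def asymmetric_lora_init(ema_A, ema_B, rng):
    """Asymmetric LoRA init: A = Kaiming noise + EMA weights, B = EMA only."""
    fan_in = ema_A.shape[1]
    kaiming = rng.standard_normal(ema_A.shape) * np.sqrt(2.0 / fan_in)
    A = kaiming + ema_A
    B = ema_B.copy()
    return A, B

rng = np.random.default_rng(0)
ema_A = np.zeros((8, 64))      # rank-8 adapter for a 64-dim layer (toy sizes)
ema_B = np.zeros((64, 8))
A, B = asymmetric_lora_init(ema_A, ema_B, rng)
print(A.shape, B.shape)
```

Keeping B at the EMA value means the adapter's initial product A @ B stays close to the EMA adapter while A still receives fresh exploration noise.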
Quantization
mixed int5/int6
bits: null
scope: MLP and attention weights
Compression
lzma
level: 6
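The quantization and compression stages compose: quantize weights to a low bit width, then LZMA-compress the packed integers at preset 6. Per-tensor symmetric scaling and the int8 container below are assumptions; the PR mixes int5 and int6 across different tensors:

```python
import lzma
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to `bits` (int5 or int6 in the PR)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_symmetric(w, bits=5)

# Compress the integer codes; preset=6 matches the PR's LZMA level.
blob = lzma.compress(q.tobytes(), preset=6)
print(len(blob) < w.nbytes)     # far smaller than the float32 payload

# Dequantization error is bounded by half a quantization step.
err = np.abs(q.astype(np.float32) * scale - w).max()
print(err <= scale / 2 + 1e-6)
```

A real artifact would also bit-pack the 5-bit codes before compression; LZMA's entropy coding recovers much of that saving even from the byte-aligned layout shown here.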
Evaluation
full evaluation
parameters: {"seeds":[1337,42,2025]}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Regularization
magnitude pruning
parameters: {"sparsity":0.03}
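At 3% sparsity, magnitude pruning zeroes only the smallest-magnitude tail of each tensor. A sketch; per-tensor (rather than global) thresholding is an assumption, since the PR does not say which:

```python
import numpy as np

def magnitude_prune(w, sparsity=0.03):
    """Zero out the smallest-magnitude fraction of weights (PR sparsity 0.03)."""
    k = int(round(sparsity * w.size))
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)
pruned = magnitude_prune(w)
print((pruned == 0).mean())     # ~0.03
```

Light pruning like this mainly helps the downstream LZMA stage: runs of zeros compress much better than near-zero floats.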
Other
other
Learned activation mixing using relu^2 and leaky_relu(0.5)^2 blend.
parameters: null
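A sketch of the blended activation; treating the learned mix as a single sigmoid-constrained scalar is our assumption (the PR only says the mixing is learned, with relu^2 and leaky_relu(0.5)^2 as the two branches):

```python
import numpy as np

def mixed_activation(x, alpha):
    """Blend relu(x)^2 with leaky_relu(x, 0.5)^2 via a learned weight alpha."""
    relu_sq = np.maximum(x, 0.0) ** 2
    leaky_sq = np.where(x > 0, x, 0.5 * x) ** 2
    a = 1.0 / (1.0 + np.exp(-alpha))   # keep the mix weight in (0, 1)
    return a * relu_sq + (1.0 - a) * leaky_sq

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(mixed_activation(x, alpha=0.0))  # alpha=0 gives a 50/50 blend
```

For positive inputs both branches equal x^2, so the learned weight only controls how much signal survives on the negative side.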
Novel Contributions
- Sequential Momentum TTT with cross-document LoRA EMA during test-time training
- Warm-starting LoRA adapters across document batches using an EMA of prior batch weights
- Asymmetric LoRA initialization where A uses Kaiming noise plus EMA and B uses EMA only
- Mixed int5/int6 quantization combined with LZMA compression to fit under the artifact limit