PR #807
Open
Non-record: Sequential Momentum TTT (val_bpb=1.0116, 3-seed mean, 4xA10G)
by connectwithprakash
val_bpb: 1.0116
Architecture: 10-layer GQA Transformer
Optimizer: Muon
Artifact Size: 10.85 MB
Training Techniques
Architecture
XSA4
Attention/sequence architecture modification used in the model.
parameters: null
SmearGate
Gating mechanism added to the model.
parameters: null
BigramHash
Bigram hashing component used to enrich token interactions.
parameters: {"dimensions":4096}
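The PR only specifies the table size (4096 slots). A minimal sketch of hashed-bigram features, where the multiplicative hash constant and the lookup interface are illustrative assumptions:

```python
import numpy as np

def bigram_hash_features(token_ids, table):
    """Look up a hashed-bigram embedding for each position.

    `table` has shape (num_slots, dim); the PR specifies 4096 slots.
    The multiplicative hash below is an illustrative assumption.
    """
    num_slots = table.shape[0]
    prev = np.roll(token_ids, 1)
    prev[0] = token_ids[0]                        # first token pairs with itself
    h = (prev * 1000003 + token_ids) % num_slots  # hash the (prev, cur) bigram
    return table[h]                               # (seq, dim)

rng = np.random.default_rng(0)
table = rng.standard_normal((4096, 64))
ids = rng.integers(0, 50000, size=16)
feats = bigram_hash_features(ids, table)
print(feats.shape)  # (16, 64)
```

The resulting features would typically be added to the token embeddings so each position also sees a cheap signal about its immediate predecessor.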
MLP3x
Expanded MLP width to 3x.
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
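With 8 query heads sharing 4 KV heads, each KV head serves two query heads. A generic single-layer sketch (the head dimension and toy inputs are illustrative; only the 8/4 head split comes from the PR):

```python
import numpy as np

def grouped_query_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Causal GQA: repeat each KV head so every query head has a match.

    q: (seq, n_heads, d); k, v: (seq, n_kv_heads, d).
    """
    seq, _, d = q.shape
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)               # (seq, n_heads, d)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d)
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)         # causal mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', w, v)

rng = np.random.default_rng(0)
q = rng.standard_normal((10, 8, 16))
k = rng.standard_normal((10, 4, 16))
v = rng.standard_normal((10, 4, 16))
out = grouped_query_attention(q, k, v)
print(out.shape)  # (10, 8, 16)
```

Halving the KV heads halves the KV cache while keeping the full query head count, which is the usual GQA trade-off.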
Weight Averaging
EMA
parameters: {"decay":0.997}
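One EMA step with the listed decay of 0.997 is just a per-parameter blend; a minimal sketch:

```python
import numpy as np

def ema_update(ema_params, params, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * w (PR decay 0.997)."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k]
            for k in params}

w = {"layer0": np.array([1.0, 2.0])}
ema = {"layer0": np.zeros(2)}
for _ in range(3):            # a few steps with the weights held fixed
    ema = ema_update(ema, w)
print(ema["layer0"])          # approaches w geometrically, factor 0.997/step
```

At decay 0.997 the average has an effective horizon of roughly 1/(1-0.997) ≈ 333 steps.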
Test-Time Training
LoRA TTT
parameters: {"momentum":0.3,"sequential":true,"cross_document":true}
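Only the sequential order, the cross-document EMA, and momentum=0.3 come from the PR; the toy adaptation objective, learning rate, and step count below are illustrative. A sketch of documents processed in order, each warm-started from an EMA of earlier documents' adapted states:

```python
import numpy as np

def sequential_momentum_ttt(docs, momentum=0.3, lr=0.25, steps=2):
    """Sequential TTT sketch: adapt a per-document delta, warm-started from
    a running EMA of prior documents' deltas (cross-document state).

    Toy objective: pull delta toward the document's feature vector.
    """
    ema = np.zeros_like(docs[0])
    deltas = []
    for doc in docs:
        delta = ema.copy()                    # warm start from the EMA
        for _ in range(steps):                # inner test-time steps
            delta -= lr * 2 * (delta - doc)   # grad of ||delta - doc||^2
        ema = momentum * delta + (1 - momentum) * ema
        deltas.append(delta)
    return deltas

docs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
out = sequential_momentum_ttt(docs)
print(out[0])
```

In the real run the adapted state would be LoRA adapter matrices rather than a raw delta, and the inner gradients would come from the language-model loss on the current document.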
Initialization
asymmetric LoRA initialization
A is initialized with Kaiming noise plus the EMA weights, while B is initialized from the EMA weights only.
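A minimal sketch of that asymmetric init; the Kaiming scale (std = sqrt(2/fan_in)) is the standard formula, and reading "Kaiming plus EMA" as elementwise addition is our assumption:

```python
import numpy as np

def asymmetric_lora_init(ema_A, ema_B, rng):
    """Asymmetric LoRA init: A = Kaiming noise + EMA weights, B = EMA only."""
    fan_in = ema_A.shape[1]
    kaiming = rng.standard_normal(ema_A.shape) * np.sqrt(2.0 / fan_in)
    A = kaiming + ema_A
    B = ema_B.copy()
    return A, B

rng = np.random.default_rng(0)
ema_A = np.zeros((8, 64))      # rank-8 adapter for a 64-dim layer (toy sizes)
ema_B = np.zeros((64, 8))
A, B = asymmetric_lora_init(ema_A, ema_B, rng)
print(A.shape, B.shape)
```

Keeping B at the EMA value means the adapter's initial product A @ B stays close to the EMA adapter while A still receives fresh exploration noise.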
Quantization
mixed int5/int6
bits: null
scope: MLP and attention weights
Compression
lzma
level: 6
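The quantization and compression stages compose: quantize weights to a low bit width, then LZMA-compress the packed integers at preset 6. Per-tensor symmetric scaling and the int8 container below are assumptions; the PR mixes int5 and int6 across different tensors:

```python
import lzma
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to `bits` (int5 or int6 in the PR)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_symmetric(w, bits=5)

# Compress the integer codes; preset=6 matches the PR's LZMA level.
blob = lzma.compress(q.tobytes(), preset=6)
print(len(blob) < w.nbytes)     # far smaller than the float32 payload

# Dequantization error is bounded by half a quantization step.
err = np.abs(q.astype(np.float32) * scale - w).max()
print(err <= scale / 2 + 1e-6)
```

A real artifact would also bit-pack the 5-bit codes before compression; LZMA's entropy coding recovers much of that saving even from the byte-aligned layout shown here.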
Evaluation
full evaluation
parameters: {"seeds":[1337,42,2025]}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Regularization
magnitude pruning
parameters: {"sparsity":0.03}
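At 3% sparsity, magnitude pruning zeroes only the smallest-magnitude tail of each tensor. A sketch; per-tensor (rather than global) thresholding is an assumption, since the PR does not say which:

```python
import numpy as np

def magnitude_prune(w, sparsity=0.03):
    """Zero out the smallest-magnitude fraction of weights (PR sparsity 0.03)."""
    k = int(round(sparsity * w.size))
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)
pruned = magnitude_prune(w)
print((pruned == 0).mean())     # ~0.03
```

Light pruning like this mainly helps the downstream LZMA stage: runs of zeros compress much better than near-zero floats.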
Other
other
Learned activation mixing using relu^2 and leaky_relu(0.5)^2 blend.
parameters: null
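A sketch of the blended activation; treating the learned mix as a single sigmoid-constrained scalar is our assumption (the PR only says the mixing is learned, with relu^2 and leaky_relu(0.5)^2 as the two branches):

```python
import numpy as np

def mixed_activation(x, alpha):
    """Blend relu(x)^2 with leaky_relu(x, 0.5)^2 via a learned weight alpha."""
    relu_sq = np.maximum(x, 0.0) ** 2
    leaky_sq = np.where(x > 0, x, 0.5 * x) ** 2
    a = 1.0 / (1.0 + np.exp(-alpha))   # keep the mix weight in (0, 1)
    return a * relu_sq + (1.0 - a) * leaky_sq

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(mixed_activation(x, alpha=0.0))  # alpha=0 gives a 50/50 blend
```

For positive inputs both branches equal x^2, so the learned weight only controls how much signal survives on the negative side.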
Novel Contributions
- Sequential Momentum TTT with cross-document LoRA EMA during test-time training
- Warm-starting LoRA adapters across document batches using an EMA of prior batch weights
- Asymmetric LoRA initialization where A uses Kaiming noise plus EMA and B uses EMA only
- Mixed int5/int6 quantization combined with LZMA compression to fit under the artifact limit