PR #1045

open

[Non-Record] XSA-all-layers + VRL + bigram3072 + lzma9 — 1.1509 bpb, AdamW TTT findings

by Hilo-Hilo
val_bpb: 1.1509
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.3 MB

Training Techniques

Architecture
XSA
Cross-attention applied to all layers instead of only the last few layers.
parameters: {"layers":11}
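A minimal numpy sketch of the idea (module shapes and the single-head form are assumptions, not from the PR): the cross-attention block reads queries from the current layer's hidden states and keys/values from an external memory, and is applied in every one of the 11 layers rather than only the last few.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x, mem, Wq, Wk, Wv):
    """Single-head cross-attention: queries from the layer input x,
    keys/values from an external memory sequence mem."""
    q, k, v = x @ Wq, mem @ Wk, mem @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 512
x = rng.normal(size=(8, d))      # current hidden states (8 positions)
mem = rng.normal(size=(16, d))   # external context sequence
# XSA-all-layers: run the cross-attention block in all 11 layers
# (a baseline would apply it only in the last few).
for layer in range(11):
    Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.02 for _ in range(3))
    x = x + cross_attention(x, mem, Wq, Wk, Wv)  # residual add
print(x.shape)  # (8, 512)
```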
Value Residual
Adds residual value gating (V = V + residual_V).
parameters: null
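A sketch of the additive form stated above (V = V + residual_V), assuming the residual is the first layer's value matrix as in value-residual-style schemes; the shared-residual choice and toy sizes are assumptions.

```python
import numpy as np

def values_with_residual(x, Wv, v1):
    """Value residual (sketch): later layers' value matrices get an
    earlier layer's values added back, V = V + residual_V."""
    v = x @ Wv
    return v if v1 is None else v + v1

rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=(8, d))
v1 = None
for layer in range(4):
    Wv = rng.normal(size=(d, d)) * 0.02
    v = values_with_residual(x, Wv, v1)
    if v1 is None:
        v1 = v  # keep layer-0 values as the shared residual
print(v.shape)  # (8, 64)
```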
BigramHash
3072-vocab bigram head with reduced embedding dimension.
parameters: {"vocab_size":3072,"dimensions":112}
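One way such a head can work, sketched with the PR's sizes (3072 buckets, dim 112): hash each (previous, current) token pair into a small table and look up a low-dimensional embedding. The hash multiplier and base vocab size are arbitrary choices for illustration, not from the PR.

```python
import numpy as np

BUCKETS, DIM = 3072, 112

def bigram_bucket(prev_tok, tok):
    """Hash the (prev, current) token pair into one of 3072 buckets.
    The odd multiplier is an arbitrary illustrative constant."""
    return (prev_tok * 1000003 + tok) % BUCKETS

rng = np.random.default_rng(0)
bigram_emb = rng.normal(size=(BUCKETS, DIM)) * 0.02  # 3072 x 112 table

tokens = [17, 250, 3001, 42]
feats = np.stack([bigram_emb[bigram_bucket(a, b)]
                  for a, b in zip(tokens, tokens[1:])])
print(feats.shape)  # (3, 112)
```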
Quantization
STE QAT
bits: 6
scope: all
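A minimal sketch of 6-bit fake quantization as used in STE QAT: the forward pass rounds weights to the quantized grid, while the backward pass (under the straight-through estimator) treats the rounding as identity so gradients flow. Only the forward fake-quant is shown here; symmetric per-tensor scaling is an assumption.

```python
import numpy as np

def fake_quant(w, bits=6):
    """Symmetric per-tensor fake quantization to `bits` bits.
    Forward: scale, round to the integer grid, rescale. Under STE
    the round() is treated as identity in the backward pass."""
    qmax = 2 ** (bits - 1) - 1          # 31 for signed 6-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.linspace(-1.0, 1.0, 7)
wq = fake_quant(w, bits=6)
print(np.abs(w - wq).max())  # rounding error is at most scale / 2
```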
Compression
lzma
level: 9
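The artifact-side compression is straightforward with the stdlib `lzma` module at preset 9; the weight blob below is a stand-in for the serialized quantized weights.

```python
import lzma
import numpy as np

# Stand-in for the serialized quantized weight blob.
weights = np.round(np.random.default_rng(0).normal(size=4096) * 31).astype(np.int8)
raw = weights.tobytes()

packed = lzma.compress(raw, preset=9)      # max compression preset
restored = np.frombuffer(lzma.decompress(packed), dtype=np.int8)

print(len(raw), len(packed))               # lossless round-trip
```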
Evaluation
sliding window eval
parameters: {"stride":64}
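A sketch of the sliding-window evaluation scheme with stride 64 (the window length of 512 is an assumption): windows advance by the stride, and after the first window only the final `stride` positions are scored, so each token is scored once with near-full left context.

```python
import numpy as np

def sliding_windows(n_tokens, window=512, stride=64):
    """Yield (start, score_start, end) spans: each window of `window`
    tokens advances by `stride`; only the last `stride` positions of
    each window after the first are scored, so every token is scored
    exactly once with near-full left context."""
    start = 0
    while start + window <= n_tokens:
        score_start = start if start == 0 else start + window - stride
        yield start, score_start, start + window
        start += stride

spans = list(sliding_windows(704, window=512, stride=64))
scored = sum(end - s for _, s, end in spans)
print(len(spans), scored)  # every one of the 704 tokens scored once
```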
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3}
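The full-TTT protocol (fine-tune a copy of the weights on the test sequence itself before scoring it) can be sketched on a toy model. The regressor and plain gradient descent below are stand-ins for the LM and the PR's AdamW run; only the lr=0.002 / 3-epoch schedule is from the PR, and note the PR's finding is that this setup hurt val_bpb.

```python
import numpy as np

def ttt_adapt(w, xs, ys, lr=0.002, epochs=3):
    """Full test-time training sketch: fine-tune a copy of the
    weights on the test sequence before scoring it. Toy model:
    least-squares regressor with plain gradient descent."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2 * xs.T @ (xs @ w - ys) / len(xs)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
xs = rng.normal(size=(32, 4))
ys = xs @ rng.normal(size=4)

w0 = np.zeros(4)
w_ttt = ttt_adapt(w0, xs, ys)
loss0 = np.mean((xs @ w0 - ys) ** 2)
loss1 = np.mean((xs @ w_ttt - ys) ** 2)
print(loss0 > loss1)  # adaptation reduces loss on this convex toy
```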
Optimizer
AdamW
weight_decay: 0.01
momentum: null
other_params: {"betas":[0.9,0.999],"eps":1e-8}
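For reference, a single AdamW update with the hyperparameters listed above, written out in numpy; the key point is that the weight decay is decoupled, applied directly to the weights rather than folded into the gradient.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW update: decoupled weight decay acts on the weights
    directly, not on the gradient (unlike Adam + L2)."""
    m = betas[0] * m + (1 - betas[0]) * g
    v = betas[1] * v + (1 - betas[1]) * g * g
    m_hat = m / (1 - betas[0] ** t)      # bias correction
    v_hat = v / (1 - betas[1] ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * weight_decay * w
    return w, m, v

w = np.array([1.0, -2.0])
g = np.array([0.5, 0.5])
m = np.zeros(2); v = np.zeros(2)
w, m, v = adamw_step(w, g, m, v, t=1)
print(w)  # ~[0.99899, -2.00098]: step of ~lr, plus decay toward 0
```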

Novel Contributions

  • XSA applied to all 11 layers of the 11L d512 stack
  • Value Residual Learning added on XSA layers
  • bigram3072 head with dimension 112
  • lzma preset 9 used to reduce artifact size
  • Measured that full AdamW TTT at lr=0.002 (3 epochs) significantly degrades val_bpb compared with no TTT