PR #1254 (open)

Record: XSA + LoRA TTT (val_bpb=1.1070)

by Elarwei001

val_bpb: 1.1070
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 14.4 MB

Training Techniques

Architecture

  • XSA — Exclusive Self Attention applied to all layers. parameters: {"layers": 11}
  • GQA — Grouped-query attention with 4 KV heads. parameters: {"attention_heads": 8, "kv_heads": 4, "d_model": 416}
  • MLP3x — 3x MLP expansion with LeakyReLU(0.5)^2 activation. parameters: {"expansion": 3}
  • LeakyReLU — LeakyReLU(0.5)^2 activation used in the MLP. parameters: {"slope": 0.5}
  • Sliding-window eval — Sliding-window attention with window size 192. parameters: {"window_size": 192}
  • RoPE — Rotary positional encoding. parameters: null
  • Weight tying — Tied input/output embeddings. parameters: null
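The record lists the MLP3x block only by its parameters; as an illustration, a minimal NumPy sketch of a 3x-expanded MLP with the LeakyReLU(0.5)^2 activation might look as follows (weight names and init scale are placeholders, not the submission's):

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU with the listed slope (0.5), followed by squaring."""
    y = np.where(x >= 0, x, slope * x)
    return y * y

def mlp3x(x, w_in, w_out):
    """MLP with 3x expansion: d_model -> 3*d_model -> d_model."""
    return leaky_relu_sq(x @ w_in) @ w_out

d_model = 416  # matches the record's d_model
rng = np.random.default_rng(0)
w_in = rng.standard_normal((d_model, 3 * d_model)) * 0.02
w_out = rng.standard_normal((3 * d_model, d_model)) * 0.02
x = rng.standard_normal((8, d_model))
out = mlp3x(x, w_in, w_out)
assert out.shape == (8, d_model)
```

Note that squaring makes the activation non-negative and smooth at zero while the leaky branch keeps gradient flowing for negative pre-activations.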
Quantization

  • QAT — bits: 6, scope: all
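The record does not specify the QAT scheme beyond "6 bits, all weights". A common realization is symmetric per-tensor fake quantization in the forward pass; a minimal sketch (per-tensor scaling is an assumption here, as is the rounding mode):

```python
import numpy as np

def fake_quant(w, bits=6):
    """Symmetric per-tensor fake quantization: quantize, then dequantize.

    During QAT this runs in the forward pass; gradients typically pass
    through unchanged (straight-through estimator).
    """
    qmax = 2 ** (bits - 1) - 1  # 31 for 6-bit signed
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w  # all-zero tensor: nothing to quantize
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.linspace(-1.0, 1.0, 7)
wq = fake_quant(w)
# Every value lands within half a quantization step of the original.
assert np.abs(wq - w).max() <= (1 / 31) / 2 + 1e-12
```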
Compression

  • zlib — level: null
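The zlib stage presumably compresses the serialized (quantized) checkpoint to stay under the artifact size limit; a round-trip sketch with Python's stdlib zlib, where the buffer is a stand-in for the real artifact:

```python
import zlib
import numpy as np

# Hypothetical weight buffer; the real artifact is the quantized checkpoint.
weights = np.zeros(1_000_000, dtype=np.float16)
raw = weights.tobytes()

packed = zlib.compress(raw)  # level unspecified in the record; zlib default used
assert len(packed) < len(raw)

restored = np.frombuffer(zlib.decompress(packed), dtype=np.float16)
assert np.array_equal(restored, weights)
```

Low-bit QAT helps here twice: quantized weights need fewer distinct values, which zlib's entropy coding exploits.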
Test-Time Training

  • LoRA TTT — parameters: {"rank": 8, "learning_rate": 0.01}
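A rank-8 LoRA adapter, as used here for test-time training, is a frozen base matmul plus a trainable low-rank delta. The zero-initialized B matrix makes the adapted layer start out identical to the base layer, so TTT begins from the trained model. A minimal sketch (class name, init scale, and alpha are illustrative, not the submission's):

```python
import numpy as np

class LoRALinear:
    """y = x W + (x A) B * (alpha / rank); only A and B train at test time."""

    def __init__(self, w, rank=8, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w                                   # frozen base weight
        self.a = rng.standard_normal((w.shape[0], rank)) * 0.01
        self.b = np.zeros((rank, w.shape[1]))        # zero init: delta starts at 0
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.w + (x @ self.a) @ self.b * self.scale

rng = np.random.default_rng(1)
w = rng.standard_normal((16, 16))
layer = LoRALinear(w, rank=8)
x = rng.standard_normal((4, 16))
# At init the adapter contributes nothing.
assert np.allclose(layer(x), x @ w)
```

Per the contributions list, adapters attach to Q, V, and the LM head, and are optimized on the test stream at lr 0.01 while the base weights stay frozen.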
Optimizer

  • AdamW — weight_decay: 0.1, momentum: null, other_params: {"lr": 0.001, "gradient_clipping": 1}
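For reference, one AdamW update with the listed hyperparameters (decoupled weight decay 0.1, lr 1e-3, gradient clipping at norm 1) can be sketched as; beta values are AdamW defaults, not stated in the record:

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
               weight_decay=0.1, clip=1.0):
    """One AdamW step: clip gradient, update moments, decay weights decoupled
    from the adaptive update."""
    norm = np.linalg.norm(g)
    if norm > clip:
        g = g * (clip / norm)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)          # bias correction, t is 1-indexed
    v_hat = v / (1 - b2 ** t)
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * p)
    return p, m, v

p, m, v = adamw_step(np.array(1.0), np.array(1.0), 0.0, 0.0, t=1)
```

The key AdamW property shown is that weight decay multiplies the parameter directly rather than being folded into the gradient.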
Sequence Length

  • train_length: 256, eval_length: 256

Novel Contributions

  • XSA applied across all layers
  • LoRA-based test-time training with rank-8 adapters on Q, V, and LM head
  • Int6 quantization-aware training to fit the artifact size limit
  • BPE-8192 tokenizer for large BPB gains
  • Size-optimized 11-layer Transformer configuration with 416-dimensional hidden size