| Field | Value |
| --- | --- |
| val_bpb (validation bits per byte) | 1.1070 |
| Architecture | Transformer |
| Optimizer | AdamW |
| Artifact Size | 14.4 MB |
Training Techniques
Architecture
XSA
Exclusive self-attention applied to all 11 layers.
parameters: {"layers":11}
GQA
Grouped query attention with 4 KV heads.
parameters: {"attention_heads":8,"kv_heads":4,"d_model":416}
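As a hedged sketch, the listed GQA shapes (8 query heads sharing 4 KV heads at d_model=416, so head_dim=52) can be exercised in NumPy. The random weights, causal masking, and exact projection layout are illustrative assumptions, not the model's.

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    T, d = x.shape
    hd = d // n_heads                      # 416 / 8 = 52
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    # Each KV head serves n_heads // n_kv_heads = 2 query heads.
    rep = n_heads // n_kv_heads
    k = np.repeat(k, rep, axis=1)
    v = np.repeat(v, rep, axis=1)
    scores = np.einsum("thd,shd->hts", q, k) / np.sqrt(hd)
    # Causal mask: position t attends only to positions <= t.
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask[None], -1e9, scores)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    out = np.einsum("hts,shd->thd", w, v)
    return out.reshape(T, d)

rng = np.random.default_rng(0)
d, hd = 416, 52
x = rng.standard_normal((5, d))
wq = rng.standard_normal((d, d)) * 0.02
wk = rng.standard_normal((d, 4 * hd)) * 0.02   # KV projections are half-width
wv = rng.standard_normal((d, 4 * hd)) * 0.02
y = gqa_attention(x, wq, wk, wv)
```

Halving the KV heads shrinks the K/V projections (and the KV cache) while keeping 8-way query attention.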
MLP3x
3x MLP expansion with LeakyReLU(0.5)^2 activation.
parameters: {"expansion":3}
LeakyReLU
LeakyReLU(0.5)^2 activation used in the MLP.
parameters: {"slope":0.5}
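A minimal sketch of the MLP block these two entries describe: a 3x expansion (416 → 1248 → 416) with the squared LeakyReLU(0.5) activation. Weight initialization is illustrative; bias terms are omitted as an assumption.

```python
import numpy as np

def sq_leaky_relu(x, slope=0.5):
    # LeakyReLU with negative slope 0.5, then squared: LeakyReLU(0.5)(x)^2.
    y = np.where(x >= 0, x, slope * x)
    return y * y

def mlp3x(x, w1, w2):
    # 3x expansion: d_model -> 3*d_model -> d_model (416 -> 1248 -> 416 here).
    return sq_leaky_relu(x @ w1) @ w2

rng = np.random.default_rng(0)
d = 416
w1 = rng.standard_normal((d, 3 * d)) * 0.02
w2 = rng.standard_normal((3 * d, d)) * 0.02
out = mlp3x(rng.standard_normal((2, d)), w1, w2)
```

Note the squaring makes the activation non-negative and non-monotone; the 0.5 negative slope keeps gradient signal for negative pre-activations.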
Sliding Window (eval)
Sliding-window attention with window size 192, applied at evaluation time.
parameters: {"window_size":192}
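A sketch of the attention mask this implies. Whether the window of 192 counts the current token is an assumption (counted here); the card does not say.

```python
import numpy as np

def sliding_window_mask(T, window=192):
    # True where attention is allowed: causal, and each position sees at most
    # the last `window` tokens, itself included (an assumption).
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    return (j <= i) & (i - j < window)

m = sliding_window_mask(6, window=3)
```

With train/eval lengths of 256, a 192-token window trims long-range attention only modestly, which keeps the eval BPB cost small.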
RoPE
Rotary positional encoding.
parameters: null
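Since no RoPE parameters are listed, this sketch assumes the common defaults: base 10000 and the rotate-half pairing convention (dimension i paired with i + head_dim/2).

```python
import numpy as np

def rope(x, base=10000.0):
    # Rotary positional encoding on a (T, head_dim) tensor.
    T, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(T), freqs)          # (T, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1_i, x2_i) pair by a position- and frequency-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

x = np.random.default_rng(0).standard_normal((4, 52))  # head_dim = 416 / 8
y = rope(x)
```

Being a pure rotation, RoPE adds no parameters, consistent with `parameters: null`.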
weight tying
Input token embedding and output LM-head weights are shared (tied).
parameters: null
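Tying can be sketched in two lines: one matrix serves as both the token embedding and the LM head. The vocab size 8192 (BPE-8192, per the contributions list) and d_model=416 come from this card.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.standard_normal((8192, 416)) * 0.02  # single shared matrix

def embed(ids):
    return E[ids]              # lookup: token ids -> vectors

def lm_logits(h):
    return h @ E.T             # reuse of E is the tying; saves 8192*416 params

logits = lm_logits(embed(np.array([1, 2, 3])))
```

At this scale the shared matrix (~3.4 M parameters) is a large fraction of the model, so tying matters for the 14.4 MB artifact budget.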
Quantization
QAT
bits: 6
scope: all
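A hedged sketch of the 6-bit fake-quantization step used in QAT. Symmetric per-tensor scaling is an assumption; the card only lists bits=6, scope=all. During training the forward pass uses the rounded values while gradients flow through unchanged (straight-through estimator).

```python
import numpy as np

def fake_quant_int6(w):
    # Symmetric 6-bit quantization: integers in [-31, 31], one scale per tensor.
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31)
    return q * scale, q.astype(np.int8)

w = np.linspace(-1.0, 1.0, 101)
w_q, q = fake_quant_int6(w)
```

Training against the quantized forward pass is what lets the final int6 artifact match the float model's BPB.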
Compression
zlib
level: null
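A sketch of the final artifact stage: zlib over the serialized weights. The card leaves the level unset (null); level=9 below is purely illustrative. Quantized int6 values (stored here one per int8 byte) compress far better than raw float32.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(100_000).astype(np.float32)
scale = np.abs(w).max() / 31.0
q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)  # 6-bit values, int8 storage

raw_f32 = w.tobytes()
raw_q = q.tobytes()
packed = zlib.compress(raw_q, 9)  # level is an illustrative choice, not the card's
```

Quantization caps the per-weight entropy at 6 bits, so zlib can squeeze the byte stream well below its stored size.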
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01}
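A sketch of the rank-8 LoRA adapter used for test-time training. The rank (8) and learning rate (0.01) are from the card; the zero-init of B and the small A scale are standard LoRA choices, assumed here. Which layers get adapters (Q, V, and the LM head, per the contributions list) is wired up elsewhere; at test time only A and B would be updated on the incoming sequence.

```python
import numpy as np

class LoRALinear:
    # Frozen base weight W plus a low-rank update B @ A: y = x W^T + x A^T B^T.
    def __init__(self, w, rank=8, seed=0):
        d_out, d_in = w.shape
        rng = np.random.default_rng(seed)
        self.w = w                                    # frozen base weight
        self.a = rng.standard_normal((rank, d_in)) * 0.01
        self.b = np.zeros((d_out, rank))              # zero init: adapter starts as a no-op

    def __call__(self, x):
        return x @ self.w.T + (x @ self.a.T) @ self.b.T

rng = np.random.default_rng(1)
base = rng.standard_normal((416, 416)) * 0.02
layer = LoRALinear(base)
x = rng.standard_normal((3, 416))
```

At rank 8 each adapted 416x416 layer adds only 2 * 8 * 416 trainable parameters, which keeps the test-time updates cheap.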
Optimizer
AdamW
weight_decay: 0.1
momentum: null
other_params: {"lr":0.001,"gradient_clipping":1}
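One AdamW step with the listed settings (lr=0.001, weight_decay=0.1, gradient clipping at 1.0) can be sketched as below; the betas and eps are the common defaults, which the card does not state.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.1, clip=1.0):
    gnorm = np.linalg.norm(g)
    if gnorm > clip:
        g = g * (clip / gnorm)            # global-norm gradient clipping at 1.0
    m = b1 * m + (1 - b1) * g             # first-moment EMA
    v = b2 * v + (1 - b2) * g * g         # second-moment EMA
    mhat = m / (1 - b1 ** t)              # bias correction
    vhat = v / (1 - b2 ** t)
    # Decoupled weight decay: applied to w directly, not through the gradient.
    w = w - lr * mhat / (np.sqrt(vhat) + eps) - lr * wd * w
    return w, m, v

w = np.zeros(4)
m = np.zeros(4)
v = np.zeros(4)
w, m, v = adamw_step(w, np.ones(4), m, v, t=1)
```

The decoupled decay term is what distinguishes AdamW from Adam with L2 regularization, hence the separate `weight_decay: 0.1` entry.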
Sequence Length
sequence_length
train_length: 256
eval_length: 256
Novel Contributions
- XSA applied across all layers
- LoRA-based test-time training with rank-8 adapters on Q, V, and LM head
- Int6 quantization-aware training to fit the artifact size limit
- BPE-8192 tokenizer for large BPB gains
- Size-optimized 11-layer Transformer configuration with 416-dimensional hidden size