| Field | Value |
| --- | --- |
| val_bpb (validation bits per byte) | 1.1070 |
| Architecture | Transformer |
| Optimizer | AdamW |
| Artifact Size | 14.4 MB |
Training Techniques
Architecture
XSA
Exclusive self-attention applied to all 11 layers.
parameters: {"layers":11}
GQA
Grouped query attention with 4 KV heads.
parameters: {"attention_heads":8,"kv_heads":4,"d_model":416}
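As a hedged sketch, the listed GQA shapes (8 query heads sharing 4 KV heads at d_model=416, so head_dim=52) can be exercised in NumPy. The random weights, causal masking, and exact projection layout are illustrative assumptions, not the model's.

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    T, d = x.shape
    hd = d // n_heads                      # 416 / 8 = 52
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    # Each KV head serves n_heads // n_kv_heads = 2 query heads.
    rep = n_heads // n_kv_heads
    k = np.repeat(k, rep, axis=1)
    v = np.repeat(v, rep, axis=1)
    scores = np.einsum("thd,shd->hts", q, k) / np.sqrt(hd)
    # Causal mask: position t attends only to positions <= t.
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask[None], -1e9, scores)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    out = np.einsum("hts,shd->thd", w, v)
    return out.reshape(T, d)

rng = np.random.default_rng(0)
d, hd = 416, 52
x = rng.standard_normal((5, d))
wq = rng.standard_normal((d, d)) * 0.02
wk = rng.standard_normal((d, 4 * hd)) * 0.02   # KV projections are half-width
wv = rng.standard_normal((d, 4 * hd)) * 0.02
y = gqa_attention(x, wq, wk, wv)
```

Halving the KV heads shrinks the K/V projections (and the KV cache) while keeping 8-way query attention.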
MLP3x
3x MLP expansion with LeakyReLU(0.5)^2 activation.
parameters: {"expansion":3}
LeakyReLU
LeakyReLU(0.5)^2 activation used in the MLP.
parameters: {"slope":0.5}
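A minimal sketch of the MLP block these two entries describe: a 3x expansion (416 → 1248 → 416) with the squared LeakyReLU(0.5) activation. Weight initialization is illustrative; bias terms are omitted as an assumption.

```python
import numpy as np

def sq_leaky_relu(x, slope=0.5):
    # LeakyReLU with negative slope 0.5, then squared: LeakyReLU(0.5)(x)^2.
    y = np.where(x >= 0, x, slope * x)
    return y * y

def mlp3x(x, w1, w2):
    # 3x expansion: d_model -> 3*d_model -> d_model (416 -> 1248 -> 416 here).
    return sq_leaky_relu(x @ w1) @ w2

rng = np.random.default_rng(0)
d = 416
w1 = rng.standard_normal((d, 3 * d)) * 0.02
w2 = rng.standard_normal((3 * d, d)) * 0.02
out = mlp3x(rng.standard_normal((2, d)), w1, w2)
```

Note the squaring makes the activation non-negative and non-monotone; the 0.5 negative slope keeps gradient signal for negative pre-activations.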
Sliding Window (eval)
Sliding-window attention with window size 192, applied at evaluation time.
parameters: {"window_size":192}
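A sketch of the attention mask this implies. Whether the window of 192 counts the current token is an assumption (counted here); the card does not say.

```python
import numpy as np

def sliding_window_mask(T, window=192):
    # True where attention is allowed: causal, and each position sees at most
    # the last `window` tokens, itself included (an assumption).
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    return (j <= i) & (i - j < window)

m = sliding_window_mask(6, window=3)
```

With train/eval lengths of 256, a 192-token window trims long-range attention only modestly, which keeps the eval BPB cost small.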
RoPE
Rotary positional encoding.
parameters: null
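Since no RoPE parameters are listed, this sketch assumes the common defaults: base 10000 and the rotate-half pairing convention (dimension i paired with i + head_dim/2).

```python
import numpy as np

def rope(x, base=10000.0):
    # Rotary positional encoding on a (T, head_dim) tensor.
    T, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(T), freqs)          # (T, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1_i, x2_i) pair by a position- and frequency-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

x = np.random.default_rng(0).standard_normal((4, 52))  # head_dim = 416 / 8
y = rope(x)
```

Being a pure rotation, RoPE adds no parameters, consistent with `parameters: null`.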
weight tying
Input token embedding and output LM-head weights are shared (tied).
parameters: null
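Tying can be sketched in two lines: one matrix serves as both the token embedding and the LM head. The vocab size 8192 (BPE-8192, per the contributions list) and d_model=416 come from this card.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.standard_normal((8192, 416)) * 0.02  # single shared matrix

def embed(ids):
    return E[ids]              # lookup: token ids -> vectors

def lm_logits(h):
    return h @ E.T             # reuse of E is the tying; saves 8192*416 params

logits = lm_logits(embed(np.array([1, 2, 3])))
```

At this scale the shared matrix (~3.4 M parameters) is a large fraction of the model, so tying matters for the 14.4 MB artifact budget.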
Quantization
QAT
bits: 6
scope: all
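A hedged sketch of the 6-bit fake-quantization step used in QAT. Symmetric per-tensor scaling is an assumption; the card only lists bits=6, scope=all. During training the forward pass uses the rounded values while gradients flow through unchanged (straight-through estimator).

```python
import numpy as np

def fake_quant_int6(w):
    # Symmetric 6-bit quantization: integers in [-31, 31], one scale per tensor.
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31)
    return q * scale, q.astype(np.int8)

w = np.linspace(-1.0, 1.0, 101)
w_q, q = fake_quant_int6(w)
```

Training against the quantized forward pass is what lets the final int6 artifact match the float model's BPB.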
Compression
zlib
level: null
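A sketch of the final artifact stage: zlib over the serialized weights. The card leaves the level unset (null); level=9 below is purely illustrative. Quantized int6 values (stored here one per int8 byte) compress far better than raw float32.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(100_000).astype(np.float32)
scale = np.abs(w).max() / 31.0
q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)  # 6-bit values, int8 storage

raw_f32 = w.tobytes()
raw_q = q.tobytes()
packed = zlib.compress(raw_q, 9)  # level is an illustrative choice, not the card's
```

Quantization caps the per-weight entropy at 6 bits, so zlib can squeeze the byte stream well below its stored size.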
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01}
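A sketch of the rank-8 LoRA adapter used for test-time training. The rank (8) and learning rate (0.01) are from the card; the zero-init of B and the small A scale are standard LoRA choices, assumed here. Which layers get adapters (Q, V, and the LM head, per the contributions list) is wired up elsewhere; at test time only A and B would be updated on the incoming sequence.

```python
import numpy as np

class LoRALinear:
    # Frozen base weight W plus a low-rank update B @ A: y = x W^T + x A^T B^T.
    def __init__(self, w, rank=8, seed=0):
        d_out, d_in = w.shape
        rng = np.random.default_rng(seed)
        self.w = w                                    # frozen base weight
        self.a = rng.standard_normal((rank, d_in)) * 0.01
        self.b = np.zeros((d_out, rank))              # zero init: adapter starts as a no-op

    def __call__(self, x):
        return x @ self.w.T + (x @ self.a.T) @ self.b.T

rng = np.random.default_rng(1)
base = rng.standard_normal((416, 416)) * 0.02
layer = LoRALinear(base)
x = rng.standard_normal((3, 416))
```

At rank 8 each adapted 416x416 layer adds only 2 * 8 * 416 trainable parameters, which keeps the test-time updates cheap.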
Optimizer
AdamW
weight_decay: 0.1
momentum: null
other_params: {"lr":0.001,"gradient_clipping":1}
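One AdamW step with the listed settings (lr=0.001, weight_decay=0.1, gradient clipping at 1.0) can be sketched as below; the betas and eps are the common defaults, which the card does not state.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.1, clip=1.0):
    gnorm = np.linalg.norm(g)
    if gnorm > clip:
        g = g * (clip / gnorm)            # global-norm gradient clipping at 1.0
    m = b1 * m + (1 - b1) * g             # first-moment EMA
    v = b2 * v + (1 - b2) * g * g         # second-moment EMA
    mhat = m / (1 - b1 ** t)              # bias correction
    vhat = v / (1 - b2 ** t)
    # Decoupled weight decay: applied to w directly, not through the gradient.
    w = w - lr * mhat / (np.sqrt(vhat) + eps) - lr * wd * w
    return w, m, v

w = np.zeros(4)
m = np.zeros(4)
v = np.zeros(4)
w, m, v = adamw_step(w, np.ones(4), m, v, t=1)
```

The decoupled decay term is what distinguishes AdamW from Adam with L2 regularization, hence the separate `weight_decay: 0.1` entry.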
Sequence Length
sequence_length
train_length: 256
eval_length: 256
Novel Contributions
- XSA applied across all layers
- LoRA-based test-time training with rank-8 adapters on Q, V, and LM head
- Int6 quantization-aware training to fit the artifact size limit
- BPE-8192 tokenizer for large BPB gains
- Size-optimized 11-layer Transformer configuration with 416-dimensional hidden size