PR #1045

open

[Non-Record] XSA-all-layers + VRL + bigram3072 + lzma9 — 1.1509 bpb, AdamW TTT findings

by Hilo-Hilo
val_bpb: 1.1509
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.3 MB

Training Techniques

Architecture
XSA
Cross-attention applied to all layers instead of only the last few layers.
parameters: {"layers":11}
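A minimal numpy sketch of the idea (module shapes and the single-head form are assumptions, not from the PR): the cross-attention block reads queries from the current layer's hidden states and keys/values from an external memory, and is applied in every one of the 11 layers rather than only the last few.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x, mem, Wq, Wk, Wv):
    """Single-head cross-attention: queries from the layer input x,
    keys/values from an external memory sequence mem."""
    q, k, v = x @ Wq, mem @ Wk, mem @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 512
x = rng.normal(size=(8, d))      # current hidden states (8 positions)
mem = rng.normal(size=(16, d))   # external context sequence
# XSA-all-layers: run the cross-attention block in all 11 layers
# (a baseline would apply it only in the last few).
for layer in range(11):
    Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.02 for _ in range(3))
    x = x + cross_attention(x, mem, Wq, Wk, Wv)  # residual add
print(x.shape)  # (8, 512)
```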
Value Residual
Adds residual value gating (V = V + residual_V).
parameters: null
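A sketch of the additive form stated above (V = V + residual_V), assuming the residual is the first layer's value matrix as in value-residual-style schemes; the shared-residual choice and toy sizes are assumptions.

```python
import numpy as np

def values_with_residual(x, Wv, v1):
    """Value residual (sketch): later layers' value matrices get an
    earlier layer's values added back, V = V + residual_V."""
    v = x @ Wv
    return v if v1 is None else v + v1

rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=(8, d))
v1 = None
for layer in range(4):
    Wv = rng.normal(size=(d, d)) * 0.02
    v = values_with_residual(x, Wv, v1)
    if v1 is None:
        v1 = v  # keep layer-0 values as the shared residual
print(v.shape)  # (8, 64)
```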
BigramHash
3072-vocab bigram head with reduced embedding dimension.
parameters: {"vocab_size":3072,"dimensions":112}
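One way such a head can work, sketched with the PR's sizes (3072 buckets, dim 112): hash each (previous, current) token pair into a small table and look up a low-dimensional embedding. The hash multiplier and base vocab size are arbitrary choices for illustration, not from the PR.

```python
import numpy as np

BUCKETS, DIM = 3072, 112

def bigram_bucket(prev_tok, tok):
    """Hash the (prev, current) token pair into one of 3072 buckets.
    The odd multiplier is an arbitrary illustrative constant."""
    return (prev_tok * 1000003 + tok) % BUCKETS

rng = np.random.default_rng(0)
bigram_emb = rng.normal(size=(BUCKETS, DIM)) * 0.02  # 3072 x 112 table

tokens = [17, 250, 3001, 42]
feats = np.stack([bigram_emb[bigram_bucket(a, b)]
                  for a, b in zip(tokens, tokens[1:])])
print(feats.shape)  # (3, 112)
```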
Quantization
STE QAT
bits: 6
scope: all
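A minimal sketch of 6-bit fake quantization as used in STE QAT: the forward pass rounds weights to the quantized grid, while the backward pass (under the straight-through estimator) treats the rounding as identity so gradients flow. Only the forward fake-quant is shown here; symmetric per-tensor scaling is an assumption.

```python
import numpy as np

def fake_quant(w, bits=6):
    """Symmetric per-tensor fake quantization to `bits` bits.
    Forward: scale, round to the integer grid, rescale. Under STE
    the round() is treated as identity in the backward pass."""
    qmax = 2 ** (bits - 1) - 1          # 31 for signed 6-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.linspace(-1.0, 1.0, 7)
wq = fake_quant(w, bits=6)
print(np.abs(w - wq).max())  # rounding error is at most scale / 2
```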
Compression
lzma
level: 9
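The artifact-side compression is straightforward with the stdlib `lzma` module at preset 9; the weight blob below is a stand-in for the serialized quantized weights.

```python
import lzma
import numpy as np

# Stand-in for the serialized quantized weight blob.
weights = np.round(np.random.default_rng(0).normal(size=4096) * 31).astype(np.int8)
raw = weights.tobytes()

packed = lzma.compress(raw, preset=9)      # max compression preset
restored = np.frombuffer(lzma.decompress(packed), dtype=np.int8)

print(len(raw), len(packed))               # lossless round-trip
```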
Evaluation
sliding window eval
parameters: {"stride":64}
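A sketch of the sliding-window evaluation scheme with stride 64 (the window length of 512 is an assumption): windows advance by the stride, and after the first window only the final `stride` positions are scored, so each token is scored once with near-full left context.

```python
import numpy as np

def sliding_windows(n_tokens, window=512, stride=64):
    """Yield (start, score_start, end) spans: each window of `window`
    tokens advances by `stride`; only the last `stride` positions of
    each window after the first are scored, so every token is scored
    exactly once with near-full left context."""
    start = 0
    while start + window <= n_tokens:
        score_start = start if start == 0 else start + window - stride
        yield start, score_start, start + window
        start += stride

spans = list(sliding_windows(704, window=512, stride=64))
scored = sum(end - s for _, s, end in spans)
print(len(spans), scored)  # every one of the 704 tokens scored once
```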
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3}
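The full-TTT protocol (fine-tune a copy of the weights on the test sequence itself before scoring it) can be sketched on a toy model. The regressor and plain gradient descent below are stand-ins for the LM and the PR's AdamW run; only the lr=0.002 / 3-epoch schedule is from the PR, and note the PR's finding is that this setup hurt val_bpb.

```python
import numpy as np

def ttt_adapt(w, xs, ys, lr=0.002, epochs=3):
    """Full test-time training sketch: fine-tune a copy of the
    weights on the test sequence before scoring it. Toy model:
    least-squares regressor with plain gradient descent."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2 * xs.T @ (xs @ w - ys) / len(xs)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
xs = rng.normal(size=(32, 4))
ys = xs @ rng.normal(size=4)

w0 = np.zeros(4)
w_ttt = ttt_adapt(w0, xs, ys)
loss0 = np.mean((xs @ w0 - ys) ** 2)
loss1 = np.mean((xs @ w_ttt - ys) ** 2)
print(loss0 > loss1)  # adaptation reduces loss on this convex toy
```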
Optimizer
AdamW
weight_decay: 0.01
momentum: null
other_params: {"betas":[0.9,0.999],"eps":1e-8}
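For reference, a single AdamW update with the hyperparameters listed above, written out in numpy; the key point is that the weight decay is decoupled, applied directly to the weights rather than folded into the gradient.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW update: decoupled weight decay acts on the weights
    directly, not on the gradient (unlike Adam + L2)."""
    m = betas[0] * m + (1 - betas[0]) * g
    v = betas[1] * v + (1 - betas[1]) * g * g
    m_hat = m / (1 - betas[0] ** t)      # bias correction
    v_hat = v / (1 - betas[1] ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * weight_decay * w
    return w, m, v

w = np.array([1.0, -2.0])
g = np.array([0.5, 0.5])
m = np.zeros(2); v = np.zeros(2)
w, m, v = adamw_step(w, g, m, v, t=1)
print(w)  # ~[0.99899, -2.00098]: step of ~lr, plus decay toward 0
```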

Novel Contributions

  • XSA applied to all 11 layers of the 11L d512 stack
  • Value Residual Learning added on XSA layers
  • bigram3072 head with dimension 112
  • lzma preset 9 used to reduce artifact size
  • Measured that full AdamW TTT at lr=0.002 (3 epochs) significantly degrades val_bpb compared with no TTT