PR #841
Add 11L XSA11 + BigramHash3072 + AdamW Legal TTT submission (open)
by someone114514
val_bpb: 1.1157
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,983,339 bytes
Training Techniques
Architecture
XSA
XSA enabled on all 11 transformer layers
parameters: {"layers":11}
BigramHash
BigramHash token representation with hashed buckets and learned dimension
parameters: {"buckets":3072,"dim":112}
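A minimal sketch of a hashed-bigram feature with 3072 buckets and a learned 112-dimensional representation. The hash constant, the padding of position 0, and how the feature is combined with the token embedding are assumptions; the submission states only the bucket count and dimension.

```python
import numpy as np

BUCKETS, DIM = 3072, 112  # from the submission's parameters

def bigram_buckets(tokens, buckets=BUCKETS):
    """Hash each (prev, cur) token pair into one of `buckets` bins.
    The multiplier is an arbitrary odd constant, not the real hash."""
    prev = np.concatenate(([0], tokens[:-1]))  # pad position 0 with token 0
    return (prev * 1000003 + tokens) % buckets

# Hypothetical learned table: one row per bucket.
table = np.random.default_rng(0).normal(size=(BUCKETS, DIM))

tokens = np.array([5, 17, 5, 17])
ids = bigram_buckets(tokens)
bigram_emb = table[ids]  # (seq_len, DIM) extra features per position
```

Identical bigrams land in the same bucket, so repeated pairs share one learned vector.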
tied embeddings
Input and output embeddings are tied
parameters: null
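Tied embeddings reuse one matrix for both the input lookup and the output projection, which halves the vocabulary-side parameter count. A small sketch (the vocabulary size here is illustrative; 512 is the model width stated under Novel Contributions):

```python
import numpy as np

vocab, dim = 1000, 512  # vocab size is a placeholder; dim matches the PR
E = np.random.default_rng(0).normal(size=(vocab, dim)) * 0.02

tokens = np.array([3, 14, 159])
h = E[tokens]       # input: embedding lookup
logits = h @ E.T    # output: the same matrix, transposed, produces logits
```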
Partial RoPE
Uses partial rotary positional embeddings
parameters: null
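Partial RoPE rotates only a leading fraction of each head's dimensions and passes the rest through unrotated. A sketch assuming a 50% rotated fraction and the usual base of 10000 (the submission does not state either value):

```python
import numpy as np

def partial_rope(x, rot_frac=0.5, base=10000.0):
    """Apply rotary embeddings to the first `rot_frac` of the last dim.
    x: (seq_len, head_dim). The rotated fraction is an assumption."""
    T, D = x.shape
    r = int(D * rot_frac) // 2 * 2          # rotated dims, forced even
    xr, xp = x[:, :r], x[:, r:]             # rotated part, pass-through part
    half = r // 2
    freqs = base ** (-np.arange(half) / half)
    ang = np.arange(T)[:, None] * freqs[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = xr[:, :half], xr[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, xp], axis=1)

x = np.ones((8, 64))
y = partial_rope(x)
```

Position 0 is a rotation by angle zero, so it comes back unchanged, and the tail dimensions are never touched.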
KV head count
Uses 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
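With 8 query heads and 4 KV heads this is grouped-query attention: each KV head serves two query heads, halving the KV cache. A sketch of the score computation; head_dim = 64 follows from the 512-dim model noted under Novel Contributions:

```python
import numpy as np

HEADS, KV_HEADS, HEAD_DIM = 8, 4, 64  # 512 / 8 heads = 64 per head

def gqa_scores(q, k):
    """q: (HEADS, T, HEAD_DIM), k: (KV_HEADS, T, HEAD_DIM).
    Each KV head is shared by HEADS // KV_HEADS query heads."""
    group = HEADS // KV_HEADS                # 2 query heads per KV head
    k_rep = np.repeat(k, group, axis=0)      # expand to (HEADS, T, HEAD_DIM)
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)

rng = np.random.default_rng(1)
q = rng.normal(size=(HEADS, 16, HEAD_DIM))
k = rng.normal(size=(KV_HEADS, 16, HEAD_DIM))
scores = gqa_scores(q, k)
```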
MLP3x
Three-layer MLP with LeakyReLU activations
parameters: {"mlp_layers":3}
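A sketch of the three-layer MLP block with LeakyReLU between layers. Only mlp_layers=3 is stated; the hidden width, the negative slope, and the absence of biases are assumptions here:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def mlp3x(x, w1, w2, w3):
    """Three matmuls with LeakyReLU between them (no biases, for brevity)."""
    h = leaky_relu(x @ w1)
    h = leaky_relu(h @ w2)
    return h @ w3

rng = np.random.default_rng(0)
d, hidden = 512, 1024          # hidden width is a guess
w1 = rng.normal(size=(d, hidden)) * 0.02
w2 = rng.normal(size=(hidden, hidden)) * 0.02
w3 = rng.normal(size=(hidden, d)) * 0.02
out = mlp3x(rng.normal(size=(4, d)), w1, w2, w3)
```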
Optimizer
Parallel Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.025}
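Muon updates matrix parameters with an approximately orthogonalized momentum buffer. A sketch of one step using the quintic Newton-Schulz iteration from public Muon implementations; how "Parallel" Muon shards this work is not stated in the PR and is not modeled here:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G. Coefficients are those used in
    public Muon code; convergence is approximate, not exact."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                              # keep the Gram matrix small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(w, grad, buf, lr=0.025, momentum=0.99):
    """One Muon update; matrix_lr=0.025 and momentum=0.99 are from the PR."""
    buf = momentum * buf + grad
    return w - lr * newton_schulz(buf), buf

rng = np.random.default_rng(3)
G = rng.normal(size=(4, 8))
O = newton_schulz(G)
s = np.linalg.svd(O, compute_uv=False)       # singular values pushed toward 1
w, g = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
w2, buf2 = muon_step(w, g, np.zeros_like(w))
```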
AdamW
weight_decay: 0.01
momentum: null
other_params: {"learning_rate":0.0001,"scope":"embeddings/scalars"}
Weight Averaging
EMA + SWA
parameters: {"swa":"tight","ema":true}
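A sketch of combining the two averages over a checkpoint sequence. The decay and the window size are assumptions; "tight" is read here as a small trailing SWA window, which the PR does not spell out:

```python
import numpy as np

def run_averages(checkpoints, ema_decay=0.99, swa_last=3):
    """EMA over the whole run plus SWA over the last `swa_last` checkpoints.
    Both hyperparameters are illustrative, not from the PR."""
    ema = checkpoints[0].copy()
    for w in checkpoints[1:]:
        ema = ema_decay * ema + (1 - ema_decay) * w
    swa = np.mean(checkpoints[-swa_last:], axis=0)
    return ema, swa

# Toy "checkpoints": constant vectors 0, 1, ..., 9.
ckpts = [np.full(4, float(i)) for i in range(10)]
ema, swa = run_averages(ckpts)
```

With a high decay the EMA stays close to early weights, while the tight SWA tracks the end of the run; the exported weights would pick one (or blend them).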
Quantization
int6
bits: 6
scope: final artifact export
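A sketch of symmetric per-tensor int6 quantization for the export step: 6 bits give integer levels in [-31, 31]. Grouping, zero points, and which tensors are quantized are not stated and are not modeled here:

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor quantization to 6-bit levels [-31, 31]."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.linspace(-1.0, 1.0, 9).astype(np.float32)
q, scale = quantize_int6(w)
w_hat = dequantize(q, scale)
```

Round-to-nearest bounds the per-weight error by half a quantization step (scale / 2).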
Compression
lzma
level: null
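The quantized artifact is then lzma-compressed for export. A sketch with Python's standard `lzma` module; storing each int6 value in a full int8 byte is a simplification here (real packing is not stated), with lzma squeezing out the slack bits:

```python
import lzma
import numpy as np

# Stand-in for the quantized weights: int6 values stored one per byte.
weights = np.random.default_rng(0).integers(-31, 32, size=100_000, dtype=np.int8)
raw = weights.tobytes()
blob = lzma.compress(raw, preset=9)   # preset is a guess; the PR gives level: null
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8)
```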
Evaluation
sliding window eval
parameters: {"stride":64}
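With a stride of 64 against a 2048-token window, each forward pass after the first scores only the newest 64 tokens while the rest of the window serves as context. A sketch of the span bookkeeping (window=2048 is assumed to match the training length; only stride=64 is stated):

```python
def eval_spans(n_tokens, window=2048, stride=64):
    """Return (start, end) ranges of scored tokens per forward pass.
    The first pass scores its whole window; each later pass slides by
    `stride` and scores only the new tokens, keeping the rest as context."""
    spans = [(0, min(window, n_tokens))]
    end = spans[0][1]
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((end, new_end))
        end = new_end
    return spans

spans = eval_spans(5000, window=2048, stride=64)
```

Every token is scored exactly once, and all tokens beyond the first window see nearly a full window of left context.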
Test-Time Training
score-first legal TTT
parameters: {"optimizer":"AdamW","chunk_size":131072,"epochs":3,"freeze_blocks":8,"learning_rate":0.0001,"weight_decay":0.01,"momentum":0.9}
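"Score-first" keeps the TTT legal: each chunk is scored with the current weights before the model trains on it, so no chunk influences its own score. A toy sketch with a minimal AdamW on a dict of arrays, using the PR's lr/weight_decay/beta1/epochs; the real run uses 131072-token chunks and freezes the first 8 transformer blocks, neither of which is modeled here:

```python
import numpy as np

LR, WD, B1, B2, EPOCHS = 1e-4, 0.01, 0.9, 0.999, 3  # lr, wd, beta1, epochs per PR

def score_first_ttt(chunks, params, loss_and_grad):
    """Score each chunk, THEN adapt on it with AdamW for EPOCHS passes."""
    m = {k: np.zeros_like(v) for k, v in params.items()}
    v = {k: np.zeros_like(x) for k, x in params.items()}
    t, scores = 0, []
    for chunk in chunks:
        loss, _ = loss_and_grad(params, chunk)
        scores.append(loss)                       # score first ...
        for _ in range(EPOCHS):                   # ... then train on the chunk
            t += 1
            _, g = loss_and_grad(params, chunk)
            for k in params:
                m[k] = B1 * m[k] + (1 - B1) * g[k]
                v[k] = B2 * v[k] + (1 - B2) * g[k] ** 2
                mhat, vhat = m[k] / (1 - B1 ** t), v[k] / (1 - B2 ** t)
                # AdamW: weight decay decoupled from the gradient moments
                params[k] -= LR * (mhat / (np.sqrt(vhat) + 1e-8) + WD * params[k])
    return scores

# Toy objective: pull params['w'] toward each chunk's mean.
def loss_and_grad(params, chunk):
    err = params['w'] - chunk.mean()
    return float(err ** 2), {'w': np.array(2 * err)}

chunks = [np.full(8, 1.0)] * 20
scores = score_first_ttt(chunks, {'w': np.array(0.0)}, loss_and_grad)
```

The first chunk is scored by the untouched model; every later chunk benefits from adaptation to the preceding ones, which is where the bpb gain comes from.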
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Regularization
layerwise LN scale
parameters: null
Novel Contributions
- 11-layer 512-dimensional transformer with XSA enabled on all layers
- BigramHash with 3072 buckets and 112-dimensional representation
- Parameter Banking with Parallel Muon for matrix weights
- Score-first legal test-time training using AdamW
- Int6 + lzma export to fit within the 16MB artifact limit