PR #841
Add 11L XSA11 + BigramHash3072 + AdamW Legal TTT submission (open)
by someone114514
val_bpb: 1.1157
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,983,339 bytes
Training Techniques
Architecture
XSA
XSA enabled on all 11 transformer layers
parameters: {"layers":11}
BigramHash
BigramHash token representation with hashed buckets and learned dimension
parameters: {"buckets":3072,"dim":112}
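A minimal sketch of a hashed-bigram feature with 3072 buckets and a learned 112-dimensional representation. The hash constant, the padding of position 0, and how the feature is combined with the token embedding are assumptions; the submission states only the bucket count and dimension.

```python
import numpy as np

BUCKETS, DIM = 3072, 112  # from the submission's parameters

def bigram_buckets(tokens, buckets=BUCKETS):
    """Hash each (prev, cur) token pair into one of `buckets` bins.
    The multiplier is an arbitrary odd constant, not the real hash."""
    prev = np.concatenate(([0], tokens[:-1]))  # pad position 0 with token 0
    return (prev * 1000003 + tokens) % buckets

# Hypothetical learned table: one row per bucket.
table = np.random.default_rng(0).normal(size=(BUCKETS, DIM))

tokens = np.array([5, 17, 5, 17])
ids = bigram_buckets(tokens)
bigram_emb = table[ids]  # (seq_len, DIM) extra features per position
```

Identical bigrams land in the same bucket, so repeated pairs share one learned vector.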
tied embeddings
Input and output embeddings are tied
parameters: null
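Tied embeddings reuse one matrix for both the input lookup and the output projection, which halves the vocabulary-side parameter count. A small sketch (the vocabulary size here is illustrative; 512 is the model width stated under Novel Contributions):

```python
import numpy as np

vocab, dim = 1000, 512  # vocab size is a placeholder; dim matches the PR
E = np.random.default_rng(0).normal(size=(vocab, dim)) * 0.02

tokens = np.array([3, 14, 159])
h = E[tokens]       # input: embedding lookup
logits = h @ E.T    # output: the same matrix, transposed, produces logits
```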
Partial RoPE
Uses partial rotary positional embeddings
parameters: null
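Partial RoPE rotates only a leading fraction of each head's dimensions and passes the rest through unrotated. A sketch assuming a 50% rotated fraction and the usual base of 10000 (the submission does not state either value):

```python
import numpy as np

def partial_rope(x, rot_frac=0.5, base=10000.0):
    """Apply rotary embeddings to the first `rot_frac` of the last dim.
    x: (seq_len, head_dim). The rotated fraction is an assumption."""
    T, D = x.shape
    r = int(D * rot_frac) // 2 * 2          # rotated dims, forced even
    xr, xp = x[:, :r], x[:, r:]             # rotated part, pass-through part
    half = r // 2
    freqs = base ** (-np.arange(half) / half)
    ang = np.arange(T)[:, None] * freqs[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = xr[:, :half], xr[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, xp], axis=1)

x = np.ones((8, 64))
y = partial_rope(x)
```

Position 0 is a rotation by angle zero, so it comes back unchanged, and the tail dimensions are never touched.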
KV head count
Uses 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
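With 8 query heads and 4 KV heads this is grouped-query attention: each KV head serves two query heads, halving the KV cache. A sketch of the score computation; head_dim = 64 follows from the 512-dim model noted under Novel Contributions:

```python
import numpy as np

HEADS, KV_HEADS, HEAD_DIM = 8, 4, 64  # 512 / 8 heads = 64 per head

def gqa_scores(q, k):
    """q: (HEADS, T, HEAD_DIM), k: (KV_HEADS, T, HEAD_DIM).
    Each KV head is shared by HEADS // KV_HEADS query heads."""
    group = HEADS // KV_HEADS                # 2 query heads per KV head
    k_rep = np.repeat(k, group, axis=0)      # expand to (HEADS, T, HEAD_DIM)
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)

rng = np.random.default_rng(1)
q = rng.normal(size=(HEADS, 16, HEAD_DIM))
k = rng.normal(size=(KV_HEADS, 16, HEAD_DIM))
scores = gqa_scores(q, k)
```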
MLP3x
Three-layer MLP with LeakyReLU activations
parameters: {"mlp_layers":3}
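A sketch of the three-layer MLP block with LeakyReLU between layers. Only mlp_layers=3 is stated; the hidden width, the negative slope, and the absence of biases are assumptions here:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def mlp3x(x, w1, w2, w3):
    """Three matmuls with LeakyReLU between them (no biases, for brevity)."""
    h = leaky_relu(x @ w1)
    h = leaky_relu(h @ w2)
    return h @ w3

rng = np.random.default_rng(0)
d, hidden = 512, 1024          # hidden width is a guess
w1 = rng.normal(size=(d, hidden)) * 0.02
w2 = rng.normal(size=(hidden, hidden)) * 0.02
w3 = rng.normal(size=(hidden, d)) * 0.02
out = mlp3x(rng.normal(size=(4, d)), w1, w2, w3)
```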
Optimizer
Parallel Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.025}
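Muon updates matrix parameters with an approximately orthogonalized momentum buffer. A sketch of one step using the quintic Newton-Schulz iteration from public Muon implementations; how "Parallel" Muon shards this work is not stated in the PR and is not modeled here:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G. Coefficients are those used in
    public Muon code; convergence is approximate, not exact."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                              # keep the Gram matrix small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(w, grad, buf, lr=0.025, momentum=0.99):
    """One Muon update; matrix_lr=0.025 and momentum=0.99 are from the PR."""
    buf = momentum * buf + grad
    return w - lr * newton_schulz(buf), buf

rng = np.random.default_rng(3)
G = rng.normal(size=(4, 8))
O = newton_schulz(G)
s = np.linalg.svd(O, compute_uv=False)       # singular values pushed toward 1
w, g = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
w2, buf2 = muon_step(w, g, np.zeros_like(w))
```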
AdamW
weight_decay: 0.01
momentum: null
other_params: {"learning_rate":0.0001,"scope":"embeddings/scalars"}
Weight Averaging
EMA + SWA
parameters: {"swa":"tight","ema":true}
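A sketch of combining the two averages over a checkpoint sequence. The decay and the window size are assumptions; "tight" is read here as a small trailing SWA window, which the PR does not spell out:

```python
import numpy as np

def run_averages(checkpoints, ema_decay=0.99, swa_last=3):
    """EMA over the whole run plus SWA over the last `swa_last` checkpoints.
    Both hyperparameters are illustrative, not from the PR."""
    ema = checkpoints[0].copy()
    for w in checkpoints[1:]:
        ema = ema_decay * ema + (1 - ema_decay) * w
    swa = np.mean(checkpoints[-swa_last:], axis=0)
    return ema, swa

# Toy "checkpoints": constant vectors 0, 1, ..., 9.
ckpts = [np.full(4, float(i)) for i in range(10)]
ema, swa = run_averages(ckpts)
```

With a high decay the EMA stays close to early weights, while the tight SWA tracks the end of the run; the exported weights would pick one (or blend them).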
Quantization
int6
bits: 6
scope: final artifact export
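A sketch of symmetric per-tensor int6 quantization for the export step: 6 bits give integer levels in [-31, 31]. Grouping, zero points, and which tensors are quantized are not stated and are not modeled here:

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor quantization to 6-bit levels [-31, 31]."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.linspace(-1.0, 1.0, 9).astype(np.float32)
q, scale = quantize_int6(w)
w_hat = dequantize(q, scale)
```

Round-to-nearest bounds the per-weight error by half a quantization step (scale / 2).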
Compression
lzma
level: null
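The quantized artifact is then lzma-compressed for export. A sketch with Python's standard `lzma` module; storing each int6 value in a full int8 byte is a simplification here (real packing is not stated), with lzma squeezing out the slack bits:

```python
import lzma
import numpy as np

# Stand-in for the quantized weights: int6 values stored one per byte.
weights = np.random.default_rng(0).integers(-31, 32, size=100_000, dtype=np.int8)
raw = weights.tobytes()
blob = lzma.compress(raw, preset=9)   # preset is a guess; the PR gives level: null
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8)
```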
Evaluation
sliding window eval
parameters: {"stride":64}
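With a stride of 64 against a 2048-token window, each forward pass after the first scores only the newest 64 tokens while the rest of the window serves as context. A sketch of the span bookkeeping (window=2048 is assumed to match the training length; only stride=64 is stated):

```python
def eval_spans(n_tokens, window=2048, stride=64):
    """Return (start, end) ranges of scored tokens per forward pass.
    The first pass scores its whole window; each later pass slides by
    `stride` and scores only the new tokens, keeping the rest as context."""
    spans = [(0, min(window, n_tokens))]
    end = spans[0][1]
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((end, new_end))
        end = new_end
    return spans

spans = eval_spans(5000, window=2048, stride=64)
```

Every token is scored exactly once, and all tokens beyond the first window see nearly a full window of left context.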
Test-Time Training
score-first legal TTT
parameters: {"optimizer":"AdamW","chunk_size":131072,"epochs":3,"freeze_blocks":8,"learning_rate":0.0001,"weight_decay":0.01,"momentum":0.9}
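"Score-first" keeps the TTT legal: each chunk is scored with the current weights before the model trains on it, so no chunk influences its own score. A toy sketch with a minimal AdamW on a dict of arrays, using the PR's lr/weight_decay/beta1/epochs; the real run uses 131072-token chunks and freezes the first 8 transformer blocks, neither of which is modeled here:

```python
import numpy as np

LR, WD, B1, B2, EPOCHS = 1e-4, 0.01, 0.9, 0.999, 3  # lr, wd, beta1, epochs per PR

def score_first_ttt(chunks, params, loss_and_grad):
    """Score each chunk, THEN adapt on it with AdamW for EPOCHS passes."""
    m = {k: np.zeros_like(v) for k, v in params.items()}
    v = {k: np.zeros_like(x) for k, x in params.items()}
    t, scores = 0, []
    for chunk in chunks:
        loss, _ = loss_and_grad(params, chunk)
        scores.append(loss)                       # score first ...
        for _ in range(EPOCHS):                   # ... then train on the chunk
            t += 1
            _, g = loss_and_grad(params, chunk)
            for k in params:
                m[k] = B1 * m[k] + (1 - B1) * g[k]
                v[k] = B2 * v[k] + (1 - B2) * g[k] ** 2
                mhat, vhat = m[k] / (1 - B1 ** t), v[k] / (1 - B2 ** t)
                # AdamW: weight decay decoupled from the gradient moments
                params[k] -= LR * (mhat / (np.sqrt(vhat) + 1e-8) + WD * params[k])
    return scores

# Toy objective: pull params['w'] toward each chunk's mean.
def loss_and_grad(params, chunk):
    err = params['w'] - chunk.mean()
    return float(err ** 2), {'w': np.array(2 * err)}

chunks = [np.full(8, 1.0)] * 20
scores = score_first_ttt(chunks, {'w': np.array(0.0)}, loss_and_grad)
```

The first chunk is scored by the untouched model; every later chunk benefits from adaptation to the preceding ones, which is where the bpb gain comes from.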
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Regularization
layerwise LN scale
parameters: null
Novel Contributions
- 11-layer 512-dimensional transformer with XSA enabled on all layers
- BigramHash with 3072 buckets and 112-dimensional representation
- Parameter Banking with Parallel Muon for matrix weights
- Score-first legal test-time training using AdamW
- Int6 + lzma export to fit within the 16MB artifact limit