PR #1182
open[track_10min_16mb] XSA7 + BigramHash + ValueResidual + Legal TTT — val_bpb=1.1227
by adityakm24 (View on GitHub)
val_bpb
1.1227
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15,944,685 bytes
Training Techniques
Architecture
XSA
Cross-sequence attention applied to the last 7 layers.
parameters: {"layers":7}
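A speculative sketch of what cross-sequence attention could look like: queries attend over keys and values pooled from every sequence in the batch rather than only their own. The submission's actual XSA definition is not shown; this illustrates only the cross-sequence pooling idea, with causal masking omitted.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def xsa(q, k, v):
    """Every query attends over keys/values flattened across the batch.

    Illustrative only: no causal mask, single head, no projections.
    """
    B, T, d = q.shape
    k_all = k.reshape(B * T, d)   # keys from all sequences in the batch
    v_all = v.reshape(B * T, d)
    att = softmax(q.reshape(B * T, d) @ k_all.T / np.sqrt(d))
    return (att @ v_all).reshape(B, T, d)

rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((2, 4, 8))
out = xsa(q, k, v)
```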
BigramHash
Bigram hash embeddings for token context augmentation.
parameters: {"buckets":2048,"dimensions":96}
TrigramHash
Trigram hash embeddings for token context augmentation.
parameters: {"buckets":1024,"dimensions":128}
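A minimal sketch of bucketed n-gram hash embeddings covering both entries above: the trailing n-gram at each position is hashed into a fixed number of buckets, and the bucket id indexes an embedding table (2048x96 for bigrams, 1024x128 for trigrams per the card). The hash function here is an assumption; the submission's scheme is not shown.

```python
def ngram_bucket(tokens, n, buckets, seed=0x9E3779B1):
    """Hash the trailing n-gram ending at each position into a bucket id.

    Simple rolling hash (assumed); at the start of the sequence the
    n-gram is truncated to the available context.
    """
    ids = []
    for i in range(len(tokens)):
        h = seed
        for t in tokens[max(0, i - n + 1): i + 1]:
            h = (h * 1000003 ^ t) & 0xFFFFFFFF
        ids.append(h % buckets)
    return ids

tokens = [17, 42, 42, 7]
bigram_ids = ngram_bucket(tokens, n=2, buckets=2048)   # rows of a 2048 x 96 table
trigram_ids = ngram_bucket(tokens, n=3, buckets=1024)  # rows of a 1024 x 128 table
```

The looked-up vectors would then be added to (or concatenated with) the ordinary token embeddings, giving the model cheap local-context features.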
Value Residual
ResFormer-style value residual connections.
parameters: null
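A minimal sketch of a ResFormer-style value residual: each deeper layer's value tensor is mixed with the first layer's values before attention. The fixed scalar mixing weight is an assumption; implementations often learn it per layer.

```python
import numpy as np

def value_residual(v_layer, v_first, lam=0.5):
    """Mix a deeper layer's values with the first layer's values."""
    return lam * v_layer + (1.0 - lam) * v_first

rng = np.random.default_rng(0)
v1 = rng.standard_normal((4, 8))   # values from layer 1
v5 = rng.standard_normal((4, 8))   # values at a deeper layer
v_mixed = value_residual(v5, v1)
```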
VE128
Token identity reinjection via value embedding at selected layers.
parameters: {"layers":[5,9,10],"dimensions":128}
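A sketch of value-embedding reinjection: a small 128-dimensional per-token table (per the card) is looked up by token id and added into the value stream at the selected layers (5, 9, 10), re-exposing token identity deep in the network. The projection back to model width and all shapes are assumptions.

```python
import numpy as np

def reinject(values, token_ids, ve_table, proj):
    """Add a projected per-token value embedding to the value stream."""
    ve = ve_table[token_ids]      # (T, 128) token-identity vectors
    return values + ve @ proj     # (T, d_model)

rng = np.random.default_rng(0)
ve_table = rng.standard_normal((256, 128)) * 0.02  # toy vocab of 256 (assumed)
proj = rng.standard_normal((128, 64)) * 0.02       # 128 -> d_model=64 (assumed)
v = np.zeros((4, 64))
out = reinject(v, np.array([1, 2, 3, 4]), ve_table, proj)
```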
LeakyReLU
LeakyReLU squared activation in the MLP.
parameters: {"squared":true,"negative_slope":0.5}
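A direct reading of the parameters above (squared=true, negative_slope=0.5) as a scalar activation: LeakyReLU followed by squaring. Note that squaring makes the negative branch non-negative; whether the submission preserves the sign there is not shown.

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU with slope 0.5 on the negative side, then squared."""
    y = x if x >= 0 else negative_slope * x
    return y * y

pos = leaky_relu_sq(2.0)    # 4.0
neg = leaky_relu_sq(-2.0)   # (-1.0)**2 = 1.0
```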
GQA
Grouped query attention with 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
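The head sharing above (8 query heads, 4 KV heads) means each KV head serves 2 query heads. A minimal sketch of the KV-expansion step, with illustrative shapes:

```python
import numpy as np

def expand_kv(kv, n_q_heads):
    """Repeat each KV head so it lines up with its group of query heads."""
    n_kv, T, d = kv.shape
    assert n_q_heads % n_kv == 0
    return np.repeat(kv, n_q_heads // n_kv, axis=0)  # (n_q_heads, T, d)

rng = np.random.default_rng(0)
k = rng.standard_normal((4, 16, 32))  # 4 KV heads, 16 positions, head dim 32
k_exp = expand_kv(k, n_q_heads=8)     # now one K tensor per query head
```

Storing 4 KV heads instead of 8 halves the KV cache at essentially no quality cost at this scale.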
Partial RoPE
Rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":16}
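A sketch of partial RoPE: rotary embeddings are applied only to the first 16 dimensions of each head (per the card), and the remaining dimensions pass through unrotated. The frequency recipe follows standard RoPE; the pairing convention is an assumption.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` dims of each position's vector."""
    T, d = x.shape
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)      # standard RoPE frequencies
    ang = np.arange(T)[:, None] * freqs[None, :]   # (T, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

x = np.ones((4, 64))
y = partial_rope(x)   # dims 16..63 are untouched
```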
Quantization
late QAT
bits: 6
scope: all
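A sketch of the rounding a late QAT phase would expose weights to: symmetric fake-quantization to 6 bits (31 positive levels). The per-tensor max-abs scale is an assumption; the submission's exact scheme is not shown.

```python
import numpy as np

def fake_quant(w, bits=6):
    """Round weights to a signed `bits`-bit grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                     # 31 for int6
    scale = np.abs(w).max() / qmax                 # per-tensor scale (assumed)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, q.astype(np.int8)

w = np.linspace(-1.0, 1.0, 7)
w_dq, q = fake_quant(w)   # w_dq is what the forward pass sees during QAT
```

Running the last stretch of training through this rounding lets the network adapt to int6 precision before the weights are packed for the artifact.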
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz_steps":5}
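At the heart of Muon is a Newton-Schulz iteration (5 steps per the card) that approximately orthogonalizes each matrix-shaped gradient before the update. The quintic coefficients below are the ones popularized by the Muon optimizer; the "parallel" sharding of these per-parameter iterations across devices is not shown here.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # normalize so singular values are < 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 8))
O = newton_schulz(G)   # singular values pushed toward 1
```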
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"non-matrix params"}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"interval":50}
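A sketch of the two averaging schemes above with scalars standing in for parameter tensors: an exponential moving average with decay 0.997, and stochastic weight averaging that folds in a snapshot every 50 steps. How the submission combines the two is not shown.

```python
def ema_update(avg, w, decay=0.997):
    """One EMA step: keep 99.7% of the running average."""
    return decay * avg + (1.0 - decay) * w

class SWA:
    """Running mean of snapshots taken every `interval` steps."""
    def __init__(self, interval=50):
        self.interval, self.n, self.avg = interval, 0, None
    def maybe_update(self, step, w):
        if step % self.interval == 0:
            self.n += 1
            self.avg = w if self.avg is None else self.avg + (w - self.avg) / self.n

ema = ema_update(1.0, 0.0)        # 0.997
swa = SWA()
swa.maybe_update(50, 2.0)
swa.maybe_update(75, 99.0)        # off-interval step: ignored
swa.maybe_update(100, 4.0)        # running mean of 2.0 and 4.0 -> 3.0
```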
Compression
lzma
level: null
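A sketch of the final packaging step: LZMA-compressing a serialized weight blob so the artifact fits under the 16MB cap. The dict of byte strings standing in for quantized weights, and the pickle serialization, are assumptions.

```python
import lzma
import pickle

# Stand-ins for quantized weight buffers (highly compressible here).
weights = {"wte": bytes(1024), "lm_head": bytes(2048)}

raw = pickle.dumps(weights)
packed = lzma.compress(raw, preset=9)          # what gets shipped
restored = pickle.loads(lzma.decompress(packed))
```

Low-bit quantization (int6 above) pays off twice: fewer bytes before compression, and more redundancy for LZMA to exploit.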
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
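A sketch of the window bookkeeping for sliding-window evaluation with the card's numbers (seq_len=2048, stride=64): windows advance by 64 tokens, each new window scores only the tokens not yet scored, so every token is predicted with close to the full 2048-token left context.

```python
def sliding_windows(n_tokens, seq_len=2048, stride=64):
    """Yield (begin, end, first_scored) spans covering the sequence.

    Tokens [first_scored, end) contribute to the loss for each window,
    so every token is scored exactly once.
    """
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(4096)
```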
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":4,"freeze_blocks":0}
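A toy sketch of the "score-first" ordering that keeps test-time training legal: each evaluation chunk is scored with the current weights before the model takes any gradient steps on it, so no token is predicted after being trained on. The single-scalar "model" below is invented; only learning_rate and epochs follow the card.

```python
def score_first_ttt(chunks, lr=0.002, epochs=4):
    """Score each chunk with frozen weights, then adapt on it."""
    w, losses = 0.0, []
    for chunk in chunks:
        # 1) score first, with the weights as they stand (the legal part)
        losses.append(sum((x - w) ** 2 for x in chunk) / len(chunk))
        # 2) only then take gradient steps on the chunk just scored
        for _ in range(epochs):
            grad = sum(2 * (w - x) for x in chunk) / len(chunk)
            w -= lr * grad
    return w, losses

w, losses = score_first_ttt([[1.0] * 8, [1.0] * 8])
```

The second chunk's loss is lower than the first's: the model adapted to the test distribution, but only using tokens it had already been scored on.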
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Novel Contributions
- 11-layer parameter-banking GPT with XSA on the last 7 layers
- BigramHash and TrigramHash n-gram hash embeddings
- Value Residual connections and value embedding reinjection
- LeakyReLU squared MLP activation
- Legal score-first test-time training
- int6 quantization with LZMA compression under the 16MB cap
- Parallel Muon optimization with parameter banking