PR #1118
open
Submission: 11L XSA4 + TrigramHash + ValueResidual + Legal TTT (val_bpb=1.1187)
by adityakm24
val_bpb
1.1187
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15,985,833 bytes
Training Techniques
Architecture
XSA
XSA applied to the last 4 layers
parameters: {"layers":4}
TrigramHash
Trigram hash embedding used in the model
parameters: {"dimensions":1024}
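A minimal sketch of what a trigram hash embedding does: each 3-token window is hashed into a fixed-size table of learned vectors, and the looked-up row augments the usual token embedding. The 1024-dim width comes from the submission's parameters; the table size, mixing constants, and padding convention are illustrative assumptions, not the author's values.

```python
# Trigram hash embedding sketch. DIM=1024 follows the submission's
# {"dimensions":1024}; TABLE_SIZE and the hash mixing are assumptions.
import numpy as np

TABLE_SIZE = 1 << 16   # assumed number of hash buckets
DIM = 1024             # from the submission parameters

rng = np.random.default_rng(0)
trigram_table = rng.standard_normal((TABLE_SIZE, DIM)).astype(np.float32)

def trigram_bucket(t0: int, t1: int, t2: int) -> int:
    """Mix three token ids into one bucket index (illustrative hash)."""
    h = (t0 * 0x9E3779B1 ^ t1 * 0x85EBCA77 ^ t2 * 0xC2B2AE3D) & 0xFFFFFFFF
    return h % TABLE_SIZE

def trigram_embed(tokens: list[int]) -> np.ndarray:
    """One DIM-vector per position; the first two positions see pad id 0."""
    padded = [0, 0] + tokens
    return np.stack([
        trigram_table[trigram_bucket(padded[i], padded[i + 1], padded[i + 2])]
        for i in range(len(tokens))
    ])

emb = trigram_embed([5, 17, 42, 42, 7])
```

In a real model the table rows are trained parameters and the result is added to (or concatenated with) the token embedding; hash collisions are tolerated as in other hashed-feature schemes.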
BigramHash
Bigram hash embedding used in the model
parameters: {"dimensions":1536}
SmearGate
SmearGate used in the model
parameters: null
Partial RoPE
Partial rotary positional embeddings
parameters: {"dimensions":16}
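Partial RoPE rotates only a slice of each head's dimensions and passes the rest through unchanged. A sketch with the submission's 16 rotated dims; the head size and base frequency here are standard-but-assumed values:

```python
# Partial RoPE sketch: rotate the first ROT dims of each head, leave the
# rest untouched. ROT=16 is from the submission; HEAD_DIM and BASE are
# illustrative assumptions.
import numpy as np

ROT = 16          # rotated dims per head, from the submission
HEAD_DIM = 64     # assumed head size
BASE = 10000.0    # conventional RoPE base, assumed

def partial_rope(x: np.ndarray, pos: np.ndarray) -> np.ndarray:
    """x: (seq, HEAD_DIM); apply pairwise rotation to x[:, :ROT] only."""
    half = ROT // 2
    freqs = BASE ** (-np.arange(half) / half)      # (half,)
    ang = pos[:, None] * freqs[None, :]            # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:ROT]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, ROT:]], axis=-1)

seq = 8
x = np.random.default_rng(1).standard_normal((seq, HEAD_DIM))
y = partial_rope(x, np.arange(seq))
```

The unrotated dims carry position-independent content, which is the usual motivation for rotating only part of the head.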
GQA
Grouped query attention with fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
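With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads. A sketch of that sharing (head dim and sequence length are illustrative; the causal mask is omitted for brevity):

```python
# GQA sketch with the submission's head counts: 8 query heads, 4 KV heads,
# so each KV head serves a group of 2 query heads. D is assumed.
import numpy as np

H, KV, D = 8, 4, 32          # heads / kv_heads from the submission; D assumed
GROUP = H // KV              # query heads per KV head -> 2

def gqa(q, k, v):
    """q: (H, T, D); k, v: (KV, T, D). Repeat each KV head for its group."""
    k = np.repeat(k, GROUP, axis=0)              # (H, T, D)
    v = np.repeat(v, GROUP, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(D)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                # softmax over keys
    return w @ v                                  # (H, T, D)

rng = np.random.default_rng(2)
T = 5
out = gqa(rng.standard_normal((H, T, D)),
          rng.standard_normal((KV, T, D)),
          rng.standard_normal((KV, T, D)))
```

The payoff is a KV cache (and KV projection parameter count) halved relative to full multi-head attention at these settings.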
ValueResidual
Value residual connection used in the model
parameters: null
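A value residual connection lets later layers blend their value projections with the first layer's values. The submission does not spell out its exact variant, so the fixed-gate blend below is an illustrative assumption (published variants use a learned, per-layer gate):

```python
# Value residual sketch: blend a later layer's values with layer 1's values.
# The 0.5 gate is an assumption; real implementations typically learn it.
import numpy as np

def value_residual(v_layer: np.ndarray, v_first: np.ndarray,
                   lam: float = 0.5) -> np.ndarray:
    """Mix this layer's value projection with the first layer's."""
    return lam * v_layer + (1.0 - lam) * v_first

rng = np.random.default_rng(3)
v1 = rng.standard_normal((4, 16))   # first layer's values
v5 = rng.standard_normal((4, 16))   # some later layer's values
v = value_residual(v5, v1)
```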
ValueEmbedding
Value embedding used in the model
parameters: null
Quantization
late QAT
bits: null
scope: artifact
Weight Averaging
EMA
parameters: null
SWA
parameters: null
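The two averaging schemes listed differ in weighting: EMA is an exponentially decayed running average updated every step, while SWA is a uniform mean over collected checkpoints. A sketch (the decay value is an assumption; the submission reports none):

```python
# Weight-averaging sketch: EMA (exponential decay, assumed 0.999) vs. SWA
# (uniform running mean over checkpoints).
import numpy as np

def ema_update(avg, w, decay=0.999):
    return decay * avg + (1 - decay) * w

def swa_update(avg, w, n):
    """Running uniform mean after n checkpoints have been averaged."""
    return (avg * n + w) / (n + 1)

w = np.ones(3)          # stand-in for current weights
ema = np.zeros(3)
for _ in range(5):
    ema = ema_update(ema, w)
swa = np.zeros(3)
for i in range(4):
    swa = swa_update(swa, w, i)
```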
Evaluation
sliding window eval
parameters: {"stride":64}
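Sliding-window evaluation with stride 64 advances the context window 64 tokens at a time, scoring each token exactly once in the window where it has the most left context. A sketch of the span bookkeeping (the window length is an assumption; only the stride is given):

```python
# Sliding-window eval sketch. stride=64 is from the submission; the
# window length (256 here) is an illustrative assumption.
def sliding_spans(n_tokens: int, window: int, stride: int):
    """Return (begin, end, score_from): tokens [score_from, end) are scored
    with context [begin, end); every token is scored exactly once."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_spans(300, 256, 64)
```

A smaller stride raises the average context per scored token (lowering bpb) at the cost of proportionally more forward passes.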
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.0025,"epochs":6,"freeze_blocks":0}
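"Score-first" TTT is what makes the scheme legal: each evaluation chunk is scored with the current weights before the model takes any gradient steps on it, so no token's score benefits from training on that same token. The tiny linear model and squared loss below are illustrative; only the learning rate and epoch count follow the submission's parameters:

```python
# Score-first TTT sketch: score each chunk, then adapt on it.
# LR and EPOCHS are from the submission's parameters; the 1-d linear
# "model" y ~ w*x is an illustrative stand-in for the LM.
import numpy as np

LR, EPOCHS = 0.0025, 6

def ttt_eval(chunks, w):
    """chunks: list of (x, y) arrays. Returns per-chunk losses and final w."""
    losses = []
    for x, y in chunks:
        losses.append(float(np.mean((w * x - y) ** 2)))   # score first...
        for _ in range(EPOCHS):                           # ...then train
            grad = np.mean(2 * (w * x - y) * x)
            w -= LR * grad
    return losses, w

rng = np.random.default_rng(4)
chunks = [(rng.standard_normal(8), rng.standard_normal(8)) for _ in range(3)]
losses, w_final = ttt_eval(chunks, w=0.0)
```

With `freeze_blocks: 0`, the submission adapts all blocks rather than freezing any prefix of the network.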
Compression
lzma
level: null
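The packaging step amounts to compressing the quantized weight bytes with stdlib `lzma` and checking the result against the 16 MB budget. A sketch with a stand-in payload (the submission's exact preset/filters are not reported):

```python
# Artifact packaging sketch: lzma-compress the (quantized) weight bytes
# and check the 16 MB limit. The payload and preset are illustrative.
import lzma
import numpy as np

LIMIT = 16 * 1024 * 1024  # 16 MB artifact budget

payload = np.zeros(1 << 20, dtype=np.uint8).tobytes()  # stand-in for weights
blob = lzma.compress(payload, preset=9)
ok = len(blob) <= LIMIT
```

Low-bit quantization (the int6 mentioned under Novel Contributions) helps here twice: fewer raw bytes, and more repetitive byte patterns for lzma to exploit.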
Sequence Length
sequence_length
train_length: 9000
eval_length: null
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"adamw":true}
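Muon accumulates momentum per weight matrix and approximately orthogonalizes the update with a Newton-Schulz iteration before applying it, while AdamW handles non-matrix parameters (consistent with `{"adamw": true}` above). A sketch using the commonly published quintic coefficients; the learning rate, momentum, and the "Parallel" sharding aspect are assumptions or out of scope here:

```python
# Muon-style step sketch: momentum -> Newton-Schulz orthogonalization ->
# update. Coefficients are the commonly published Muon quintic; lr and
# momentum values are illustrative assumptions.
import numpy as np

def newton_schulz(G, steps=5):
    a, b, c = 3.4445, -4.7750, 2.0315   # published quintic coefficients
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius-normalize first
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(w, grad, buf, lr=0.02, momentum=0.95):
    buf = momentum * buf + grad
    return w - lr * newton_schulz(buf), buf

rng = np.random.default_rng(5)
G = rng.standard_normal((4, 4))
X = newton_schulz(G)
w_new, buf = muon_step(np.zeros((4, 4)), G, np.zeros((4, 4)))
```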
Novel Contributions
- 11-layer Transformer with GQA, XSA on the last 4 layers, Partial RoPE, SmearGate, BigramHash, TrigramHash, ValueEmbedding, and ValueResidual
- Parallel Muon + AdamW optimization with EMA and SWA
- Late QAT and int6+lzma artifact compression to fit under the 16MB limit
- Sliding-window evaluation combined with legal score-first TTT
- Achieved val_bpb=1.11868501 with total artifact size 15,985,833 bytes