val_bpb: 1.1187
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15,985,833 bytes
Training Techniques
Architecture
- GQA: grouped-query attention (num_heads: 8, num_kv_heads: 4)
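The grouped-query attention above can be sketched as follows. This is a minimal illustration, not the submission's implementation: head dimension, sequence length, and the causal-mask details are assumptions; only num_heads=8 and num_kv_heads=4 come from the card.

```python
import numpy as np

NUM_HEADS = 8      # query heads (from the card)
NUM_KV_HEADS = 4   # key/value heads (from the card)
HEAD_DIM = 32      # assumed head dimension

def gqa(q, k, v):
    """q: (num_heads, T, d); k, v: (num_kv_heads, T, d).
    Each group of num_heads // num_kv_heads query heads shares one KV head."""
    group = q.shape[0] // k.shape[0]
    # Repeat each KV head so every query head in a group sees the same K/V.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    # Causal mask: position t may only attend to positions <= t.
    T = q.shape[1]
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
T = 5
q = rng.standard_normal((NUM_HEADS, T, HEAD_DIM))
k = rng.standard_normal((NUM_KV_HEADS, T, HEAD_DIM))
v = rng.standard_normal((NUM_KV_HEADS, T, HEAD_DIM))
out = gqa(q, k, v)
```

Halving the KV heads shrinks the KV cache and the K/V projection parameters while keeping 8 query heads.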
- XSA: applied to the last 4 layers of the model (layers: 4)
- Partial RoPE: rotary positional embeddings applied to a subset of head dimensions (rope_dims: 16)
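Partial RoPE can be sketched as below: only the first rope_dims=16 dimensions of each head are rotated, and the rest pass through unchanged. The head dimension, frequency base, and pairing layout are assumptions; only rope_dims comes from the card.

```python
import numpy as np

ROPE_DIMS = 16  # from the card

def partial_rope(x, base=10000.0):
    """x: (T, head_dim). Rotary position encoding on x[:, :ROPE_DIMS] only."""
    T, _ = x.shape
    half = ROPE_DIMS // 2
    pos = np.arange(T)[:, None]
    freq = base ** (-np.arange(half) / half)
    angles = pos * freq                       # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    # Pair dimension i with dimension i + half and rotate each pair.
    x1, x2 = x[:, :half], x[:, half:ROPE_DIMS]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    # Remaining dimensions carry no positional rotation.
    return np.concatenate([rotated, x[:, ROPE_DIMS:]], axis=1)

x = np.random.default_rng(1).standard_normal((4, 64))
y = partial_rope(x)
```

Rotation preserves the norm of the rotated slice, and position 0 is left unrotated, which the assertions below check.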
- SmearGate: SmearGate component included in the architecture
- BigramHash: bigram hash embeddings (table size: 1536)
- TrigramHash: trigram hash embeddings (table size: 1024)
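The bigram and trigram hash embeddings can be sketched as follows: each n-gram of token ids is hashed into a fixed-size table, and the corresponding table row is added to the token's embedding. Only the table sizes (1536 and 1024) come from the card; the hash function and lookup scheme here are illustrative assumptions.

```python
BIGRAM_SIZE, TRIGRAM_SIZE = 1536, 1024  # table sizes from the card

def ngram_hash(ids, size):
    # Simple multiplicative hash over the n-gram (illustrative, not the
    # submission's actual hash).
    h = 0
    for t in ids:
        h = (h * 1000003 + t) % (2**61 - 1)
    return h % size

def hash_bucket_ids(tokens):
    """For each position, the (bigram, trigram) table rows whose embedding
    vectors would be added to that token's embedding; None where the
    n-gram does not exist yet."""
    out = []
    for i in range(len(tokens)):
        bi = ngram_hash(tokens[i - 1:i + 1], BIGRAM_SIZE) if i >= 1 else None
        tri = ngram_hash(tokens[i - 2:i + 1], TRIGRAM_SIZE) if i >= 2 else None
        out.append((bi, tri))
    return out

buckets = hash_bucket_ids([5, 17, 17, 5, 17])
```

Identical n-grams always map to the same bucket, so repeated contexts share one embedding row; collisions between different n-grams are accepted in exchange for the small fixed table.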
- ValueEmbedding: value embeddings included in the architecture
- ValueResidual: value residual connections included in the architecture
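A common form of value residual connections mixes each later layer's attention values with the first layer's values through a mixing coefficient. The card does not specify the exact form used, so the blend below is a hedged assumption in that style.

```python
def mix_values(v_layer, v_first, lam):
    """Blend current-layer attention values with the first layer's values.
    lam is a (possibly learned) mixing coefficient; the exact scheme used
    by the submission is not specified in the card."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(v_layer, v_first)]

v_first = [1.0] * 8   # stand-in for layer-1 values
v_later = [0.0] * 8   # stand-in for a later layer's values
mixed = mix_values(v_later, v_first, lam=0.25)
```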
Optimizer
- Parallel Muon (adamw: true; weight_decay and momentum not specified)
Weight Averaging
- EMA
- SWA
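The EMA side of the weight averaging can be sketched as below: an exponential moving average of the training weights is maintained alongside the live weights and used for evaluation (SWA would instead keep a uniform running average over checkpoints). The decay value is illustrative; the card gives no parameters.

```python
def ema_update(avg, params, decay=0.99):
    """One EMA step: avg <- decay * avg + (1 - decay) * params.
    decay=0.99 is an assumed value, not from the card."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

avg = [0.0, 0.0]
params = [1.0, 2.0]            # stand-in for the post-step model weights
for _ in range(3):
    avg = ema_update(avg, params)
```

With constant weights, n EMA steps from zero give params * (1 - decay**n), which the assertions verify.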
Quantization
- late QAT (bits not specified; scope: all)
Evaluation
- sliding window eval (stride: 64)
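Sliding-window evaluation with stride 64 can be sketched as follows: overlapping windows slide across the eval sequence so every token is scored with ample left context, and only the final `stride` tokens of each window contribute to the loss. The window length of 256 is an assumption; only the stride comes from the card.

```python
def sliding_windows(n_tokens, window=256, stride=64):
    """Yield (start, end, n_scored) spans: score the last n_scored tokens
    of each window, so each token is scored exactly once."""
    spans = []
    end = min(window, n_tokens)
    spans.append((0, end, end))      # first window scores all tokens it covers
    pos = end
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window) # keep up to `window` tokens of context
        spans.append((start, end, end - pos))
        pos = end
    return spans

spans = sliding_windows(400)
```

Each token after the first window is conditioned on up to 255 preceding tokens rather than being cut off at a hard chunk boundary, at the cost of re-running the model on overlapping context.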
Test-Time Training
- score-first TTT (learning_rate: 0.0025, epochs: 6, freeze_blocks: 0)
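The "score-first" structure can be sketched with a toy model: each eval chunk is scored with the current weights before the model takes gradient steps on it, so the reported loss never benefits from having trained on the tokens being scored. The 1-D least-squares model below is purely illustrative; learning_rate 0.0025 and 6 epochs are taken from the card (freeze_blocks: 0, i.e. no layers are frozen).

```python
def score_first_ttt(chunks, w=0.0, lr=0.0025, epochs=6):
    """chunks: list of (x, y) pairs standing in for eval segments."""
    losses = []
    for x, y in chunks:
        losses.append((w * x - y) ** 2)   # score FIRST, with current weights
        for _ in range(epochs):           # then adapt on the same chunk
            grad = 2 * (w * x - y) * x
            w -= lr * grad
    return losses, w

losses, w = score_first_ttt([(1.0, 1.0), (1.0, 1.0)])
```

The first chunk is scored by the untrained weights, and later chunks benefit from adaptation on earlier ones; that ordering is what makes the scheme evaluation-legal.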
Compression
- lzma (level not specified)
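The int6 + lzma packaging mentioned under Novel Contributions can be sketched as below: four 6-bit values are packed into three bytes, and the packed stream is lzma-compressed into the final artifact. The packing layout (big-endian, 4-per-3-bytes) is an assumption; the card confirms only that int6 quantization and lzma are used.

```python
import lzma

def pack_int6(values):
    """Pack ints in [0, 63] into bytes, 4 values per 3 bytes."""
    assert len(values) % 4 == 0
    out = bytearray()
    for i in range(0, len(values), 4):
        a, b, c, d = values[i:i + 4]
        bits = (a << 18) | (b << 12) | (c << 6) | d
        out += bits.to_bytes(3, "big")
    return bytes(out)

def unpack_int6(data):
    """Inverse of pack_int6."""
    vals = []
    for i in range(0, len(data), 3):
        bits = int.from_bytes(data[i:i + 3], "big")
        vals += [(bits >> 18) & 63, (bits >> 12) & 63, (bits >> 6) & 63, bits & 63]
    return vals

vals = [0, 63, 17, 42] * 8           # stand-in for quantized weight codes
packed = pack_int6(vals)
blob = lzma.compress(packed)         # artifact bytes that count against the cap
```

Bit-packing removes the 2 wasted bits per byte that a naive one-byte-per-value layout would leave, and lzma then compresses the remaining redundancy in the code stream.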
Sequence Length
- train_length and eval_length not specified
Novel Contributions
- Score-first test-time training (legal under the evaluation rules) to improve validation bpb
- Int6 + lzma artifact packaging under the 16 MB submission cap
- Parameter-banking Transformer with GQA, XSA, Partial RoPE, SmearGate, BigramHash, TrigramHash, ValueEmbedding, and ValueResidual
- Parallel Muon + AdamW optimization with EMA and SWA
- Sliding-window evaluation with stride 64