- **val_bpb:** 1.1259
- **Architecture:** Transformer
- **Optimizer:** Muon
- **Artifact Size:** 15,943,528 bytes
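The headline val_bpb metric is bits per byte of the validation text. A minimal sketch of the conversion, assuming the standard definition (summed negative log-likelihood in nats, divided by ln 2 times the byte count of the raw text); the numbers in the example are illustrative, not the submission's:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) over a text
    into bits per byte of the underlying raw bytes."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Illustrative: 1000 bytes scored with a total NLL of 780.5 nats
print(round(bits_per_byte(780.5, 1000), 4))  # → 1.126
```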
Training Techniques
Architecture
- **GQA** — grouped-query attention: 8 query heads share 4 key/value heads (num_heads: 8, num_kv_heads: 4).
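A minimal NumPy sketch of grouped-query attention with the reported head counts (no masking, single batch; the real model's details beyond the head ratio are not specified):

```python
import numpy as np

def gqa_attention(q, k, v, num_heads=8, num_kv_heads=4):
    """Grouped-query attention sketch. q: (num_heads, T, d);
    k, v: (num_kv_heads, T, d). Each group of num_heads // num_kv_heads
    query heads attends to one shared KV head."""
    group = num_heads // num_kv_heads
    k = np.repeat(k, group, axis=0)            # broadcast KV heads to query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax over keys
    return weights @ v                         # (num_heads, T, d)

rng = np.random.default_rng(0)
T, d = 5, 16
out = gqa_attention(rng.normal(size=(8, T, d)),
                    rng.normal(size=(4, T, d)),
                    rng.normal(size=(4, T, d)))
print(out.shape)  # → (8, 5, 16)
```

The KV projections shrink by 2x here, which is where GQA saves parameters and KV-cache memory relative to full multi-head attention.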
- **XSA** — applied to the last 4 layers (layers: 4).
- **Partial RoPE** — rotary position embeddings applied to only 16 dimensions (dimensions: 16).
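A sketch of partial RoPE: rotate only the first 16 channels and pass the rest through unchanged. The half-split pairing convention and "first channels" placement are assumptions; only the 16-dimension count is reported:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first rot_dims channels
    of x (shape (T, d)); the remaining d - rot_dims channels are left
    untouched. Half-split pairing convention is an assumption."""
    T, d = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = np.outer(np.arange(T), inv_freq)      # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]      # the two halves to rotate
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

x = np.ones((4, 64))
y = partial_rope(x)
print(y.shape)                            # → (4, 64)
print(np.allclose(y[:, 16:], x[:, 16:]))  # unrotated tail unchanged → True
```

Leaving most channels position-free is a common way to keep content matching cheap while still giving attention a positional signal.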
- **SmearGate** — used in the architecture (no parameters reported).
- **BigramHash** — bigram hash component (no parameters reported).
- **ValueEmbedding** — value embedding component (no parameters reported).
Optimizer
- **Muon** — combined with AdamW (other_params: adamw = true); weight decay and momentum not reported.
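The core of a Muon update is orthogonalizing the (momentum-accumulated) gradient matrix via a Newton-Schulz iteration; the quintic coefficients below follow the published Muon reference, but treat this as a sketch of the idea, not the submission's optimizer code:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize a matrix, the core step of the Muon
    optimizer. Quintic iteration coefficients from the public Muon
    reference implementation; everything else is a simplified sketch."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius-norm normalization
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) * 1.0 @ X if False else a * X + (b * A + c * A @ A) @ X
    return X

G = np.diag([5.0, 1.0, 0.1, 0.01])       # wildly spread singular values
O = newton_schulz_orthogonalize(G)
s = np.linalg.svd(O, compute_uv=False)
print(s.round(2))                        # all singular values pushed toward 1
```

Replacing raw gradient magnitudes with a near-orthogonal update direction is what lets Muon take uniformly scaled steps across a weight matrix's singular directions.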
Weight Averaging
- **EMA** — exponential moving average of model weights (no parameters reported).
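EMA weight averaging keeps a slow-moving copy of the weights for evaluation. A minimal sketch; the decay of 0.999 is an illustrative value, since the submission's setting is not reported:

```python
def ema_update(avg_params, params, decay=0.999):
    """One EMA step over flat parameter lists. decay=0.999 is an
    illustrative value, not the submission's (unreported) setting."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg_params, params)]

avg = [0.0, 0.0]
for step in range(3):
    live = [1.0, 2.0]              # pretend the live weights stay fixed
    avg = ema_update(avg, live)
print([round(a, 6) for a in avg])  # → [0.002997, 0.005994]
```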
Quantization
- **Late QAT** — quantization-aware training applied late in the run; scope: mixed int6/int8 (per-tensor bit widths not reported).
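QAT typically runs the forward pass through a "fake quantize" of the weights so the network adapts to the integer grid. A generic symmetric per-tensor sketch; which tensors get int6 versus int8 in this submission is not specified:

```python
import numpy as np

def fake_quantize(w, bits=8):
    """QAT-style fake quantization: symmetric per-tensor scaling onto a
    signed integer grid, then immediate dequantization. A generic
    sketch; the submission's mixed int6/int8 assignment is unreported."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for int8, 31 for int6
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

w = np.array([-1.0, -0.5, 0.0, 0.26, 1.0])
print(fake_quantize(w, bits=6))  # values snapped to multiples of 1/31
```

Running QAT only late in training lets most of the run use full-precision dynamics, with a short adaptation phase before the weights are packed into the artifact.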
Evaluation
- **Sliding-window evaluation** — overlapping evaluation windows with a stride of 64 tokens (stride: 64).
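Sliding-window evaluation re-encodes overlapping context but scores each token only once, trading extra compute for near-full context at every position. A sketch of the span bookkeeping; the window length of 256 is illustrative, since only stride=64 is reported:

```python
def sliding_windows(num_tokens, window=256, stride=64):
    """Enumerate (context_start, score_start, score_end) spans for
    sliding-window evaluation: each window sees up to `window` tokens of
    context but scores only the `stride` new tokens, so every token is
    scored exactly once. window=256 is illustrative; only stride=64 is
    reported for this submission."""
    spans = []
    pos = 0
    while pos < num_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, num_tokens)
        spans.append((start, pos, end))
        pos = end
    return spans

spans = sliding_windows(300, window=256, stride=64)
print(spans[:2])  # → [(0, 0, 64), (0, 64, 128)]
scored = sum(e - s for _, s, e in spans)
print(scored)     # → 300  (every token scored exactly once)
```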
Test-Time Training
- **Score-first TTT** — test-time training in which each chunk is scored before the model is updated on it (enabled: 1, learning_rate: 0.002, epochs: 3, chunk_tokens: 32768, freeze_blocks: 0).
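The score-first ordering is what keeps the metric legal: the loss on each chunk is computed with weights that have never been trained on that chunk's text. A skeleton of the loop, with `score_fn`/`update_fn` as assumed stand-ins for the real model calls:

```python
def score_first_ttt(chunks, score_fn, update_fn, epochs=3):
    """Score-first test-time training: evaluate each chunk with the
    current weights *before* any gradient steps on it, so the reported
    loss never sees weights adapted to the text being scored.
    score_fn/update_fn are assumed stand-ins for real model calls."""
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:
        nll = score_fn(chunk)            # evaluate first: no leakage
        total_nll += nll
        total_tokens += len(chunk)
        for _ in range(epochs):          # then adapt on the scored chunk
            update_fn(chunk)
    return total_nll / total_tokens

# Toy stand-ins: the "model" is a per-token loss that updates improve.
state = {"loss_per_tok": 2.0}
def score(chunk):
    return state["loss_per_tok"] * len(chunk)
def update(chunk):
    state["loss_per_tok"] *= 0.9         # pretend one epoch of adaptation

chunks = [[0] * 10, [0] * 10]
result = score_first_ttt(chunks, score, update)
print(round(result, 4))  # → 1.729
```

Later chunks still benefit from adaptation on earlier ones, which is why TTT lowers the aggregate bpb without violating the scoring rule.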
Compression
- **lzma** — artifact compressed with LZMA (compression level not reported).
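Compressing the serialized weights is how the artifact stays under the size cap. A standard-library sketch; the submission's preset level is unreported, and `PRESET_EXTREME` here only illustrates trading compression time for size:

```python
import lzma

# Round-trip a stand-in weight blob through the stdlib LZMA codec.
blob = bytes(range(256)) * 1000                      # stand-in for packed weights
packed = lzma.compress(blob, preset=9 | lzma.PRESET_EXTREME)
assert lzma.decompress(packed) == blob               # lossless round trip
print(len(blob), len(packed))
```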
Sequence Length
- Training sequence length: 32,768 tokens; evaluation length not reported.
Other
- 8xH100 SXM training on Modal with a 600-second wallclock cap (gpus: 8, wallclock_seconds: 600).
Novel Contributions
- Competitive 8xH100 run package with full submission artifacts
- Best legal score-first TTT result: an exact metric of 1.12587738 bpb
- Mixed int6/int8 quantized artifact kept under the 16 MB submission cap
- GQA, XSA, Partial RoPE, SmearGate, BigramHash, and ValueEmbedding combined in one architecture
- EMA weight averaging plus a Muon/AdamW training stack with late QAT
- Sliding-window evaluation combined with score-first test-time training