- val_bpb: 1.1744
- Architecture: Transformer
- Optimizer: Muon
- Artifact size: 15,328,877 bytes
Training Techniques

- Architecture: tied embeddings. Uses tied input/output embeddings in a 9-layer Transformer (no additional parameters reported).
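Tied embeddings reuse one weight matrix as both the token-embedding table and (transposed) the output projection, roughly halving embedding parameters. A minimal numpy sketch, with hypothetical shapes; the actual 9-layer Transformer is not reproduced here:

```python
import numpy as np

# Hypothetical sizes for illustration only.
rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 64
E = rng.standard_normal((vocab_size, d_model)) * 0.02  # the single shared matrix

def embed(token_ids):
    # Input side: look up rows of the shared matrix E.
    return E[token_ids]

def logits(hidden):
    # Output side: project with E^T instead of a separate unembedding matrix.
    return hidden @ E.T

h = embed(np.array([1, 2, 3]))   # (3, d_model) embedded tokens
out = logits(h)                  # (3, vocab_size) vocabulary logits
```

Because both directions read the same buffer `E`, any update to the embedding during training (or test-time adaptation) is immediately reflected in the output head.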
- Optimizer: Muon, combined with Adam during pretraining (weight decay and momentum values not reported).
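Muon applies momentum to each weight-matrix gradient and then orthogonalizes the buffered update with Newton-Schulz iterations, while a secondary optimizer (here Adam) handles non-matrix parameters. A minimal sketch using the classic cubic Newton-Schulz iteration; the released Muon uses a tuned quintic polynomial and fewer steps, and the hyperparameters below are illustrative assumptions, not values from this submission:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=40):
    """Approximate the orthogonal polar factor of G (the core Muon step).

    Cubic Newton-Schulz: X <- 1.5*X - 0.5*X @ (X.T @ X) drives every singular
    value of X toward 1. Dividing by the Frobenius norm first keeps all
    singular values <= 1, inside the iteration's convergence region.
    """
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

def muon_update(W, grad, buf, lr=0.02, momentum=0.95):
    """One hypothetical Muon-style step: accumulate momentum, orthogonalize, apply."""
    buf = momentum * buf + grad
    W = W - lr * newton_schulz_orthogonalize(buf)
    return W, buf
```

In the combined scheme described above, 2-D hidden-layer weights would take this update while embeddings and scalar parameters fall through to Adam.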
- Quantization: int8 (8 bits), applied to the full model artifact.
- Compression: zlib (compression level not reported).
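A sketch of the int8-plus-zlib packing and the inverse used before adaptation. The submission only states int8 quantization and zlib over the full artifact; the symmetric per-tensor absmax scale below is an assumed scheme for illustration:

```python
import zlib
import numpy as np

def pack(weights):
    """Quantize float32 weights to int8 (assumed symmetric absmax scale), then zlib."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes(), level=9), scale, weights.shape

def unpack(blob, scale, shape):
    """Inverse: decompress and dequantize back to float32 for test-time training."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale
```

int8 alone already cuts the float32 artifact to a quarter of its size; zlib then removes whatever redundancy remains in the quantized bytes, which is what lets the 15.3 MB artifact fit the size cap.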
- Test-time training: full TTT with learning_rate=0.002, epochs=2, momentum=0.9, batch_size=32.
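The adaptation loop is plain SGD with momentum over all parameters, using the stated hyperparameters. A minimal sketch with a tiny linear model standing in for the full Transformer; the data and model here are synthetic placeholders, only the hyperparameters come from the submission:

```python
import numpy as np

# Stated TTT hyperparameters from the submission.
LR, MOMENTUM, EPOCHS, BATCH = 0.002, 0.9, 2, 32

rng = np.random.default_rng(0)
# Synthetic stand-in for the evaluation stream and model: a linear map.
# The real run applies the same update rule to every Transformer parameter.
X = rng.standard_normal((512, 16))
w_true = rng.standard_normal(16)
y = X @ w_true + 0.1 * rng.standard_normal(512)

w = np.zeros(16)
buf = np.zeros(16)

def mse(w):
    return float(np.mean((X @ w - y) ** 2))

loss_before = mse(w)
for _ in range(EPOCHS):
    order = rng.permutation(len(X))
    for i in range(0, len(X), BATCH):
        idx = order[i:i + BATCH]
        xb, yb = X[idx], y[idx]
        grad = 2.0 * xb.T @ (xb @ w - yb) / len(idx)  # batch MSE gradient
        buf = MOMENTUM * buf + grad                   # momentum buffer
        w = w - LR * buf                              # full-parameter update
loss_after = mse(w)
```

Two epochs at batch size 32 keeps the adaptation cheap relative to the evaluation pass it piggybacks on.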
- Evaluation: sliding-window evaluation with stride=64, seq_len=1024.
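With stride 64 and a 1024-token window, each window after the first scores only its newly seen tokens while reusing up to 960 tokens of context, so every position is scored exactly once with near-full context. A sketch of the window bookkeeping (the function name is hypothetical):

```python
def sliding_windows(n_tokens, seq_len=1024, stride=64):
    """Yield (begin, end, target_begin): the model sees tokens [begin, end),
    but only positions [target_begin, end) contribute to the loss."""
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        windows.append((begin, end, prev_end))  # score only unseen tokens
        prev_end = end
        if end == n_tokens:
            break
    return windows
```

A smaller stride means more forward passes but longer average context per scored token, which is the trade-off this submission accepts for a better BPB.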
- Sequence length: eval_length=1024 (training length not reported).
- Other: decompresses the int8+zlib artifact back to full precision before test-time adaptation.
Novel Contributions
- Test-time training during evaluation to adapt the full model on validation data
- Full-model SGD adaptation instead of LoRA-based TTT
- Use of the evaluation budget as additional optimization time for improved BPB
- Int8 plus zlib artifact compression to fit within the submission size cap