val_bpb: 1.0717
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.75 MB
Training Techniques
Architecture
Transformer
A 10-layer transformer with a 512-dimensional hidden size, 8/4 grouped-query attention (GQA), 3x-expansion MLPs with a squared LeakyReLU(0.5) activation, BigramHash token features, SmearGate, value residuals, gated attention, U-Net-style skip connections, and tied embeddings.
parameters: {"layers":10,"dimensions":512,"gqa":"8/4","bigramhash_buckets":10240}
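The "LeakyReLU(0.5)^2" MLP activation above can be sketched as follows. This assumes a plain elementwise square of the leaky output; sign-preserving squared variants also exist, and the source does not specify which is used:

```python
def leaky_relu(x: float, slope: float = 0.5) -> float:
    # LeakyReLU with negative slope 0.5, per the architecture description.
    return x if x >= 0 else slope * x

def mlp_activation(x: float) -> float:
    # Squared LeakyReLU(0.5); the plain elementwise square is an assumption.
    return leaky_relu(x) ** 2
```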
BigramHash
Hash-based bigram token-feature component with 10240 buckets and 128-dimensional embeddings.
parameters: {"buckets":10240,"dimensions":128}
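A minimal sketch of the hashed-bigram lookup, using the bucket and dimension counts listed above. The exact hash function is an assumption (any stable hash of the token pair would do), and the embedding table is shown as zeros where the real model learns it:

```python
import hashlib

BUCKETS = 10240  # from the parameters above
DIM = 128

def bigram_bucket(prev_tok: int, tok: int) -> int:
    # Hash the (previous, current) token pair into a fixed bucket range.
    # blake2b here is an illustrative stand-in for the real hash.
    h = hashlib.blake2b(f"{prev_tok},{tok}".encode(), digest_size=8)
    return int.from_bytes(h.digest(), "little") % BUCKETS

# BUCKETS x DIM embedding table (zeros as a stand-in for learned weights).
table = [[0.0] * DIM for _ in range(BUCKETS)]

def bigram_feature(prev_tok: int, tok: int) -> list:
    # Look up the 128-d feature vector for this bigram.
    return table[bigram_bucket(prev_tok, tok)]
```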
SmearGate
Gating mechanism used in the transformer blocks.
parameters: null
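The source does not define SmearGate beyond "gating mechanism used in the transformer blocks". One plausible reading, sketched here purely as an assumption, is a learned gate that smears each position toward the previous token's representation:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def smear(seq: list, gate_logit: float = 0.0) -> list:
    # Hypothetical "smear" gate: out[t] = (1-g)*x[t] + g*x[t-1],
    # with g in (0, 1) from a learned logit. This reading is an assumption.
    g = sigmoid(gate_logit)
    out = [seq[0]]  # first position has no predecessor to smear from
    for t in range(1, len(seq)):
        out.append((1 - g) * seq[t] + g * seq[t - 1])
    return out
```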
tied embeddings
Input and output embeddings are tied.
parameters: null
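Weight tying means a single matrix serves as both the input token lookup and the output projection, halving embedding storage. A minimal sketch (the tiny vocab and dimensions here are illustrative only):

```python
# One shared matrix: input lookup table and output projection.
VOCAB, DIM = 4, 3
embed = [[0.1 * (i + j) for j in range(DIM)] for i in range(VOCAB)]

def embed_token(tok: int) -> list:
    # Input side: row lookup.
    return embed[tok]

def logits(hidden: list) -> list:
    # Output side: the same weights, transposed (logits = hidden @ embed.T).
    return [sum(h * w for h, w in zip(hidden, row)) for row in embed]
```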
Quantization
mixed int5/int6
bits: mixed (5 and 6)
scope: MLP and attention
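A sketch of symmetric per-tensor integer quantization at a given bit width. How the submission assigns int5 versus int6 across tensor groups is not specified, so only the round-trip mechanics are shown:

```python
def quantize(weights: list, bits: int):
    # Symmetric quantization to signed `bits`-bit integers.
    qmax = 2 ** (bits - 1) - 1          # e.g. 15 for int5, 31 for int6
    m = max(abs(w) for w in weights)
    scale = (m / qmax) if m > 0 else 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q: list, scale: float) -> list:
    # Reconstruct approximate float weights from integers and the scale.
    return [x * scale for x in q]
```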
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.02}
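Muon pairs heavy-ball momentum with an orthogonalized update. A sketch using the hyperparameters listed above (lr 0.02, momentum 0.99, weight decay 0.04); a cubic Newton-Schulz iteration stands in here for the tuned quintic variant Muon uses in practice:

```python
import numpy as np

def newton_schulz(G: np.ndarray, steps: int = 5) -> np.ndarray:
    # Approximately orthogonalize the momentum-averaged gradient:
    # drive its singular values toward 1. Cubic iteration for clarity;
    # Muon uses a tuned quintic polynomial in practice.
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius norm
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_step(W, grad, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    # Heavy-ball momentum, orthogonalized update, decoupled weight decay.
    buf = momentum * buf + grad
    W = W * (1 - lr * weight_decay) - lr * newton_schulz(buf)
    return W, buf
```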
Weight Averaging
EMA
parameters: {"decay":0.995}
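The EMA update with the listed decay of 0.995, applied per parameter; the averaged weights are typically the ones evaluated at inference:

```python
def ema_update(ema: list, weights: list, decay: float = 0.995) -> list:
    # Exponential moving average: ema <- decay*ema + (1-decay)*w.
    return [decay * e + (1 - decay) * w for e, w in zip(ema, weights)]
```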
Evaluation
sliding-window evaluation with a backward-looking 7-gram cache
parameters: {"order":7,"alpha":0.4,"hash_buckets":4000000,"min_count":2,"score_first":true,"deterministic":true}
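A sketch of the backward-looking n-gram cache using the parameters listed above (order 7, alpha 0.4, 4M hash buckets, min_count 2). "Score-first" means the current token is scored against the cache before its own occurrence is counted; the exact min_count gating semantics are an assumption:

```python
from collections import defaultdict

ORDER, ALPHA, BUCKETS, MIN_COUNT = 7, 0.4, 4_000_000, 2

# context bucket -> next-token -> count
counts = defaultdict(lambda: defaultdict(int))

def context_bucket(history: list) -> int:
    # Hash the last ORDER-1 tokens into a fixed bucket range.
    return hash(tuple(history[-(ORDER - 1):])) % BUCKETS

def score_then_update(history: list, token: int, model_prob: float) -> float:
    # Score-first: blend in the cache estimate for `token` BEFORE this
    # occurrence is added, so a token never scores against itself.
    ctx = counts[context_bucket(history)]
    total = sum(ctx.values())
    if total >= MIN_COUNT and ctx.get(token, 0) > 0:  # gating is assumed
        p = (1 - ALPHA) * model_prob + ALPHA * ctx[token] / total
    else:
        p = model_prob
    ctx[token] += 1  # update only after scoring
    return p
```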
Test-Time Training
score-first, TTT-like cache update (no gradient-based adaptation)
parameters: {"gradient_updates":false,"ttt":false}
Regularization
weight decay
parameters: {"weight_decay":0.04}
Novel Contributions
- 10-layer transformer with a compact architecture tuned for the competition constraints
- Backward-looking 7-gram evaluation cache to improve validation performance during inference
- Score-first cache update strategy with deterministic evaluation and no gradient-based test-time adaptation
- Mixed int5/int6 quantization combined with zstd-22 compression to fit within the artifact size limit
- Use of EMA weight averaging, pruning, and architectural enhancements (BigramHash, SmearGate, U-Net skip connections)