PR #724

open

Record: 10L + 7-gram eval cache (mean val_bpb=1.0717)

by hypery11
val_bpb: 1.0717
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.75 MB

Training Techniques

Architecture
Transformer
10-layer transformer with a 512-dimensional hidden size, 8/4 grouped-query attention (8 query heads sharing 4 key/value heads), a 3x-expansion MLP with squared LeakyReLU(0.5) activation, BigramHash, SmearGate, value residual, gated attention, U-Net-style skip connections, and tied embeddings.
parameters: {"layers":10,"dimensions":512,"gqa":"8/4","bigramhash_buckets":10240}
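The stated 8/4 GQA layout (512-d hidden, so 64-d heads with each pair of query heads sharing one key/value head) can be sketched shape-wise in numpy. This illustrates the attention layout only, not the PR's code; sequence length is arbitrary:

```python
import numpy as np

# GQA sketch matching the stated 8/4 layout: 512-d hidden,
# 8 query heads, 4 key/value heads -> 64-d heads.
d_model, n_q, n_kv = 512, 8, 4
head_dim = d_model // n_q           # 64
T = 16                              # sequence length (arbitrary)

q = np.random.randn(n_q, T, head_dim)
k = np.random.randn(n_kv, T, head_dim)
v = np.random.randn(n_kv, T, head_dim)

# Each group of n_q // n_kv = 2 query heads shares one KV head.
k = np.repeat(k, n_q // n_kv, axis=0)   # (8, T, 64)
v = np.repeat(v, n_q // n_kv, axis=0)   # (8, T, 64)

scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)   # (8, T, T)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)        # causal mask
scores = np.where(mask, -1e9, scores)
w = np.exp(scores - scores.max(-1, keepdims=True))
w = w / w.sum(-1, keepdims=True)                        # softmax
out = (w @ v).transpose(1, 0, 2).reshape(T, d_model)    # (T, 512)
```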
BigramHash
Hash-based token-pair feature component with 10,240 buckets and a 128-dimensional embedding per bucket.
parameters: {"buckets":10240,"dimensions":128}
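A hash-based bigram feature of this kind can be sketched as follows: hash each (previous token, current token) pair into one of the 10,240 buckets and look up that bucket's 128-d embedding. This is a minimal illustration under assumed details (function and variable names are hypothetical), not the PR's implementation:

```python
import numpy as np

# Hypothetical BigramHash sketch: a table of 10240 buckets x 128 dims,
# indexed by a hash of each consecutive token pair.
BUCKETS, DIM = 10240, 128
np.random.seed(0)
table = np.random.randn(BUCKETS, DIM).astype(np.float32) * 0.02

def bigram_hash_features(tokens):
    feats = np.zeros((len(tokens), DIM), dtype=np.float32)
    prev = 0  # assumed BOS token id
    for i, t in enumerate(tokens):
        bucket = hash((prev, t)) % BUCKETS
        feats[i] = table[bucket]
        prev = t
    return feats

feats = bigram_hash_features([5, 17, 17, 9])   # (4, 128)
```

Collisions are accepted by design: distinct bigrams that hash to the same bucket share an embedding, which is what keeps the table small.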
SmearGate
Gating mechanism used in the transformer blocks.
parameters: null
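The PR lists no parameters for SmearGate. In similar speedrun-style models, a "smear" gate blends each position's representation with the previous position's through a learned sigmoid gate; the sketch below is hypothetical and follows that assumption only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical SmearGate sketch: mix in the previous position's state,
# weighted by a learned per-position sigmoid gate. All details assumed.
T, D = 16, 512
np.random.seed(0)
x = np.random.randn(T, D).astype(np.float32)
w_gate = (np.random.randn(D) * 0.02).astype(np.float32)  # assumed gate weights

x_prev = np.vstack([np.zeros((1, D), np.float32), x[:-1]])  # shift right
g = sigmoid(x @ w_gate)[:, None]    # (T, 1) gate per position
out = x + g * x_prev                # smear previous state forward
```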
tied embeddings
Input and output embeddings are tied.
parameters: null
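Tying embeddings means the input lookup table and the output logit projection share a single weight matrix, which roughly halves embedding storage. A minimal sketch (the vocab size here is a placeholder, not from the PR):

```python
import numpy as np

# Tied embeddings: one matrix W is both the input embedding (row lookup)
# and the output projection (logits = h @ W.T). 512-d matches the PR;
# the vocab size is an arbitrary placeholder.
vocab, d_model = 1000, 512
np.random.seed(0)
W = (np.random.randn(vocab, d_model) * 0.02).astype(np.float32)

tokens = np.array([3, 7, 42])
h = W[tokens]          # input embedding lookup: (3, 512)
logits = h @ W.T       # output projection reuses the same W: (3, 1000)
```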
Quantization
mixed int5/int6
bits: null
scope: MLP and attention
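A minimal sketch of symmetric per-tensor quantization at the named bit-widths (15 levels per sign for int5, 31 for int6). The rounding scheme and per-tensor grouping are assumptions, not taken from the PR:

```python
import numpy as np

# Symmetric quantization sketch at the PR's bit-widths (int5/int6).
# Grouping and rounding details are assumptions.
def quantize(w, bits):
    qmax = 2 ** (bits - 1) - 1                 # 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

np.random.seed(0)
w = np.random.randn(256, 256).astype(np.float32)
q5, s5 = quantize(w, 5)
err = np.abs(dequantize(q5, s5) - w).max()     # bounded by scale / 2
```

Int5/int6 values would still be packed into sub-byte storage in the real artifact; the int8 container here is only for illustration.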
Compression
zstd
level: 22
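The compression step could look like the CLI sketch below (filenames are placeholders; note that zstd levels above 19 require `--ultra`):

```shell
# Hypothetical artifact compression at the PR's level 22.
# Placeholder file stands in for the quantized model dump.
head -c 100000 /dev/zero > model_quantized.bin
zstd --ultra -22 -f -q model_quantized.bin -o model_artifact.zst
zstd -d -f -q model_artifact.zst -o model_roundtrip.bin
cmp -s model_quantized.bin model_roundtrip.bin && echo lossless
```

zstd is lossless, so the quantization step above sets the accuracy floor and the compressor only trades encode time for artifact size.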
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.02}
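Muon takes the momentum-averaged gradient of each 2-D weight and orthogonalizes it with a few Newton–Schulz iterations before applying it. Below is a numpy sketch of that orthogonalization step, using the quintic coefficients from the commonly cited Muon reference implementation (which typically runs ~5 iterations); this is illustrative, not the PR's code:

```python
import numpy as np

# Newton-Schulz iteration that pushes a matrix's singular values toward 1:
# the orthogonalization step at the heart of Muon. Coefficients follow the
# commonly used Muon reference values.
def newton_schulz(G, steps=5):
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)       # Frobenius normalization
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

np.random.seed(0)
O = newton_schulz(np.random.randn(64, 64), steps=15)
sv = np.linalg.svd(O, compute_uv=False)      # singular values near 1
```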
Weight Averaging
EMA
parameters: {"decay":0.995}
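EMA with decay 0.995 keeps a shadow copy of the weights that trails the live weights, moving 0.5% of the remaining gap per step. A minimal sketch:

```python
import numpy as np

# EMA weight averaging with the PR's decay of 0.995: after each training
# step the shadow weights move a small fraction toward the live weights.
DECAY = 0.995

def ema_update(shadow, live, decay=DECAY):
    for name in shadow:
        shadow[name] = decay * shadow[name] + (1.0 - decay) * live[name]

np.random.seed(0)
live = {"w": np.random.randn(4, 4)}
shadow = {"w": live["w"].copy()}
live["w"] += 1.0            # simulate a training step moving the weights
ema_update(shadow, live)    # shadow moves 0.5% of the way toward live
```

The shadow copy, not the live weights, is what gets exported for evaluation.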
Evaluation
sliding window eval with backward-looking 7-gram cache
parameters: {"order":7,"alpha":0.4,"hash_buckets":4000000,"min_count":2,"score_first":true,"deterministic":true}
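Putting the listed parameters together, a backward-looking cache of this kind might look like the sketch below: each token is scored first (score_first=True) from a blend of model and cache distributions, then the observed 7-gram is inserted into the hashed cache. The exact blending rule and helper names are assumptions, not the PR's code:

```python
from collections import defaultdict

# Sketch of a backward-looking 7-gram eval cache with the PR's listed
# parameters (order=7, alpha=0.4, hash_buckets=4_000_000, min_count=2,
# score_first=True). Blending rule assumed: once a hashed context has
# been seen min_count times, mix its empirical next-token distribution
# into the model's distribution with weight alpha.
ORDER, ALPHA, BUCKETS, MIN_COUNT = 7, 0.4, 4_000_000, 2
cache = defaultdict(lambda: defaultdict(int))   # bucket -> next-token counts

def blended_prob(model_probs, history, token):
    """Score first, then update the cache (score_first=True)."""
    p = model_probs[token]
    if len(history) >= ORDER - 1:
        ctx = tuple(history[-(ORDER - 1):])     # previous 6 tokens
        counts = cache[hash(ctx) % BUCKETS]
        total = sum(counts.values())
        if total >= MIN_COUNT:
            p = (1 - ALPHA) * p + ALPHA * counts[token] / total
        counts[token] += 1                      # update only after scoring
    return p
```

No weights change during evaluation: the update touches only integer counts, which keeps the pass deterministic, as the PR's flags state.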
Test-Time Training
score-first TTT-like cache update
parameters: {"gradient_updates":false,"ttt":false}
Regularization
weight decay
parameters: {"weight_decay":0.04}

Novel Contributions

  • 10-layer transformer with a compact architecture tuned for the competition constraints
  • Backward-looking 7-gram evaluation cache to improve validation performance during inference
  • Score-first cache update strategy with deterministic evaluation and no gradient-based test-time adaptation
  • Mixed int5/int6 quantization combined with zstd-22 compression to fit within the artifact size limit
  • EMA weight averaging, pruning, and the BigramHash, SmearGate, and U-Net skip-connection architectural enhancements