PR #724

open

Record: 10L + 7-gram eval cache (mean val_bpb=1.0717)

by hypery11
val_bpb: 1.0717
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.75 MB

Training Techniques

Architecture
Transformer
10-layer transformer with a 512-dimensional hidden size, 8/4 grouped-query attention (8 query heads sharing 4 key/value heads), a 3x-expansion MLP with squared LeakyReLU(0.5) activation, BigramHash, SmearGate, value residual, gated attention, U-Net-style skip connections, and tied embeddings.
parameters: {"layers":10,"dimensions":512,"gqa":"8/4","bigramhash_buckets":10240}
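The stated 8/4 GQA layout (512-d hidden, so 64-d heads with each pair of query heads sharing one key/value head) can be sketched shape-wise in numpy. This illustrates the attention layout only, not the PR's code; sequence length is arbitrary:

```python
import numpy as np

# GQA sketch matching the stated 8/4 layout: 512-d hidden,
# 8 query heads, 4 key/value heads -> 64-d heads.
d_model, n_q, n_kv = 512, 8, 4
head_dim = d_model // n_q           # 64
T = 16                              # sequence length (arbitrary)

q = np.random.randn(n_q, T, head_dim)
k = np.random.randn(n_kv, T, head_dim)
v = np.random.randn(n_kv, T, head_dim)

# Each group of n_q // n_kv = 2 query heads shares one KV head.
k = np.repeat(k, n_q // n_kv, axis=0)   # (8, T, 64)
v = np.repeat(v, n_q // n_kv, axis=0)   # (8, T, 64)

scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)   # (8, T, T)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)        # causal mask
scores = np.where(mask, -1e9, scores)
w = np.exp(scores - scores.max(-1, keepdims=True))
w = w / w.sum(-1, keepdims=True)                        # softmax
out = (w @ v).transpose(1, 0, 2).reshape(T, d_model)    # (T, 512)
```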
BigramHash
Hash-based token-pair feature component with 10,240 buckets and a 128-dimensional embedding per bucket.
parameters: {"buckets":10240,"dimensions":128}
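A hash-based bigram feature of this kind can be sketched as follows: hash each (previous token, current token) pair into one of the 10,240 buckets and look up that bucket's 128-d embedding. This is a minimal illustration under assumed details (function and variable names are hypothetical), not the PR's implementation:

```python
import numpy as np

# Hypothetical BigramHash sketch: a table of 10240 buckets x 128 dims,
# indexed by a hash of each consecutive token pair.
BUCKETS, DIM = 10240, 128
np.random.seed(0)
table = np.random.randn(BUCKETS, DIM).astype(np.float32) * 0.02

def bigram_hash_features(tokens):
    feats = np.zeros((len(tokens), DIM), dtype=np.float32)
    prev = 0  # assumed BOS token id
    for i, t in enumerate(tokens):
        bucket = hash((prev, t)) % BUCKETS
        feats[i] = table[bucket]
        prev = t
    return feats

feats = bigram_hash_features([5, 17, 17, 9])   # (4, 128)
```

Collisions are accepted by design: distinct bigrams that hash to the same bucket share an embedding, which is what keeps the table small.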
SmearGate
Gating mechanism used in the transformer blocks.
parameters: null
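The PR lists no parameters for SmearGate. In similar speedrun-style models, a "smear" gate blends each position's representation with the previous position's through a learned sigmoid gate; the sketch below is hypothetical and follows that assumption only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical SmearGate sketch: mix in the previous position's state,
# weighted by a learned per-position sigmoid gate. All details assumed.
T, D = 16, 512
np.random.seed(0)
x = np.random.randn(T, D).astype(np.float32)
w_gate = (np.random.randn(D) * 0.02).astype(np.float32)  # assumed gate weights

x_prev = np.vstack([np.zeros((1, D), np.float32), x[:-1]])  # shift right
g = sigmoid(x @ w_gate)[:, None]    # (T, 1) gate per position
out = x + g * x_prev                # smear previous state forward
```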
tied embeddings
Input and output embeddings are tied.
parameters: null
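Tying embeddings means the input lookup table and the output logit projection share a single weight matrix, which roughly halves embedding storage. A minimal sketch (the vocab size here is a placeholder, not from the PR):

```python
import numpy as np

# Tied embeddings: one matrix W is both the input embedding (row lookup)
# and the output projection (logits = h @ W.T). 512-d matches the PR;
# the vocab size is an arbitrary placeholder.
vocab, d_model = 1000, 512
np.random.seed(0)
W = (np.random.randn(vocab, d_model) * 0.02).astype(np.float32)

tokens = np.array([3, 7, 42])
h = W[tokens]          # input embedding lookup: (3, 512)
logits = h @ W.T       # output projection reuses the same W: (3, 1000)
```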
Quantization
mixed int5/int6
bits: null
scope: MLP and attention
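A minimal sketch of symmetric per-tensor quantization at the named bit-widths (15 levels per sign for int5, 31 for int6). The rounding scheme and per-tensor grouping are assumptions, not taken from the PR:

```python
import numpy as np

# Symmetric quantization sketch at the PR's bit-widths (int5/int6).
# Grouping and rounding details are assumptions.
def quantize(w, bits):
    qmax = 2 ** (bits - 1) - 1                 # 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

np.random.seed(0)
w = np.random.randn(256, 256).astype(np.float32)
q5, s5 = quantize(w, 5)
err = np.abs(dequantize(q5, s5) - w).max()     # bounded by scale / 2
```

Int5/int6 values would still be packed into sub-byte storage in the real artifact; the int8 container here is only for illustration.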
Compression
zstd
level: 22
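The compression step could look like the CLI sketch below (filenames are placeholders; note that zstd levels above 19 require `--ultra`):

```shell
# Hypothetical artifact compression at the PR's level 22.
# Placeholder file stands in for the quantized model dump.
head -c 100000 /dev/zero > model_quantized.bin
zstd --ultra -22 -f -q model_quantized.bin -o model_artifact.zst
zstd -d -f -q model_artifact.zst -o model_roundtrip.bin
cmp -s model_quantized.bin model_roundtrip.bin && echo lossless
```

zstd is lossless, so the quantization step above sets the accuracy floor and the compressor only trades encode time for artifact size.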
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.02}
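Muon takes the momentum-averaged gradient of each 2-D weight and orthogonalizes it with a few Newton–Schulz iterations before applying it. Below is a numpy sketch of that orthogonalization step, using the quintic coefficients from the commonly cited Muon reference implementation (which typically runs ~5 iterations); this is illustrative, not the PR's code:

```python
import numpy as np

# Newton-Schulz iteration that pushes a matrix's singular values toward 1:
# the orthogonalization step at the heart of Muon. Coefficients follow the
# commonly used Muon reference values.
def newton_schulz(G, steps=5):
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)       # Frobenius normalization
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

np.random.seed(0)
O = newton_schulz(np.random.randn(64, 64), steps=15)
sv = np.linalg.svd(O, compute_uv=False)      # singular values near 1
```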
Weight Averaging
EMA
parameters: {"decay":0.995}
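EMA with decay 0.995 keeps a shadow copy of the weights that trails the live weights, moving 0.5% of the remaining gap per step. A minimal sketch:

```python
import numpy as np

# EMA weight averaging with the PR's decay of 0.995: after each training
# step the shadow weights move a small fraction toward the live weights.
DECAY = 0.995

def ema_update(shadow, live, decay=DECAY):
    for name in shadow:
        shadow[name] = decay * shadow[name] + (1.0 - decay) * live[name]

np.random.seed(0)
live = {"w": np.random.randn(4, 4)}
shadow = {"w": live["w"].copy()}
live["w"] += 1.0            # simulate a training step moving the weights
ema_update(shadow, live)    # shadow moves 0.5% of the way toward live
```

The shadow copy, not the live weights, is what gets exported for evaluation.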
Evaluation
sliding window eval with backward-looking 7-gram cache
parameters: {"order":7,"alpha":0.4,"hash_buckets":4000000,"min_count":2,"score_first":true,"deterministic":true}
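Putting the listed parameters together, a backward-looking cache of this kind might look like the sketch below: each token is scored first (score_first=True) from a blend of model and cache distributions, then the observed 7-gram is inserted into the hashed cache. The exact blending rule and helper names are assumptions, not the PR's code:

```python
from collections import defaultdict

# Sketch of a backward-looking 7-gram eval cache with the PR's listed
# parameters (order=7, alpha=0.4, hash_buckets=4_000_000, min_count=2,
# score_first=True). Blending rule assumed: once a hashed context has
# been seen min_count times, mix its empirical next-token distribution
# into the model's distribution with weight alpha.
ORDER, ALPHA, BUCKETS, MIN_COUNT = 7, 0.4, 4_000_000, 2
cache = defaultdict(lambda: defaultdict(int))   # bucket -> next-token counts

def blended_prob(model_probs, history, token):
    """Score first, then update the cache (score_first=True)."""
    p = model_probs[token]
    if len(history) >= ORDER - 1:
        ctx = tuple(history[-(ORDER - 1):])     # previous 6 tokens
        counts = cache[hash(ctx) % BUCKETS]
        total = sum(counts.values())
        if total >= MIN_COUNT:
            p = (1 - ALPHA) * p + ALPHA * counts[token] / total
        counts[token] += 1                      # update only after scoring
    return p
```

No weights change during evaluation: the update touches only integer counts, which keeps the pass deterministic, as the PR's flags state.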
Test-Time Training
score-first TTT-like cache update
parameters: {"gradient_updates":false,"ttt":false}
Regularization
weight decay
parameters: {"weight_decay":0.04}

Novel Contributions

  • 10-layer transformer with a compact architecture tuned for the competition constraints
  • Backward-looking 7-gram evaluation cache to improve validation performance during inference
  • Score-first cache update strategy with deterministic evaluation and no gradient-based test-time adaptation
  • Mixed int5/int6 quantization combined with zstd-22 compression to fit within the artifact size limit
  • EMA weight averaging, pruning, and the BigramHash, SmearGate, and U-Net skip-connection architectural enhancements