PR #1384
Progressive Depth + Hedge Mixer — val_bpb 1.1441 (3-seed mean)
by iverbovoy
val_bpb
1.1441
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.88 MB
Training Techniques
Architecture
depth recurrence
3 shared transformer blocks unrolled over 4 repeats (12 effective layers) with cross-repeat skip connections, loop embeddings, and value embeddings.
parameters: {"layers":3,"repeats":4,"effective_layers":12}
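A minimal sketch of the recurrence, assuming the cross-repeat skip is a residual add and the loop embedding is an additive per-repeat offset (the exact wiring is not specified in the card; `blocks` and `loop_embed` here are hypothetical placeholders):

```python
def depth_recurrent_forward(x, blocks, repeats=4, loop_embed=None):
    """Unroll a small stack of weight-tied blocks several times.

    With 3 shared blocks and repeats=4 this gives 3 * 4 = 12 effective
    layers from only 3 blocks' worth of parameters (weight tying).
    """
    for r in range(repeats):
        # Loop embedding tells the shared blocks which repeat they are in
        # (assumed additive here).
        h = x + (loop_embed(r) if loop_embed else 0.0)
        for block in blocks:          # the same 3 blocks every repeat
            h = block(h)
        x = x + h                     # cross-repeat skip (assumed residual form)
    return x
```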
XSA
Exclusive self-attention applied to the last 4 effective layers to mitigate attention collapse in deep recurrent models.
parameters: {"layers":4}
LeakyReLU
LeakyReLU(0.5)^2 activation used to improve gradient flow in deep/recurrent models.
parameters: {"slope":0.5}
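One plausible reading of the LeakyReLU(0.5)^2 activation is squaring the output of a LeakyReLU with negative slope 0.5 (the submission may instead preserve the sign on the negative branch; that detail is not stated):

```python
def leaky_relu(x, slope=0.5):
    """LeakyReLU: identity for positive inputs, scaled-down negatives."""
    return x if x > 0 else slope * x

def sq_leaky_relu(x, slope=0.5):
    """LeakyReLU(slope)^2: square of the LeakyReLU output.

    Squaring keeps the function smooth near zero while the leaky slope
    keeps gradient flowing on the negative side, which is the stated
    motivation for deep/recurrent models.
    """
    y = leaky_relu(x, slope)
    return y * y
```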
weight tying
Shared weights across repeated blocks.
parameters: null
BigramHash
Bigram-based hashed context used as part of the Hedge Mixer evaluation ensemble.
parameters: null
TrigramHash
Hashed trigram context used as part of the Hedge Mixer evaluation ensemble.
parameters: {"buckets":65000}
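A sketch of the hashed n-gram context used by these experts: the last two or three token ids are hashed into a fixed number of buckets, and each bucket indexes a table of next-token counts. The mixing constant below is illustrative, not the submission's actual hash:

```python
def ngram_bucket(tokens, n, buckets=65000):
    """Hash the last n token ids into one of `buckets` slots.

    A bigram expert uses n=2, the trigram expert n=3 with 65000 buckets.
    The polynomial-rolling hash here is a common choice, assumed for
    illustration only.
    """
    h = 0
    for t in tokens[-n:]:
        h = (h * 1000003 + t) % buckets   # 1000003 is an arbitrary prime mixer
    return h
```

At evaluation time the bucket's accumulated next-token counts are normalized into a probability distribution and fed to the Hedge Mixer as one expert's prediction.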
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"matrix_params":"Muon","scalar_params":"Adam","tied_embed_lr":0.015}
Weight Averaging
SWA
parameters: {"start":"warmdown","interval_steps":50,"checkpoints_averaged":"13-16"}
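The SWA step here averages checkpoints 13-16, saved every 50 steps during warmdown. A minimal sketch, assuming a uniform average over float-valued parameter dicts (real checkpoints hold tensors, but the arithmetic is the same):

```python
def average_checkpoints(checkpoints):
    """Uniformly average parameters across saved checkpoints (SWA).

    `checkpoints` is a list of dicts mapping parameter name -> value.
    Averaging late-training checkpoints tends to land in a flatter,
    better-generalizing region of the loss surface.
    """
    n = len(checkpoints)
    return {k: sum(ckpt[k] for ckpt in checkpoints) / n
            for k in checkpoints[0]}
```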
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: null
Other
other
Hedge Mixer online ensemble evaluation combining neural, unigram, bigram, trigram, and entropy experts.
parameters: {"experts":5,"eta":0.1,"initial_log_weight_neural":2}
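The Hedge algorithm is a standard multiplicative-weights online ensemble: each expert's log-weight is decreased in proportion to its per-token loss, so better-predicting experts dominate the mixture over time. A sketch consistent with the listed parameters (5 experts, eta=0.1, neural expert starting at log-weight 2, the rest assumed to start at 0):

```python
import math

def hedge_update(log_weights, losses, eta=0.1):
    """Hedge step: penalize each expert's log-weight by eta * its loss."""
    return [lw - eta * loss for lw, loss in zip(log_weights, losses)]

def mix(expert_probs, log_weights):
    """Combine expert next-token distributions with normalized Hedge weights."""
    m = max(log_weights)                       # subtract max for stability
    w = [math.exp(lw - m) for lw in log_weights]
    z = sum(w)
    w = [wi / z for wi in w]
    vocab = len(expert_probs[0])
    return [sum(wi * p[i] for wi, p in zip(w, expert_probs))
            for i in range(vocab)]
```

The initial log-weight of 2 on the neural expert biases the mixture toward the transformer until the n-gram and entropy experts prove useful on a given context.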
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
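A sketch of the warmdown schedule, assuming the common speedrun form: hold the base LR constant, then decay linearly to zero over the final 3000 steps (the decay shape is an assumption; only the warmdown length is given):

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3000):
    """Constant LR, then linear decay to 0 over the last warmdown_steps."""
    if step < total_steps - warmdown_steps:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps
```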
Regularization
logit softcap
parameters: {"value":30}
Novel Contributions
- Progressive depth training with increasing repeats over time
- Depth recurrence with shared blocks, cross-repeat skip connections, loop embeddings, and value embeddings
- Hedge Mixer 5-expert online ensemble for evaluation-time improvement
- Clean 3-seed validation of the submission