PR #1384
Progressive Depth + Hedge Mixer — val_bpb 1.1441 (3-seed mean)
by iverbovoy
val_bpb
1.1441
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.88 MB
Training Techniques
Architecture
depth recurrence
3 shared transformer blocks unrolled over 4 repeats (12 effective layers) with cross-repeat skip connections, loop embeddings, and value embeddings.
parameters: {"layers":3,"repeats":4,"effective_layers":12}
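A minimal sketch of the recurrence, assuming the cross-repeat skip is a residual add and the loop embedding is an additive per-repeat offset (the exact wiring is not specified in the card; `blocks` and `loop_embed` here are hypothetical placeholders):

```python
def depth_recurrent_forward(x, blocks, repeats=4, loop_embed=None):
    """Unroll a small stack of weight-tied blocks several times.

    With 3 shared blocks and repeats=4 this gives 3 * 4 = 12 effective
    layers from only 3 blocks' worth of parameters (weight tying).
    """
    for r in range(repeats):
        # Loop embedding tells the shared blocks which repeat they are in
        # (assumed additive here).
        h = x + (loop_embed(r) if loop_embed else 0.0)
        for block in blocks:          # the same 3 blocks every repeat
            h = block(h)
        x = x + h                     # cross-repeat skip (assumed residual form)
    return x
```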
XSA
Exclusive self-attention applied to the last 4 effective layers to mitigate attention collapse in deep recurrent models.
parameters: {"layers":4}
LeakyReLU
LeakyReLU(0.5)^2 activation used to improve gradient flow in deep/recurrent models.
parameters: {"slope":0.5}
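One plausible reading of the LeakyReLU(0.5)^2 activation is squaring the output of a LeakyReLU with negative slope 0.5 (the submission may instead preserve the sign on the negative branch; that detail is not stated):

```python
def leaky_relu(x, slope=0.5):
    """LeakyReLU: identity for positive inputs, scaled-down negatives."""
    return x if x > 0 else slope * x

def sq_leaky_relu(x, slope=0.5):
    """LeakyReLU(slope)^2: square of the LeakyReLU output.

    Squaring keeps the function smooth near zero while the leaky slope
    keeps gradient flowing on the negative side, which is the stated
    motivation for deep/recurrent models.
    """
    y = leaky_relu(x, slope)
    return y * y
```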
weight tying
Shared weights across repeated blocks.
parameters: null
BigramHash
Bigram-based hashed context used as part of the Hedge Mixer evaluation ensemble.
parameters: null
TrigramHash
Hashed trigram context used as part of the Hedge Mixer evaluation ensemble.
parameters: {"buckets":65000}
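A sketch of the hashed n-gram context used by these experts: the last two or three token ids are hashed into a fixed number of buckets, and each bucket indexes a table of next-token counts. The mixing constant below is illustrative, not the submission's actual hash:

```python
def ngram_bucket(tokens, n, buckets=65000):
    """Hash the last n token ids into one of `buckets` slots.

    A bigram expert uses n=2, the trigram expert n=3 with 65000 buckets.
    The polynomial-rolling hash here is a common choice, assumed for
    illustration only.
    """
    h = 0
    for t in tokens[-n:]:
        h = (h * 1000003 + t) % buckets   # 1000003 is an arbitrary prime mixer
    return h
```

At evaluation time the bucket's accumulated next-token counts are normalized into a probability distribution and fed to the Hedge Mixer as one expert's prediction.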
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"matrix_params":"Muon","scalar_params":"Adam","tied_embed_lr":0.015}
Weight Averaging
SWA
parameters: {"start":"warmdown","interval_steps":50,"checkpoints_averaged":"13-16"}
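The SWA step here averages checkpoints 13-16, saved every 50 steps during warmdown. A minimal sketch, assuming a uniform average over float-valued parameter dicts (real checkpoints hold tensors, but the arithmetic is the same):

```python
def average_checkpoints(checkpoints):
    """Uniformly average parameters across saved checkpoints (SWA).

    `checkpoints` is a list of dicts mapping parameter name -> value.
    Averaging late-training checkpoints tends to land in a flatter,
    better-generalizing region of the loss surface.
    """
    n = len(checkpoints)
    return {k: sum(ckpt[k] for ckpt in checkpoints) / n
            for k in checkpoints[0]}
```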
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: null
Other
other
Hedge Mixer online ensemble evaluation combining neural, unigram, bigram, trigram, and entropy experts.
parameters: {"experts":5,"eta":0.1,"initial_log_weight_neural":2}
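The Hedge algorithm is a standard multiplicative-weights online ensemble: each expert's log-weight is decreased in proportion to its per-token loss, so better-predicting experts dominate the mixture over time. A sketch consistent with the listed parameters (5 experts, eta=0.1, neural expert starting at log-weight 2, the rest assumed to start at 0):

```python
import math

def hedge_update(log_weights, losses, eta=0.1):
    """Hedge step: penalize each expert's log-weight by eta * its loss."""
    return [lw - eta * loss for lw, loss in zip(log_weights, losses)]

def mix(expert_probs, log_weights):
    """Combine expert next-token distributions with normalized Hedge weights."""
    m = max(log_weights)                       # subtract max for stability
    w = [math.exp(lw - m) for lw in log_weights]
    z = sum(w)
    w = [wi / z for wi in w]
    vocab = len(expert_probs[0])
    return [sum(wi * p[i] for wi, p in zip(w, expert_probs))
            for i in range(vocab)]
```

The initial log-weight of 2 on the neural expert biases the mixture toward the transformer until the n-gram and entropy experts prove useful on a given context.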
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
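A sketch of the warmdown schedule, assuming the common speedrun form: hold the base LR constant, then decay linearly to zero over the final 3000 steps (the decay shape is an assumption; only the warmdown length is given):

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3000):
    """Constant LR, then linear decay to 0 over the last warmdown_steps."""
    if step < total_steps - warmdown_steps:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps
```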
Regularization
logit softcap
parameters: {"value":30}
Novel Contributions
- Progressive depth training with increasing repeats over time
- Depth recurrence with shared blocks, cross-repeat skip connections, loop embeddings, and value embeddings
- Hedge Mixer 5-expert online ensemble for evaluation-time improvement
- Clean 3-seed validation of the submission