val_bpb
1.1454
Architecture
Transformer
Optimizer
Muon + Adam
Artifact Size
15.88 MB
Training Techniques
Architecture
depth recurrence
Replaces unique transformer blocks with a small pool of shared blocks repeated across depth, yielding effectively deeper computation with fewer unique parameters.
parameters: {"blocks":3,"repeats":4,"effective_layers":12}
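The parameter sharing above can be sketched in a few lines; this is a minimal illustration (the stand-in blocks and their arithmetic are hypothetical, not the entry's actual block code), showing how 3 unique blocks looped 4 times produce 12 effective layers:

```python
# Depth recurrence sketch: a small pool of unique blocks is reused across
# repeats, so 3 blocks x 4 repeats = 12 effective layers while only
# 3 blocks' worth of parameters are ever stored.

BLOCKS = 3     # unique transformer blocks (parameters stored once)
REPEATS = 4    # how many times the whole stack is looped

def make_block(i):
    # Stand-in for a transformer block; here just a tagged shift.
    def block(x):
        return x + 0.1 * (i + 1)
    return block

blocks = [make_block(i) for i in range(BLOCKS)]

def forward(x):
    trace = []                       # which unique block ran at each depth
    for _ in range(REPEATS):
        for i, block in enumerate(blocks):
            x = block(x)
            trace.append(i)
    return x, trace

out, trace = forward(0.0)
```

The trace makes the weight tying visible: the same 3 block indices recur 4 times, giving 12 layer applications.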
cross-repeat skip
Adds a weighted residual from the previous repeat to make the recurrent depth stateful.
parameters: null
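One plausible reading of the cross-repeat skip is sketched below; the mixing form and the learned scalar `ALPHA` are assumptions for illustration, since the entry reports no parameters for this technique:

```python
# Cross-repeat skip sketch (assumed form): the input to repeat r is mixed
# with the output of repeat r-1 via a learned scalar, so information
# carries statefully across repeats instead of each repeat starting fresh.

ALPHA = 0.5  # hypothetical learned skip weight

def block_stack(x):
    return x * 0.9 + 1.0   # stand-in for one pass through the shared blocks

def forward(x, repeats=4):
    prev = None
    for _ in range(repeats):
        if prev is not None:
            x = x + ALPHA * prev   # weighted residual from previous repeat
        prev = x
        x = block_stack(x)
    return x
```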
value embeddings
Adds two extra embedding tables mixed into the residual stream at each effective layer with learned scales.
parameters: {"tables":2}
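A minimal sketch of the two value-embedding tables, assuming each table is looked up by token id and mixed into the residual stream with its own learned scalar (the dimensions, init, and scale values here are illustrative, not the entry's):

```python
# Value-embedding sketch: two extra embedding tables whose lookups are
# added into the residual stream at each effective layer, each weighted
# by its own learned scale.
import random

VOCAB, DIM, TABLES = 16, 4, 2
random.seed(0)
value_tables = [
    [[random.gauss(0, 0.02) for _ in range(DIM)] for _ in range(VOCAB)]
    for _ in range(TABLES)
]
scales = [0.5, 0.25]   # one learned scalar per table (hypothetical values)

def mix_value_embeddings(resid, token_id):
    # resid: residual-stream vector for one position
    for table, s in zip(value_tables, scales):
        emb = table[token_id]
        resid = [r + s * e for r, e in zip(resid, emb)]
    return resid
```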
loop embedding
Learns a per-layer vector added before each block as depth-wise positional encoding.
parameters: null
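The loop embedding can be sketched as one learned vector per effective layer, added to the hidden state before the block runs (the dimension and init below are hypothetical):

```python
# Loop-embedding sketch: a per-depth learned vector added before each
# block acts as depth-wise positional encoding, letting shared blocks
# tell which effective layer they are currently computing.

DIM, EFFECTIVE_LAYERS = 4, 12
# Hypothetical init: a distinct constant vector per effective layer.
loop_emb = [[0.01 * (d + 1)] * DIM for d in range(EFFECTIVE_LAYERS)]

def apply_block(x, depth):
    x = [xi + e for xi, e in zip(x, loop_emb[depth])]  # add depth vector
    return x  # ...the shared block's computation would follow here
```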
KV head count
Uses grouped-query attention: 8 query heads share 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
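The head grouping implied by these counts is simple to state: each KV head serves a contiguous group of query heads. A minimal sketch of that mapping:

```python
# Grouped-query attention head mapping: with 8 query heads and 4 KV
# heads, each KV head is shared by 8 // 4 = 2 query heads, halving the
# KV cache relative to full multi-head attention.

HEADS, KV_HEADS = 8, 4
GROUP = HEADS // KV_HEADS   # query heads per KV head

def kv_head_for(query_head):
    return query_head // GROUP

mapping = [kv_head_for(h) for h in range(HEADS)]
```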
Evaluation
sliding window eval
parameters: {"stride":256,"window":1024}
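Sliding-window evaluation with these settings scores a long sequence in chunks: the window holds up to 1024 tokens of context, advances by 256 tokens, and only the newly uncovered 256 tokens are scored each step, so no token is counted twice. A sketch of the span bookkeeping (the function name is illustrative):

```python
# Sliding-window eval sketch: each span records the context window start
# and the [score_start, score_end) range of freshly scored tokens.

WINDOW, STRIDE = 1024, 256

def eval_spans(n_tokens):
    spans = []          # (window_start, score_start, score_end)
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + STRIDE - WINDOW)   # keep window <= 1024 tokens
        end = min(pos + STRIDE, n_tokens)
        spans.append((start, pos, end))
        pos = end
    return spans
```

Early spans fall back to shorter context until enough tokens exist to fill the window; from then on each scored chunk sees 768 tokens of prior context.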
Other
other
Hedge Mixer: an online ensemble applied at evaluation time that mixes neural, unigram, bigram, trigram, and entropy experts with the Hedge algorithm, updating expert weights using only tokens that have already been scored.
parameters: {"experts":5}
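The Hedge algorithm itself is a multiplicative-weights update; a minimal sketch of how five experts could be mixed and reweighted per token (the learning rate `ETA` and the example probabilities/losses are hypothetical):

```python
# Hedge update sketch: experts are mixed by normalized weight, and after
# each scored token every expert's weight is multiplied by
# exp(-eta * loss), so persistently worse experts fade out online.
import math

ETA = 0.1       # hypothetical Hedge learning rate
N_EXPERTS = 5   # neural, unigram, bigram, trigram, entropy

def mix(probs, weights):
    # Weighted average of the experts' next-token probabilities.
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, probs)) / total

def hedge_update(weights, losses):
    # Multiplicative-weights update from per-expert losses on the
    # token that was just scored.
    return [w * math.exp(-ETA * l) for w, l in zip(weights, losses)]

weights = [1.0] * N_EXPERTS
p = mix([0.4, 0.1, 0.1, 0.1, 0.1], weights)        # prediction first...
weights = hedge_update(weights, [0.9, 2.3, 2.3, 2.3, 2.3])  # ...update after
```

Predicting before updating is what keeps the scheme causal: each token is mixed using only weights learned from tokens already scored.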
LR Schedule
warmdown
parameters: {"warmdown_iters":2000}
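A warmdown schedule holds the learning rate at its peak and then decays it linearly to zero over the final 2000 iterations. A sketch, where the peak LR and total iteration count are hypothetical (the entry specifies only `warmdown_iters`):

```python
# Warmdown schedule sketch: constant LR, then linear decay to zero over
# the last WARMDOWN_ITERS iterations of training.

PEAK_LR = 3e-3          # hypothetical peak learning rate
TOTAL_ITERS = 10000     # hypothetical total training iterations
WARMDOWN_ITERS = 2000

def lr_at(it):
    if it < TOTAL_ITERS - WARMDOWN_ITERS:
        return PEAK_LR
    remaining = TOTAL_ITERS - it
    return PEAK_LR * remaining / WARMDOWN_ITERS
```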
Regularization
gradient clipping
parameters: {"grad_clip_norm":0.3}
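Clipping at a global norm of 0.3 rescales all gradients together whenever their combined norm exceeds the threshold, leaving the update direction unchanged. A minimal sketch on a flat gradient vector:

```python
# Global-norm gradient clipping sketch: if the combined norm of all
# gradients exceeds CLIP, scale every gradient so the global norm
# equals CLIP; otherwise leave gradients untouched.
import math

CLIP = 0.3

def clip_grads(grads):
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > CLIP:
        scale = CLIP / norm
        grads = [g * scale for g in grads]
    return grads
```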
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Compression
zlib
level: null
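The zlib step presumably deflates the serialized weights before the artifact size is measured. A minimal round-trip sketch; the payload is a stand-in and the compression level shown is hypothetical, since the entry leaves it unspecified:

```python
# zlib artifact-compression sketch: deflate a byte buffer (stand-in for
# serialized model weights) and verify it restores losslessly.
import zlib

payload = bytes(range(256)) * 64          # stand-in for serialized weights
compressed = zlib.compress(payload, level=6)   # level is a hypothetical choice
restored = zlib.decompress(compressed)
assert restored == payload                # lossless round trip
```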
Novel Contributions
- Progressive depth / depth recurrence with shared transformer blocks
- Cross-Repeat Skip for stateful recurrent depth
- Value embeddings mixed into the residual stream
- Loop embedding as depth-wise positional encoding
- Hedge Mixer online ensemble at evaluation time
- Sliding-window evaluation with stride 256
- Learning-rate and warmdown tuning