val_bpb: 1.1182
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.93 MB
Training Techniques
Architecture
depth recurrence
Repeats layers 4 and 5 to create 13 virtual layers from 11 physical layers at zero parameter cost.
parameters: {"layers":[4,5],"physical_layers":11,"virtual_layers":13}
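A minimal sketch of how the virtual-layer schedule could be built. The exact replay order (running the repeated block once more immediately after its first pass) is an assumption; the layer indices and counts come from the parameters above.

```python
def depth_recurrence_schedule(physical_layers, repeat_block):
    # Execution order: run layers 0..N-1 in order, replaying the
    # repeat_block once right after its first pass. This turns 11
    # physical layers into 13 virtual ones with no new parameters.
    schedule = []
    for i in range(physical_layers):
        schedule.append(i)
        if i == repeat_block[-1]:
            schedule.extend(repeat_block)
    return schedule

print(depth_recurrence_schedule(11, [4, 5]))
# -> [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]  (13 virtual layers)
```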
BigramHash
Adds a BigramHash module with vocabulary size 2048 and embedding dimension 128.
parameters: {"vocab_size":2048,"dim":128}
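One plausible reading of the module, sketched below: hash each (previous, current) token pair into a small auxiliary vocabulary of 2048 entries and look up an extra 128-dimensional vector. The multiplicative hash and the start-of-sequence id 0 are assumptions, not the submission's actual choices.

```python
import numpy as np

def bigram_hash_embed(tokens, table):
    # Hash each (previous, current) token pair into a small auxiliary
    # vocabulary and look up an extra embedding vector for it.
    vocab_size, dim = table.shape          # (2048, 128) per the parameters
    out = np.zeros((len(tokens), dim))
    prev = 0                               # assumed start-of-sequence id
    for i, tok in enumerate(tokens):
        h = (prev * 1000003 + tok) % vocab_size   # assumed hash function
        out[i] = table[h]
        prev = tok
    return out

rng = np.random.default_rng(0)
table = rng.standard_normal((2048, 128))
emb = bigram_hash_embed([7, 42, 7, 42], table)
```

Identical bigrams map to identical vectors, so positions 1 and 3 above (both preceded by token 7) receive the same embedding.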
XSA
Uses XSA on the last 4 layers.
parameters: {"last_n_layers":4}
Partial RoPE
Applies rotary positional embeddings to 16 of the 64 head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
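A minimal sketch of partial RoPE under the parameters above: only the first 16 of 64 channels are rotated, and the remaining channels pass through with no positional information. The choice of which channels to rotate and the 10000 frequency base are assumptions.

```python
import numpy as np

def partial_rope(x, rot_dims=16):
    # Rotate only the first `rot_dims` channels of each head;
    # the remaining channels carry no positional rotation.
    seq, dim = x.shape
    half = rot_dims // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))
    angles = np.outer(np.arange(seq), freqs)       # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)

x = np.random.default_rng(0).standard_normal((32, 64))
y = partial_rope(x)
```

Position 0 has zero rotation angle, so its vector is unchanged, and the last 48 channels are identical at every position.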
MLP3x
Uses an MLP block with 3x hidden expansion and a squared LeakyReLU activation (negative slope 0.5).
parameters: {"activation":"LeakyReLU(0.5)^2"}
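A sketch of one reading of "LeakyReLU(0.5)^2": apply LeakyReLU with negative slope 0.5, then square elementwise (the same family as the squared-ReLU activation used in fast-training Transformer recipes). Whether the submission restores the sign after squaring is unknown; this version does not.

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # One interpretation of LeakyReLU(0.5)^2: leaky ReLU with
    # negative slope 0.5, squared elementwise.
    return np.where(x > 0, x, slope * x) ** 2

def mlp3x(x, w1, w2):
    # MLP with a 3x hidden expansion instead of the usual 4x.
    return leaky_relu_sq(x @ w1) @ w2

d = 64
rng = np.random.default_rng(0)
w1 = rng.normal(scale=d ** -0.5, size=(d, 3 * d))
w2 = rng.normal(scale=(3 * d) ** -0.5, size=(3 * d, d))
y = mlp3x(rng.standard_normal((8, d)), w1, w2)
```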
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"all_blocks_unfrozen":true}
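The update rule itself is plain SGD with momentum and no weight decay; a minimal sketch using the hyperparameters listed above:

```python
def sgd_momentum_step(param, grad, buf, lr=0.002, momentum=0.9):
    # One SGD step with momentum 0.9 and no weight decay,
    # matching the hyperparameters in other_params above.
    buf = momentum * buf + grad
    return param - lr * buf, buf

p, buf = 1.0, 0.0
p, buf = sgd_momentum_step(p, 0.5, buf)
# p = 1.0 - 0.002 * 0.5 = 0.999, buf = 0.5
```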
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
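A sketch of one way to apply the 1/sqrt(layer+1) scale, assuming it multiplies the LayerNorm output directly (where exactly the scale is applied in the submission is not stated):

```python
import numpy as np

def scaled_layernorm(x, layer_idx, eps=1e-5):
    # Standard LayerNorm whose output is damped by 1/sqrt(layer + 1),
    # so deeper layers contribute progressively smaller residuals.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) / np.sqrt(layer_idx + 1)

x = np.random.default_rng(0).standard_normal((4, 64))
```

Layer 3's output is exactly half of layer 0's for the same input, since sqrt(4)/sqrt(1) = 2.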
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency":50,"description":"tight SWA weight averaging"}
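The two averaging rules can be sketched per-parameter as follows; the EMA decay 0.997 and the every-50-steps SWA frequency come from the parameters above, while everything else is a generic formulation:

```python
def ema_update(avg, param, decay=0.997):
    # Exponential moving average of weights with decay 0.997.
    return decay * avg + (1.0 - decay) * param

def swa_update(avg, param, n_averaged):
    # Running (stochastic weight averaging) mean; in this submission
    # a snapshot is folded in every 50 steps.
    return avg + (param - avg) / (n_averaged + 1)
```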
Compression
lzma
level: null
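Since the compression level is unspecified, a sketch with Python's standard `lzma` module; the `preset=9 | PRESET_EXTREME` setting shown here is an assumption, not the submission's actual choice:

```python
import lzma

# Compress the serialized artifact bytes with LZMA. The payload here
# is a placeholder standing in for the real weight bytes.
payload = b"\x00" * 4096
blob = lzma.compress(payload, preset=9 | lzma.PRESET_EXTREME)
restored = lzma.decompress(blob)
```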
Evaluation
sliding window eval
parameters: {"stride":64}
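A sketch of how the evaluation windows could be laid out: each window scores only its last `stride` = 64 tokens, so every token is evaluated exactly once with up to a full context window of left context. The window length of 2048 is an assumption; only the stride comes from the parameters above.

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    # Yield (context_start, score_start, score_end) triples: each
    # window scores only its last `stride` tokens, giving every
    # token long left context while scoring each token once.
    spans, start = [], 0
    while start < n_tokens:
        end = min(start + stride, n_tokens)
        spans.append((max(0, end - window), start, end))
        start = end
    return spans

spans = sliding_window_spans(200, window=128, stride=64)
```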
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"momentum":0.9,"epochs":3,"chunk_tokens":32768,"all_blocks_unfrozen":true}
LR Schedule
cosine decay
parameters: {"across_chunks":true}
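With decay applied across chunks rather than within each chunk's epochs, the per-chunk learning rate might look like this (base LR 0.002 and the chunk count 1893 come from the submission; decaying fully to zero is an assumption):

```python
import math

def cosine_lr(chunk_idx, total_chunks=1893, base_lr=0.002):
    # Cosine decay of the TTT learning rate from base_lr down to 0
    # across all chunks of the evaluation stream.
    progress = chunk_idx / max(1, total_chunks - 1)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```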
Other
other
Legal score-first test-time training with backward-looking chunk adaptation; each chunk is scored before being trained on, and the last chunk is scored but never trained on.
parameters: {"chunks":1893}
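The score-first loop described above can be sketched as follows: each chunk is scored with the current weights before any gradient step on it, so no chunk's score ever depends on having trained on that chunk, and the final chunk is scored only. The `score_fn`/`train_fn` callables are placeholders for the real model.

```python
def score_first_ttt(chunks, score_fn, train_fn):
    # Legal test-time training: score each chunk BEFORE updating on
    # it; the last chunk is scored but never trained on.
    total = 0.0
    for i, chunk in enumerate(chunks):
        total += score_fn(chunk)        # score with pre-update weights
        if i < len(chunks) - 1:
            train_fn(chunk)             # then adapt on the chunk
    return total

log = []
bits = score_first_ttt(
    ["c0", "c1", "c2"],
    score_fn=lambda c: (log.append(("score", c)), 1.0)[1],
    train_fn=lambda c: log.append(("train", c)),
)
```

Running the toy example confirms the ordering: every score event precedes the train event for the same chunk, and the last chunk has no train event.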
Novel Contributions
- Depth recurrence on layers 4 and 5 to create 13 virtual layers from 11 physical layers with zero parameter cost
- First successful use of depth recurrence on the leaderboard
- Legal score-first SGD test-time training applied on top of the base model
- Combination of depth recurrence with SGD TTT to improve BPB from 1.1208 to 1.1182