PR #686
Record: Depth Recurrence (layers 4 and 5 repeated): val_bpb 1.1182
by msisovic
val_bpb: 1.1182
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.9MB
Training Techniques
Architecture: depth recurrence
Re-executes mid-network layers (4 and 5), each repeated pass gated by its own learnable block scalar, adding virtual depth with almost no growth in model size.
parameters: {"recur_layers":[4,5],"physical_layers":11,"virtual_layers":13}
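A minimal sketch of the idea, assuming an ordered stack of residual blocks; the class name, gating placement, and residual form are illustrative, not the PR's actual code:

```python
import torch
import torch.nn as nn

class DepthRecurrentStack(nn.Module):
    """Re-execute selected blocks a second time, each extra (virtual)
    pass gated by its own learnable scalar, so 11 physical layers can
    act as 13 virtual layers for ~1 extra parameter per repeated layer.
    """
    def __init__(self, blocks, recur_layers):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.recur_layers = set(recur_layers)
        # one independent learnable scalar per repeated pass
        self.scalars = nn.ParameterDict({
            str(i): nn.Parameter(torch.ones(1)) for i in recur_layers
        })

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            x = x + block(x)                              # physical pass
            if i in self.recur_layers:
                x = x + self.scalars[str(i)] * block(x)   # virtual pass
        return x

# 11 physical layers, layers 4 and 5 repeated -> 13 virtual layers
blocks = [nn.Linear(8, 8) for _ in range(11)]
stack = DepthRecurrentStack(blocks, recur_layers=[4, 5])
out = stack(torch.randn(2, 8))
```

Because each virtual pass reuses the physical block's weights, the only new parameters are the two scalars.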
Quantization: int6
bits: 6
scope: all
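A generic sketch of symmetric 6-bit quantization, per-tensor for simplicity; the PR's exact scheme (per-channel vs. per-tensor scales, rounding mode, storage packing) is not recorded here and may differ:

```python
import torch

def quantize_int6(w, eps=1e-12):
    """Symmetric quantization to signed 6-bit codes in [-31, 31].

    Returns the integer codes plus the scale needed to dequantize.
    """
    qmax = 2 ** (6 - 1) - 1                      # 31 for 6 bits
    scale = w.abs().max().clamp_min(eps) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(4, 4)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)
```

With scope "all", every weight tensor in the ~15.9MB artifact would be stored this way, at 6 bits (plus scales) per weight.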
Optimizer: Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_from":0.92,"warmup_steps":1500}
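The warmup parameters suggest Muon's momentum is ramped from 0.92 to the final 0.99 over the first 1500 steps. A sketch, assuming linear interpolation (the PR records only the endpoints and step count):

```python
def muon_momentum(step, warmup_from=0.92, target=0.99, warmup_steps=1500):
    """Warm Muon's momentum linearly from 0.92 to 0.99, then hold it.

    The linear shape is an assumption; only the endpoints are recorded.
    """
    if step >= warmup_steps:
        return target
    return warmup_from + (step / warmup_steps) * (target - warmup_from)
```

The returned value would be written into the optimizer's momentum setting each step before calling it.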
Weight Averaging: SWA
parameters: {"every":50}
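A sketch of stochastic weight averaging with snapshots every 50 steps; the class and update rule are illustrative, and how the averaged weights are swapped in for evaluation is not shown:

```python
import torch

class SWAAverager:
    """Keep a running equal-weight average of parameter snapshots
    taken every `every` optimizer steps (the PR records every=50).
    """
    def __init__(self, every=50):
        self.every = every
        self.n = 0
        self.avg = None

    def maybe_update(self, step, params):
        if step % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.detach().clone() for k, v in params.items()}
        else:
            for k, v in params.items():
                # incremental mean: avg += (x - avg) / n
                self.avg[k] += (v.detach() - self.avg[k]) / self.n

swa = SWAAverager(every=50)
swa.maybe_update(50, {"w": torch.tensor([1.0])})
swa.maybe_update(75, {"w": torch.tensor([9.0])})   # skipped: not a multiple of 50
swa.maybe_update(100, {"w": torch.tensor([3.0])})
```

The incremental-mean form avoids storing all snapshots while giving each an equal weight.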
Evaluation: stride-based eval
parameters: {"stride":64}
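Strided evaluation slides a full-length context window forward 64 tokens at a time and scores only the tokens not covered by the previous window, so each token gets near-maximal left context. A window-planning sketch; pairing ctx_len with the 2048 train length is an assumption:

```python
def stride_eval_windows(n_tokens, ctx_len=2048, stride=64):
    """Plan (begin, end, n_scored) windows so every token is scored
    exactly once, with up to ctx_len - stride tokens of extra context.
    """
    windows, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + ctx_len, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows

windows = stride_eval_windows(300, ctx_len=128, stride=64)
```

Each planned window would be run through the model, accumulating loss only over its last `n_scored` positions.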
Test-Time Training: full TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"freeze_blocks":2,"untie":false}
Sequence Length
train_length: 2048
eval_length: null
LR Schedule: warmdown
parameters: {"warmdown_steps":3500}
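A warmdown schedule holds the base learning rate and then decays it over the final 3500 steps. A sketch assuming a linear decay to zero, the common shape for this schedule name; the PR records only the step count:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Hold base_lr, then decay linearly to 0 over the last
    warmdown_steps of training. The linear shape is an assumption.
    """
    if step < total_steps - warmdown_steps:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```

Each training step would look up its learning rate from this function before the optimizer update.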
Regularization: weight decay
parameters: {"matrix_weight_decay":0.04,"adam_weight_decay":0.04}
Other
Uses independent learnable block scalars for recurrent layer passes.
parameters: {"added_params":"~2K"}
Novel Contributions
- Dual depth recurrence on layers 4 and 5 to create 13 virtual layers from 11 physical layers
- Independent learnable block scalars for repeated layer passes
- Achieves gains close to those of adding independent layers while staying under the artifact size budget
- Confirms tied TTT performs equivalently to untied for recurrent layers