PR #148

open

Depth Recurrence + Cross-Repeat Skip + Sliding Window Eval

by iverbovoy
val_bpb
1.2196
Architecture
Transformer
Optimizer
Muon + Adam
Artifact Size
12.83MB

Training Techniques

Quantization
int8
bits: 8
scope: all
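A minimal sketch of what int8 quantization with scope "all" could look like, assuming a symmetric per-tensor scheme (the card specifies only the bit width and scope; function names are illustrative): each tensor is stored as int8 codes plus a single float scale.

```python
# Hypothetical sketch of symmetric per-tensor int8 quantization applied to
# all weights ("scope: all"): store int8 codes plus one float scale per
# tensor. Scheme details are an assumption; the card states only int8/all.
def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid zero scale
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize_int8(codes, scale):
    # reconstruction error per element is bounded by scale / 2
    return [c * scale for c in codes]
```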
Architecture
depth recurrence
Replaced 9 unique transformer blocks with 3 shared blocks repeated 4 times, creating 12 effective layers.
parameters: {"shared_blocks":3,"repeats":4,"effective_layers":12}
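The parameter sharing above can be sketched as follows (a toy illustration, not the PR's code: parameters exist for only 3 blocks, but each is applied on every one of the 4 repeats, giving 12 effective layers):

```python
# Hypothetical sketch of depth recurrence: 3 shared blocks repeated 4 times.
SHARED_BLOCKS = 3
REPEATS = 4

def recurrent_forward(x, blocks):
    # blocks: the 3 shared blocks, reused on every repeat (weights are tied
    # across depth, so only 3 blocks' worth of parameters are stored)
    for _ in range(REPEATS):
        for block in blocks:
            x = block(x)
    return x

# toy "blocks" that record their index so layer applications can be counted
blocks = [lambda xs, i=i: xs + [i] for i in range(SHARED_BLOCKS)]
trace = recurrent_forward([], blocks)
```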
Cross-Repeat Skip
Adds each block's output from the previous repeat back as a learned-weighted residual, making the recurrence stateful across repeats.
parameters: {"learned_scales":true}
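A minimal sketch of the cross-repeat skip, assuming the scale is a learned scalar per shared block (fixed at 0.5 here for illustration; the wiring is an interpretation of the description above, not the PR's code):

```python
# Hypothetical sketch of the Cross-Repeat Skip: cache each shared block's
# output from the previous repeat and add it back with a per-block scale,
# so state flows across repeats rather than only through the main residual.
SHARED_BLOCKS, REPEATS = 3, 4

def forward_with_skip(x, blocks, scales):
    prev = [None] * SHARED_BLOCKS            # block outputs from last repeat
    for _ in range(REPEATS):
        for i, block in enumerate(blocks):
            x = block(x)
            if prev[i] is not None:
                x = x + scales[i] * prev[i]  # cross-repeat skip connection
            prev[i] = x
    return x

# identity blocks on a scalar state make the skip's effect easy to trace
out = forward_with_skip(1.0, [lambda v: v] * SHARED_BLOCKS, [0.5] * SHARED_BLOCKS)
```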
Value Embeddings
Adds 2 extra embedding tables mixed into the residual stream at each effective layer.
parameters: {"tables":2}
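One way the two tables could be mixed in, as a hedged sketch (alternating tables across depth and a plain additive mix are illustrative assumptions; the card states only that 2 tables feed the residual stream at each effective layer):

```python
# Hypothetical sketch of value embeddings: 2 extra token-embedding tables
# whose lookups are added into the residual stream at each effective layer.
NUM_TABLES = 2

def mix_value_embedding(x, token_id, tables, layer_idx):
    table = tables[layer_idx % NUM_TABLES]  # alternate the 2 tables by depth
    return x + table[token_id]              # mix into the residual stream

# toy scalar "tables": table 0 contributes 1.0, table 1 contributes 10.0
tables = [{7: 1.0}, {7: 10.0}]
x = 0.0
for layer_idx in range(4):  # four effective layers, for illustration
    x = mix_value_embedding(x, token_id=7, tables=tables, layer_idx=layer_idx)
```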
Loop Embedding
Adds a learned per-layer vector before each block as depth-wise positional encoding.
parameters: null
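A toy sketch of the loop embedding (scalar "vectors" and an identity block, purely for illustration): one learned vector per effective layer is added before the block runs, so the tied weights can still condition on their position in depth.

```python
# Hypothetical sketch of the loop embedding: a learned per-layer vector is
# added to the hidden state before each (shared) block as a depth-wise
# positional encoding.
EFFECTIVE_LAYERS = 12

def apply_block(x, block, loop_emb, layer_idx):
    return block(x + loop_emb[layer_idx])   # depth-wise positional encoding

# toy scalar embeddings (the layer index itself) with an identity block
loop_emb = list(range(EFFECTIVE_LAYERS))
x = 0.0
for layer_idx in range(EFFECTIVE_LAYERS):
    x = apply_block(x, lambda v: v, loop_emb, layer_idx)
```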
KV head count
Uses 8 query heads with 4 KV heads (grouped-query attention: each KV head is shared by 2 query heads).
parameters: {"heads":8,"kv_heads":4}
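The head layout implied by these counts can be sketched as a simple grouping rule (standard grouped-query attention indexing; the helper name is illustrative):

```python
# Sketch of the grouped-query head layout: 8 query heads share 4 KV heads,
# so each KV head's key/value cache serves a group of 2 query heads.
HEADS, KV_HEADS = 8, 4
GROUP_SIZE = HEADS // KV_HEADS  # 2 query heads per KV head

def kv_head_for(query_head):
    # consecutive query heads map to the same KV head
    return query_head // GROUP_SIZE
```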
Evaluation
sliding window eval
parameters: {"window":1024,"stride":256}
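A hedged sketch of how a window of 1024 with stride 256 is typically walked over a token stream (the standard sliding-window perplexity pattern; whether the PR scores exactly these spans is an assumption): each step scores only the newly covered tokens, so every token after the first window is predicted with long context.

```python
# Hypothetical sketch of sliding-window evaluation: slide a 1024-token
# window by 256 and score only the tokens not covered by the previous
# window, so no token is double-counted in the loss.
WINDOW, STRIDE = 1024, 256

def eval_spans(n_tokens):
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, STRIDE):
        end = min(begin + WINDOW, n_tokens)
        trg_len = end - prev_end           # only the newly covered tokens
        spans.append((begin, end, trg_len))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```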
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
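A minimal sketch of a warmdown schedule with these parameters, assuming the common constant-then-linear-decay shape (the card gives only `warmdown_iters`; the decay form is an assumption):

```python
# Hypothetical sketch of the warmdown schedule: hold the base LR constant,
# then decay it linearly to zero over the final 3000 iterations.
WARMDOWN_ITERS = 3000

def lr_scale(step, total_steps):
    steps_left = total_steps - step
    if steps_left >= WARMDOWN_ITERS:
        return 1.0                        # constant phase
    return steps_left / WARMDOWN_ITERS    # linear warmdown to zero
```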
Optimizer
Muon + Adam
weight_decay: null
momentum: null
other_params: {"matrix_lr":0.012,"scalar_lr":0.012,"tied_embed_lr":0.015,"grad_clip_norm":0.3}

Novel Contributions

  • Depth recurrence via shared transformer blocks repeated across depth
  • Cross-Repeat Skip for stateful recurrence across repeats
  • Value Embeddings mixed into the residual stream
  • Loop Embedding as depth-wise positional encoding
  • Sliding window evaluation with stride 256
  • Lower learning rate tuned for the gradient amplification induced by recurrent (weight-shared) depth