PR #148

open

Depth Recurrence + Cross-Repeat Skip + Sliding Window Eval

by iverbovoy
val_bpb
1.2196
Architecture
Transformer
Optimizer
Muon + Adam
Artifact Size
12.83MB

Training Techniques

Quantization
int8
bits: 8
scope: all
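A minimal sketch of what int8 quantization with scope "all" could look like, assuming a symmetric per-tensor scheme (the card specifies only the bit width and scope; function names are illustrative): each tensor is stored as int8 codes plus a single float scale.

```python
# Hypothetical sketch of symmetric per-tensor int8 quantization applied to
# all weights ("scope: all"): store int8 codes plus one float scale per
# tensor. Scheme details are an assumption; the card states only int8/all.
def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid zero scale
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize_int8(codes, scale):
    # reconstruction error per element is bounded by scale / 2
    return [c * scale for c in codes]
```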
Architecture
depth recurrence
Replaced 9 unique transformer blocks with 3 shared blocks repeated 4 times, creating 12 effective layers.
parameters: {"shared_blocks":3,"repeats":4,"effective_layers":12}
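The parameter sharing above can be sketched as follows (a toy illustration, not the PR's code: parameters exist for only 3 blocks, but each is applied on every one of the 4 repeats, giving 12 effective layers):

```python
# Hypothetical sketch of depth recurrence: 3 shared blocks repeated 4 times.
SHARED_BLOCKS = 3
REPEATS = 4

def recurrent_forward(x, blocks):
    # blocks: the 3 shared blocks, reused on every repeat (weights are tied
    # across depth, so only 3 blocks' worth of parameters are stored)
    for _ in range(REPEATS):
        for block in blocks:
            x = block(x)
    return x

# toy "blocks" that record their index so layer applications can be counted
blocks = [lambda xs, i=i: xs + [i] for i in range(SHARED_BLOCKS)]
trace = recurrent_forward([], blocks)
```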
Cross-Repeat Skip
Adds each block's output from the previous repeat back as a learned-weighted residual, making the recurrence stateful across repeats.
parameters: {"learned_scales":true}
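A minimal sketch of the cross-repeat skip, assuming the scale is a learned scalar per shared block (fixed at 0.5 here for illustration; the wiring is an interpretation of the description above, not the PR's code):

```python
# Hypothetical sketch of the Cross-Repeat Skip: cache each shared block's
# output from the previous repeat and add it back with a per-block scale,
# so state flows across repeats rather than only through the main residual.
SHARED_BLOCKS, REPEATS = 3, 4

def forward_with_skip(x, blocks, scales):
    prev = [None] * SHARED_BLOCKS            # block outputs from last repeat
    for _ in range(REPEATS):
        for i, block in enumerate(blocks):
            x = block(x)
            if prev[i] is not None:
                x = x + scales[i] * prev[i]  # cross-repeat skip connection
            prev[i] = x
    return x

# identity blocks on a scalar state make the skip's effect easy to trace
out = forward_with_skip(1.0, [lambda v: v] * SHARED_BLOCKS, [0.5] * SHARED_BLOCKS)
```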
Value Embeddings
Adds 2 extra embedding tables mixed into the residual stream at each effective layer.
parameters: {"tables":2}
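One way the two tables could be mixed in, as a hedged sketch (alternating tables across depth and a plain additive mix are illustrative assumptions; the card states only that 2 tables feed the residual stream at each effective layer):

```python
# Hypothetical sketch of value embeddings: 2 extra token-embedding tables
# whose lookups are added into the residual stream at each effective layer.
NUM_TABLES = 2

def mix_value_embedding(x, token_id, tables, layer_idx):
    table = tables[layer_idx % NUM_TABLES]  # alternate the 2 tables by depth
    return x + table[token_id]              # mix into the residual stream

# toy scalar "tables": table 0 contributes 1.0, table 1 contributes 10.0
tables = [{7: 1.0}, {7: 10.0}]
x = 0.0
for layer_idx in range(4):  # four effective layers, for illustration
    x = mix_value_embedding(x, token_id=7, tables=tables, layer_idx=layer_idx)
```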
Loop Embedding
Adds a learned per-layer vector before each block as depth-wise positional encoding.
parameters: null
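A toy sketch of the loop embedding (scalar "vectors" and an identity block, purely for illustration): one learned vector per effective layer is added before the block runs, so the tied weights can still condition on their position in depth.

```python
# Hypothetical sketch of the loop embedding: a learned per-layer vector is
# added to the hidden state before each (shared) block as a depth-wise
# positional encoding.
EFFECTIVE_LAYERS = 12

def apply_block(x, block, loop_emb, layer_idx):
    return block(x + loop_emb[layer_idx])   # depth-wise positional encoding

# toy scalar embeddings (the layer index itself) with an identity block
loop_emb = list(range(EFFECTIVE_LAYERS))
x = 0.0
for layer_idx in range(EFFECTIVE_LAYERS):
    x = apply_block(x, lambda v: v, loop_emb, layer_idx)
```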
KV head count
Uses 8 query heads with 4 KV heads (grouped-query attention: each KV head is shared by 2 query heads).
parameters: {"heads":8,"kv_heads":4}
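The head layout implied by these counts can be sketched as a simple grouping rule (standard grouped-query attention indexing; the helper name is illustrative):

```python
# Sketch of the grouped-query head layout: 8 query heads share 4 KV heads,
# so each KV head's key/value cache serves a group of 2 query heads.
HEADS, KV_HEADS = 8, 4
GROUP_SIZE = HEADS // KV_HEADS  # 2 query heads per KV head

def kv_head_for(query_head):
    # consecutive query heads map to the same KV head
    return query_head // GROUP_SIZE
```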
Evaluation
sliding window eval
parameters: {"window":1024,"stride":256}
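A hedged sketch of how a window of 1024 with stride 256 is typically walked over a token stream (the standard sliding-window perplexity pattern; whether the PR scores exactly these spans is an assumption): each step scores only the newly covered tokens, so every token after the first window is predicted with long context.

```python
# Hypothetical sketch of sliding-window evaluation: slide a 1024-token
# window by 256 and score only the tokens not covered by the previous
# window, so no token is double-counted in the loss.
WINDOW, STRIDE = 1024, 256

def eval_spans(n_tokens):
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, STRIDE):
        end = min(begin + WINDOW, n_tokens)
        trg_len = end - prev_end           # only the newly covered tokens
        spans.append((begin, end, trg_len))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```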
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
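A minimal sketch of a warmdown schedule with these parameters, assuming the common constant-then-linear-decay shape (the card gives only `warmdown_iters`; the decay form is an assumption):

```python
# Hypothetical sketch of the warmdown schedule: hold the base LR constant,
# then decay it linearly to zero over the final 3000 iterations.
WARMDOWN_ITERS = 3000

def lr_scale(step, total_steps):
    steps_left = total_steps - step
    if steps_left >= WARMDOWN_ITERS:
        return 1.0                        # constant phase
    return steps_left / WARMDOWN_ITERS    # linear warmdown to zero
```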
Optimizer
Muon + Adam
weight_decay: null
momentum: null
other_params: {"matrix_lr":0.012,"scalar_lr":0.012,"tied_embed_lr":0.015,"grad_clip_norm":0.3}

Novel Contributions

  • Depth recurrence via shared transformer blocks repeated across depth
  • Cross-Repeat Skip for stateful recurrence across repeats
  • Value Embeddings mixed into the residual stream
  • Loop Embedding as depth-wise positional encoding
  • Sliding window evaluation with stride 256
  • Lower learning rate tuned for the gradient amplification induced by recurrent (weight-shared) depth