PR #784 (open)

Non-record: Depth Recurrence + XSA + LeakyReLU² (val_bpb 1.2065)

by iverbovoy
val_bpb: 1.2065
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.87 MB

Training Techniques

Architecture
depth recurrence
Replaces unique per-layer blocks with a small set of shared blocks repeated across depth; effective depth is blocks × repeats.
parameters: {"blocks":3,"repeats":4,"effective_layers":12,"dim":832}
Cross-Repeat Skip
Adds a learned weighted residual from the previous repeat to make depth recurrence stateful.
parameters: {"repeat_scales":"learned per-repeat"}
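A minimal sketch of depth recurrence with Cross-Repeat Skip, assuming the skip is a per-repeat learned residual around each full pass through the shared blocks; the exact wiring is not pinned down by this summary, and all names and toy blocks below are illustrative:

```python
import numpy as np

def recurrent_forward(x, blocks, repeats, repeat_scales):
    """Depth recurrence: the same `blocks` are applied `repeats` times,
    giving len(blocks) * repeats effective layers. Cross-Repeat Skip adds
    a learned per-repeat weighted residual from the previous repeat's
    output, making the weight-shared recurrence stateful."""
    for r in range(repeats):
        y = x
        for block in blocks:          # weights shared across all repeats
            y = block(y)
        x = y + repeat_scales[r] * x  # cross-repeat skip (assumed wiring)
    return x

# Toy usage matching the PR's shape: 3 shared blocks x 4 repeats = 12 layers.
rng = np.random.default_rng(0)
dim = 8
Ws = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(3)]
blocks = [lambda h, W=W: h + np.tanh(h @ W) for W in Ws]  # toy residual blocks
x = rng.standard_normal((4, dim))
y = recurrent_forward(x, blocks, repeats=4, repeat_scales=np.full(4, 0.1))
```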
XSA
Exclusive Self-Attention applied to the last 4 effective layers.
parameters: {"layers":4}
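The summary does not define XSA beyond its name, so the sketch below takes one plausible reading: causal attention in which each token is additionally *excluded* from attending to itself (the diagonal is masked). Treat this as a hypothetical interpretation, not the PR's actual kernel:

```python
import numpy as np

def exclusive_self_attention(q, k, v):
    """Single-head attention where position t attends only to strictly
    earlier positions (self excluded on top of the causal mask). The
    first token has nothing left to attend to, so its output is zero."""
    T, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    out = np.zeros_like(v)
    for t in range(1, T):
        s = scores[t, :t]             # keys 0..t-1 only: self is excluded
        w = np.exp(s - s.max())
        w /= w.sum()
        out[t] = w @ v[:t]
    return out

rng = np.random.default_rng(1)
q, k, v = (rng.standard_normal((5, 4)) for _ in range(3))
att = exclusive_self_attention(q, k, v)
```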
Value Embeddings
Adds extra embedding tables mixed into the residual stream at each effective layer.
parameters: {"tables":2}
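A sketch of value embeddings as extra token-indexed tables added to the residual stream. The table-to-layer assignment and the gating below are assumptions (the PR only states that 2 tables are mixed in at each effective layer); shapes are toy-sized:

```python
import numpy as np

vocab, dim, n_tables = 256, 16, 2     # illustrative sizes; tables=2 per the PR
rng = np.random.default_rng(2)
value_embs = [rng.standard_normal((vocab, dim)) * 0.02 for _ in range(n_tables)]
gates = np.full(n_tables, 0.5)        # learned mixing weights (assumed scalar)

def mix_value_embeddings(h, tokens, layer):
    """Add a token-indexed extra embedding into the residual stream `h`.
    A simple assumed scheme alternates the two tables across layers."""
    i = layer % n_tables
    return h + gates[i] * value_embs[i][tokens]

tokens = rng.integers(0, vocab, size=8)
h = np.zeros((8, dim))
h = mix_value_embeddings(h, tokens, layer=0)
```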
Loop Embedding
Adds a learned per-layer vector before each block as depth-wise positional encoding.
parameters: null
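A loop embedding can be sketched as one learned vector per effective layer, added to the hidden state before the shared block runs, so the block knows which depth it is currently computing. Shapes below are illustrative:

```python
import numpy as np

effective_layers, dim = 12, 16        # 12 effective layers as in this PR
rng = np.random.default_rng(3)
loop_emb = rng.standard_normal((effective_layers, dim)) * 0.02  # learned

def with_loop_embedding(h, layer_idx):
    """Depth-wise positional encoding: since block weights are reused
    across repeats, this tells the block which effective layer it is."""
    return h + loop_emb[layer_idx]

h = np.zeros((4, dim))
h0 = with_loop_embedding(h, 0)
```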
LeakyReLU^2
Uses LeakyReLU(0.5)^2 in place of ReLU^2.
parameters: {"negative_slope":0.5}
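Taking the literal reading of the activation (LeakyReLU with slope 0.5, then squared; whether the sign is restored after squaring is not stated in this summary):

```python
import numpy as np

def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU(0.5) followed by squaring, replacing relu(x)**2.
    Negative inputs now contribute 0.25 * x**2 instead of zero."""
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y

out = leaky_relu_sq(np.array([-2.0, -1.0, 0.0, 1.0, 2.0]))
# → [1.0, 0.25, 0.0, 1.0, 4.0]
```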
Quantization
GPTQ-lite
bits: 8
scope: all
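The exact GPTQ-lite procedure is not spelled out here; as a stand-in, the sketch below does per-row symmetric 8-bit round-to-nearest quantization, trying five clip percentiles and keeping the one with the lowest reconstruction error (echoing the "best-of-5 clip percentiles" mentioned under Novel Contributions). All details are assumptions:

```python
import numpy as np

def quantize_best_clip(w, bits=8, percentiles=(99.0, 99.5, 99.9, 99.99, 100.0)):
    """Symmetric round-to-nearest quantization per output row, searching
    over clip percentiles for the lowest mean-squared reconstruction error."""
    qmax = 2 ** (bits - 1) - 1
    best, best_err = None, np.inf
    for p in percentiles:
        clip = np.percentile(np.abs(w), p, axis=1, keepdims=True)
        clip = np.maximum(clip, 1e-8)
        q = np.clip(np.round(w / clip * qmax), -qmax, qmax)
        deq = q * clip / qmax                 # dequantize to measure error
        err = np.square(w - deq).mean()
        if err < best_err:
            best, best_err = (q.astype(np.int8), clip), err
    return best, best_err

rng = np.random.default_rng(4)
w = rng.standard_normal((16, 32))
(q, scale), err = quantize_best_clip(w)
```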
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
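For orientation, a compact sketch of a Muon-style step: momentum on the raw gradient, then orthogonalize the buffered update with a Newton-Schulz iteration before applying it with decoupled weight decay (0.04 as listed above). The cubic iteration, learning rate, and momentum value below are assumptions; Muon's reference implementation uses a tuned quintic iteration, which this does not reproduce:

```python
import numpy as np

def newton_schulz_orth(g, steps=5):
    """Approximately orthogonalize a matrix via the cubic Newton-Schulz
    iteration, after Frobenius normalization so singular values are <= 1."""
    x = g / (np.linalg.norm(g) + 1e-7)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x.T if transposed else x

def muon_step(w, g, buf, lr=0.02, momentum=0.95, weight_decay=0.04):
    """One Muon-style update (lr and momentum are illustrative values)."""
    buf = momentum * buf + g
    update = newton_schulz_orth(buf)
    w = w * (1 - lr * weight_decay) - lr * update
    return w, buf

rng = np.random.default_rng(5)
g = rng.standard_normal((8, 8))
o = newton_schulz_orth(g)
w, buf = muon_step(np.zeros((8, 8)), g, np.zeros((8, 8)))
```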
Weight Averaging
SWA
parameters: null
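SWA with unspecified parameters can be sketched as a running mean over checkpoints collected late in training (when and how often checkpoints are sampled here is not stated):

```python
import numpy as np

class SWA:
    """Stochastic Weight Averaging: maintain an incremental mean of
    parameter snapshots; the averaged weights are used for evaluation."""
    def __init__(self):
        self.avg, self.n = None, 0

    def update(self, params):
        self.n += 1
        if self.avg is None:
            self.avg = [p.copy() for p in params]
        else:
            for a, p in zip(self.avg, params):
                a += (p - a) / self.n   # incremental running mean

swa = SWA()
for step in range(3):
    swa.update([np.full((2, 2), float(step))])
# swa.avg[0] is now the mean of 0, 1, 2 everywhere, i.e. 1.0
```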
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":256,"window":1024}
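With window 1024 and stride 256, sliding-window evaluation scores the sequence in overlapping windows but counts loss only on the tokens not yet scored, so every token is evaluated exactly once with up to a full window of context. A sketch of the span bookkeeping (the scoring itself is model-specific and omitted):

```python
def sliding_window_spans(n_tokens, window=1024, stride=256):
    """Return (context_start, score_from, score_to) triples: each window
    sees context [context_start, score_to) and contributes loss only on
    [score_from, score_to), tiling the sequence without overlap."""
    spans, covered, start = [], 0, 0
    while covered < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, covered, end))
        covered = end
        start += stride
    return spans

spans = sliding_window_spans(2500)   # e.g. a 2500-token eval document
```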
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
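A warmdown schedule holds the learning rate flat and then decays it linearly to zero over the final warmdown_iters steps (3000 here). Any warmup phase is omitted since the summary lists only the warmdown; total_steps and base_lr below are illustrative:

```python
def warmdown_lr(step, total_steps, warmdown_iters=3000, base_lr=1.0):
    """Constant learning rate until the last `warmdown_iters` steps,
    then a linear ramp down to zero at `total_steps`."""
    steps_left = total_steps - step
    if steps_left >= warmdown_iters:
        return base_lr
    return base_lr * steps_left / warmdown_iters

lrs = [warmdown_lr(s, total_steps=10000) for s in (0, 7000, 8500, 10000)]
# → [1.0, 1.0, 0.5, 0.0]
```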
Regularization
weight decay
parameters: {"weight_decay":0.04}

Novel Contributions

  • Depth recurrence with Cross-Repeat Skip to turn stateless weight sharing into stateful recurrence
  • Exclusive Self-Attention on the last 4 effective layers
  • LeakyReLU(0.5)^2 activation replacing ReLU^2
  • Value Embeddings mixed into the residual stream
  • Loop Embedding as depth-wise positional encoding
  • GPTQ-lite post-training quantization with best-of-5 clip percentiles
  • zstd-22 compression and SWA for artifact optimization