PR #1783

open

[record] val_bpb=1.1716 — DEQ Universal Transformer + Seed-LoRA + Mixture of Depths

by ismailntl

val_bpb

1.1716

Architecture

Transformer

Optimizer

—

Artifact Size

—

Training Techniques

Architecture

depth recurrence

Layers 3-5 are applied recurrently for 4 virtual passes, increasing effective depth without adding physical layers.

parameters: {"layers":3,"passes":4}
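As a rough illustration of the looping described above, the sketch below lists the order in which physical layers would be applied. The total stack size of 8 and the 0-based indexing are assumptions made for the example; only the looped range 3-5 and the 4 passes come from the entry.

```python
def layer_schedule(num_layers, loop_start, loop_end, passes):
    """Return the order in which physical layer indices are applied."""
    before = list(range(0, loop_start))
    looped = list(range(loop_start, loop_end + 1))
    after = list(range(loop_end + 1, num_layers))
    # The looped group is replayed `passes` times with the same weights.
    return before + looped * passes + after

# num_layers=8 is hypothetical; the looped range and pass count follow the entry.
schedule = layer_schedule(num_layers=8, loop_start=3, loop_end=5, passes=4)
print(schedule)
print(len(schedule), "layer applications from 8 physical layers")
```

Under these assumptions the 3 looped layers account for 12 of the 17 layer applications while contributing only 3 layers of parameters.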

weight tying

Tied embeddings are used to share parameters between input and output embeddings.

parameters: null
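A minimal NumPy sketch of tied embeddings, assuming the usual arrangement where one matrix serves as the input lookup table and, transposed, as the output projection. The sizes and the names `embed` and `unembed` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 100, 16                      # hypothetical sizes
E = rng.normal(0.0, 0.02, size=(vocab_size, d_model))  # the single shared matrix

def embed(token_ids):
    return E[token_ids]                            # input side: row lookup

def unembed(hidden):
    return hidden @ E.T                            # output side: same matrix, transposed

tokens = np.array([3, 14, 15])
logits = unembed(embed(tokens))                    # shape (3, vocab_size)
```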

LeakyReLU

Uses LeakyReLU(0.5)^2 in the MLP.

parameters: {"slope":0.5}
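Reading LeakyReLU(0.5)^2 as "apply LeakyReLU with negative slope 0.5, then square the result", a scalar sketch looks like this. The function name is hypothetical.

```python
def leaky_relu_squared(x, slope=0.5):
    # LeakyReLU: identity for x >= 0, scaled by `slope` otherwise.
    y = x if x >= 0.0 else slope * x
    # Squaring makes the output non-negative on both sides of zero.
    return y * y

print(leaky_relu_squared(2.0))    # positive branch: 2.0 squared
print(leaky_relu_squared(-2.0))   # negative branch: (0.5 * -2.0) squared
```

Unlike ReLU^2, negative inputs still produce a nonzero output and gradient, at a quarter of the magnitude when the slope is 0.5.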

Partial RoPE

Applies rotary position embeddings to only part of the head dimensions.

parameters: {"dimensions":"16/64"}
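The sketch below shows one way to rotate 16 of 64 head dimensions and pass the other 48 through unchanged. Which 16 dimensions are rotated (here the first 16), the frequency base of 10000, and the "rotate-half" pairing are all assumptions, since the entry only gives the 16/64 split.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """x: (seq_len, head_dim). Rotate the first rot_dims dims, keep the rest as-is."""
    seq_len, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)               # (half,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]                  # paired halves
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)

x = np.random.default_rng(0).normal(size=(5, 64))
y = partial_rope(x)
```

The untouched dimensions carry no positional signal, so they can act as purely content-based channels in the attention dot product.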

GQA

Uses grouped-query attention: 8 query heads share 4 KV heads, so each KV head serves 2 query heads.

parameters: {"heads":8,"kv_heads":4}
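A small NumPy sketch of grouped-query attention with the head counts above. The head dimension of 64, the sequence length, the causal mask, and the mapping of consecutive query heads to the same KV head are assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gqa(q, k, v):
    """q: (n_heads, T, d); k, v: (n_kv_heads, T, d). Causal attention."""
    n_heads, T, d = q.shape
    group = n_heads // k.shape[0]                   # query heads per KV head
    k = np.repeat(k, group, axis=0)                 # expand to (n_heads, T, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # (n_heads, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)      # hide future positions
    return softmax(scores) @ v                      # (n_heads, T, d)

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 6, 64))
k = rng.normal(size=(4, 6, 64))
v = rng.normal(size=(4, 6, 64))
out = gqa(q, k, v)
```

Only the 4 KV heads need to be stored or cached, which halves the K and V projection parameters relative to 8 full heads.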

U-Net skip connections

Uses sigmoid-gated skip connections between layers.

parameters: null
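The entry does not say how the gates are parameterized. The scalar sketch below assumes one learnable gate logit per skip, a first-half/second-half split of the stack, and last-in-first-out pairing of saved activations; all three are illustrative choices, as are the names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def unet_forward(x, layers, gate_logits):
    """layers: list of callables; first half saves activations, second half consumes them."""
    half = len(layers) // 2
    saved = []
    for layer in layers[:half]:
        x = layer(x)
        saved.append(x)
    for layer, g in zip(layers[half:], gate_logits):
        # The most recently saved activation feeds the first layer of the second half.
        x = x + sigmoid(g) * saved.pop()
        x = layer(x)
    return x

layers = [lambda h: h + 1.0] * 4            # toy stand-ins for transformer blocks
half_open = unet_forward(0.0, layers, [0.0, 0.0])      # sigmoid(0) = 0.5
nearly_closed = unet_forward(0.0, layers, [-50.0, -50.0])
```

With strongly negative gate logits the skips are effectively switched off and the stack behaves like a plain sequential model.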

DEQ Universal Transformer

A single physical transformer block is iterated to a fixed point using Anderson acceleration and phantom gradients.

parameters: {"history_window":5,"unrolled_steps":4}
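To make the fixed-point idea concrete, here is a hedged NumPy sketch of Anderson acceleration with a history window of 5, applied to a toy contractive map standing in for the transformer block. The solver name, tolerance, regularization, and toy block are assumptions; the phantom-gradient backward pass and the `unrolled_steps` setting are not modeled.

```python
import numpy as np

def anderson_solve(f, x0, history_window=5, max_iter=100, tol=1e-8, reg=1e-8):
    """Find x with f(x) = x by mixing the last few iterates (Anderson acceleration)."""
    xs, fs = [], []
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if np.linalg.norm(fx - x) <= tol:
            return fx
        xs.append(x)
        fs.append(fx)
        xs, fs = xs[-history_window:], fs[-history_window:]
        F = np.stack(fs, axis=1)                  # (d, n) past outputs
        G = F - np.stack(xs, axis=1)              # (d, n) residuals f(x_i) - x_i
        n = G.shape[1]
        H = G.T @ G
        H += reg * np.trace(H) * np.eye(n)        # relative Tikhonov regularization
        alpha = np.linalg.solve(H, np.ones(n))
        alpha /= alpha.sum()                      # least-squares weights summing to 1
        x = F @ alpha                             # next iterate: weighted past outputs
    return x

# Toy "block": a contractive map, used here only as a stand-in.
rng = np.random.default_rng(0)
d = 16
W = rng.normal(size=(d, d)) * (0.2 / np.sqrt(d))
u = rng.normal(size=d)
block = lambda z: np.tanh(W @ z + u)
z_star = anderson_solve(block, np.zeros(d))
```

The weights minimize the norm of the combined residual over the window, which typically reaches the fixed point in fewer block evaluations than plain iteration.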

Seed-LoRA

Random linear maps are generated from seeds at runtime and only LoRA adapters are stored.

parameters: {"adapter_params":440000}
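The following sketch illustrates the storage trade-off: the dense base map is regenerated from an integer seed on every call, and only the low-rank adapters live in the artifact. The class name, the initialization, the rank, and the layer sizes are hypothetical and do not reproduce the 440000-parameter figure above.

```python
import numpy as np

def seeded_matrix(seed, d_out, d_in):
    """Rebuild the frozen random map from its seed; only the seed is stored."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)

class SeedLoRALinear:
    def __init__(self, seed, d_out, d_in, rank):
        self.seed, self.d_out, self.d_in = seed, d_out, d_in
        init = np.random.default_rng(seed + 1)
        self.A = init.standard_normal((rank, d_in)) * 0.01   # stored, trainable
        self.B = np.zeros((d_out, rank))                      # stored, trainable

    def __call__(self, x):
        W0 = seeded_matrix(self.seed, self.d_out, self.d_in)  # regenerated at runtime
        return x @ W0.T + (x @ self.A.T) @ self.B.T           # base map + low-rank update

    def stored_params(self):
        return self.A.size + self.B.size

layer = SeedLoRALinear(seed=1234, d_out=64, d_in=64, rank=4)
x = np.ones((2, 64))
y = layer(x)
print(layer.stored_params(), "stored vs", 64 * 64, "dense")
```

A real artifact would need a generator whose output is bit-identical across machines and library versions, otherwise the adapters would be applied on top of a different base map than the one they were trained against.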

Mixture of Depths

Routes only a subset of tokens through full attention and MLP while others take identity residuals.

parameters: {"capacity":0.5}
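A minimal routing sketch with capacity 0.5, assuming a linear router and top-k selection over the sequence. Scaling the block output by a sigmoid of the router score is a common way to let the router receive gradient, but it is an assumption here, as are the function and variable names.

```python
import numpy as np

def mod_block(x, router_w, block, capacity=0.5):
    """x: (T, d). Send the top `capacity` fraction of tokens through `block`."""
    T = x.shape[0]
    k = max(1, int(capacity * T))
    scores = x @ router_w                          # (T,) router logits
    chosen = np.sort(np.argsort(-scores)[:k])      # top-k token indices, in sequence order
    out = x.copy()                                 # unrouted tokens: identity residual
    gate = 1.0 / (1.0 + np.exp(-scores[chosen]))
    out[chosen] = x[chosen] + gate[:, None] * block(x[chosen])
    return out, chosen

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
router_w = rng.normal(size=4)
toy_block = lambda h: np.ones_like(h)              # stand-in for attention + MLP
out, chosen = mod_block(x, router_w, toy_block)
```

The block only ever sees `k` tokens, so its attention and MLP cost scales with the capacity rather than the full sequence length.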

Regularization

logit softcap

parameters: {"value":30}
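The entry gives only the cap value. A common form of logit softcapping, assumed here, squashes each logit with a scaled tanh so that it stays within plus or minus the cap while remaining close to the identity for small values.

```python
import math

def softcap(logit, cap=30.0):
    # Smoothly bounds the logit to [-cap, cap]; near zero it is almost the identity.
    return cap * math.tanh(logit / cap)

for value in (0.0, 1.0, 30.0, 1000.0):
    print(value, softcap(value))
```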

Evaluation

sliding window eval

parameters: null
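Sliding window evaluation usually means every token is scored once, with as much preceding context as the window allows. The sketch below generates such windows; the window length of 256 and stride of 64 are hypothetical, since the entry lists no parameters.

```python
def sliding_windows(num_tokens, window, stride):
    """Yield (ctx_start, score_start, end): the model reads tokens[ctx_start:end],
    but only tokens[score_start:end] count toward the loss."""
    score_start = 0
    while score_start < num_tokens:
        step = window if score_start == 0 else stride
        end = min(score_start + step, num_tokens)
        ctx_start = max(0, end - window)
        yield ctx_start, score_start, end
        score_start = end

windows = list(sliding_windows(num_tokens=1000, window=256, stride=64))
print(windows[:3])
```

A smaller stride gives each scored token more context on average at the price of more forward passes.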

Test-Time Training

full TTT

parameters: {"chunk_size":24576,"epochs_per_chunk":4,"restricted_to_recurrent_layers":true}
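The control flow implied by these parameters can be sketched as a loop over chunks, with several adaptation epochs per chunk. Scoring each chunk before training on it is an assumption (it keeps the reported loss from seeing its own targets), and `score_fn` and `train_fn` are hypothetical callbacks standing in for the model's loss and an optimizer step limited to the recurrent layers.

```python
def ttt_eval(tokens, chunk_size, epochs_per_chunk, score_fn, train_fn):
    """Score each chunk with the current weights, then adapt on that chunk."""
    total_loss, total_tokens = 0.0, 0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        total_loss += score_fn(chunk)        # scored before any update on this chunk
        total_tokens += len(chunk)
        for _ in range(epochs_per_chunk):
            train_fn(chunk)                  # would update only the recurrent layers
    return total_loss / total_tokens

# Toy run with tiny sizes; the entry uses chunk_size=24576 and 4 epochs per chunk.
log = []
avg = ttt_eval(list(range(10)), chunk_size=4, epochs_per_chunk=2,
               score_fn=lambda c: log.append("score") or float(len(c)),
               train_fn=lambda c: log.append("train"))
```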

Quantization

GPTQ

bits: 6

scope: block weights
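To show what a 6-bit weight grid looks like, the sketch below uses plain symmetric round-to-nearest quantization with one scale per row. This is deliberately not GPTQ: GPTQ additionally uses second-order information from calibration data to compensate rounding error column by column, which is omitted here. The function names and per-row scaling are assumptions.

```python
import numpy as np

def quantize_rtn(W, bits=6):
    """Symmetric per-row round-to-nearest quantization (a baseline, not GPTQ)."""
    qmax = 2 ** (bits - 1) - 1                        # 31 for 6 bits
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero rows
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale

W = np.random.default_rng(0).normal(size=(4, 32))
q, scale = quantize_rtn(W)
W_hat = dequantize(q, scale)
print("max abs error:", np.abs(W_hat - W).max())
```

Each weight then takes one of 64 integer levels, and the per-element reconstruction error is at most half a quantization step.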

Compression

Brotli

level: 11

Novel Contributions

DEQ Universal Transformer with fixed-point iteration and Anderson acceleration
Seed-LoRA using runtime-generated random linear maps with stored adapters only
Mixture of Depths token routing for compute-efficient training
4-loop depth recurrence with early parallel residuals and selective TTT
GPTQ int6/int8 quantization with Brotli-11 artifact compression