PR #1783
open[record] val_bpb=1.1716 — DEQ Universal Transformer + Seed-LoRA + Mixture of Depths
by ismailntl
val_bpb
1.1716
Architecture
Transformer
Optimizer
—
Artifact Size
—
Training Techniques
Architecture
depth recurrence
Layers 3-5 (three physical layers) are looped for 4 virtual passes, giving greater effective depth than the physical stack alone.
parameters: {"layers":3,"passes":4}
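As a rough sketch of what looping layers 3-5 means for execution order, the helper below expands a physical stack into the sequence of layer indices visited. The name `layer_schedule`, the 0-based indexing, and the total of 8 physical layers are assumptions for illustration; the entry only states the looped range and the pass count.

```python
def layer_schedule(num_layers, loop_start, loop_end, passes):
    """Return the physical layer index applied at each virtual step."""
    prefix = list(range(0, loop_start))               # layers run once before the loop
    looped = list(range(loop_start, loop_end + 1))    # layers reused on every pass
    suffix = list(range(loop_end + 1, num_layers))    # layers run once after the loop
    return prefix + looped * passes + suffix


# Hypothetical 8-layer stack; only the 3-5 range and 4 passes come from the entry.
schedule = layer_schedule(num_layers=8, loop_start=3, loop_end=5, passes=4)
print(len(schedule), schedule)
```

Under these assumptions the 8 physical layers are applied 17 times in total, with each looped layer contributing 4 applications.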
weight tying
Tied embeddings are used to share parameters between input and output embeddings.
parameters: null
LeakyReLU
Uses a squared LeakyReLU activation with negative slope 0.5, written LeakyReLU(0.5)^2, in the MLP.
parameters: {"slope":0.5}
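A minimal scalar sketch of the activation, assuming the square is applied after the leaky branch (the function name is illustrative):

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU with the given negative slope, followed by squaring."""
    y = x if x >= 0.0 else slope * x
    return y * y


print(leaky_relu_squared(2.0), leaky_relu_squared(-2.0))
```

Because of the square, negative inputs still produce a positive output, only attenuated by the squared slope.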
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":"16/64"}
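The sketch below rotates only the first 16 of 64 head dimensions and passes the rest through unchanged. The half-split pairing of dimensions and the base of 10000 are assumptions borrowed from common RoPE implementations, not details stated in the entry.

```python
import math


def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Apply a rotary embedding to the first `rot_dims` entries of one head vector."""
    half = rot_dims // 2
    out = list(vec)  # dimensions beyond rot_dims are copied through untouched
    for i in range(half):
        theta = pos * base ** (-2.0 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]  # assumed pairing: dim i with dim i + half
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


head = [float(i + 1) for i in range(64)]
rotated = partial_rope(head, pos=7)
```

Each pair is rotated by a plain 2-D rotation, so the vector norm is preserved while the last 48 dimensions carry no positional signal.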
GQA
Uses grouped-query attention with 8 query heads sharing 4 KV heads (2 query heads per KV head).
parameters: {"heads":8,"kv_heads":4}
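To illustrate the head sharing, the snippet below maps each query head to the KV head it would read from. It assumes contiguous grouping, which is the usual convention but is not confirmed by the entry.

```python
def kv_head_for(q_head, n_heads=8, n_kv_heads=4):
    """Index of the KV head shared by query head `q_head` (contiguous groups assumed)."""
    group_size = n_heads // n_kv_heads
    return q_head // group_size


mapping = [kv_head_for(h) for h in range(8)]
print(mapping)
```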
U-Net skip connections
Uses sigmoid-gated skip connections between layers.
parameters: null
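One plausible reading of sigmoid-gated skips, sketched on scalars: the first half of the stack saves its outputs, and the second half adds them back in reverse order, each scaled by a learned gate. The last-in-first-out pairing, the additive form, and the names used here are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def unet_forward(x, layers, gate_logits):
    """Run `layers` with gated skips from the first half into the second half."""
    half = len(layers) // 2
    skips = []
    for layer in layers[:half]:
        x = layer(x)
        skips.append(x)  # saved for the mirrored layer in the second half
    for layer, gate in zip(layers[half:], gate_logits):
        x = x + sigmoid(gate) * skips.pop()  # most recent skip is consumed first
        x = layer(x)
    return x


# Toy stack: four "layers" that each add 1, with both gates at sigmoid(0) = 0.5.
out = unet_forward(0.0, [lambda v: v + 1.0] * 4, gate_logits=[0.0, 0.0])
```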
DEQ Universal Transformer
A single physical transformer block is iterated to a fixed point using Anderson acceleration and phantom gradients.
parameters: {"history_window":5,"unrolled_steps":4}
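The forward solve can be sketched as Anderson-accelerated fixed-point iteration with a history window of 5. This is a generic pure-Python version under stated assumptions: `g` stands in for one application of the shared block, the small ridge term is added for numerical stability, and the phantom-gradient backward pass with its 4 unrolled steps is not shown.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - rest) / M[r][r]
    return x


def anderson_solve(g, x0, window=5, tol=1e-10, max_iter=200):
    """Iterate x <- g(x) to a fixed point, mixing up to `window` past steps."""
    xs = [list(x0)]
    gs = [g(xs[0])]
    for k in range(max_iter):
        res = [[gv - xv for gv, xv in zip(G, X)] for X, G in zip(xs, gs)]
        f = res[-1]
        if math.sqrt(dot(f, f)) < tol:
            return xs[-1], k
        d_res = [[a - b for a, b in zip(res[i + 1], res[i])] for i in range(len(res) - 1)]
        d_g = [[a - b for a, b in zip(gs[i + 1], gs[i])] for i in range(len(gs) - 1)]
        x_new = list(gs[-1])  # plain fixed-point step, used as-is when no history exists
        if d_res:
            n = len(d_res)
            ridge = 1e-10 * sum(dot(d, d) for d in d_res) / n + 1e-30
            A = [[dot(d_res[i], d_res[j]) + (ridge if i == j else 0.0)
                  for j in range(n)] for i in range(n)]
            gamma = solve(A, [dot(d, f) for d in d_res])  # regularized least squares
            for coef, d in zip(gamma, d_g):
                x_new = [xv - coef * dv for xv, dv in zip(x_new, d)]
        xs = (xs + [x_new])[-(window + 1):]
        gs = (gs + [g(x_new)])[-(window + 1):]
    return xs[-1], max_iter


# Toy contraction standing in for the transformer block (illustrative only).
J = [[0.5, 0.2, 0.0], [0.1, 0.4, 0.1], [0.0, 0.3, 0.5]]
bias = [1.0, -2.0, 0.5]


def g(x):
    return [dot(row, x) + b for row, b in zip(J, bias)]


z, iters = anderson_solve(g, [0.0, 0.0, 0.0])
```

The returned `z` satisfies `g(z) ≈ z`, which is the property a DEQ layer treats as its output.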
Seed-LoRA
Random linear maps are generated from seeds at runtime and only LoRA adapters are stored.
parameters: {"adapter_params":440000}
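A minimal sketch of the idea, assuming the frozen base weight is regenerated from an integer seed on each forward pass and only the low-rank factors `A` and `B` are persisted. The class name, the Gaussian initialization, and the tiny dimensions are illustrative, not taken from the PR.

```python
import random


def seeded_matrix(seed, rows, cols):
    """Regenerate the same pseudo-random matrix from `seed` on every call."""
    rng = random.Random(seed)
    scale = cols ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class SeedLoRALinear:
    def __init__(self, seed, d_in, d_out, rank):
        self.seed, self.d_in, self.d_out = seed, d_in, d_out
        # Only these two low-rank factors would be saved in the artifact.
        self.A = seeded_matrix(seed + 1, rank, d_in)
        self.B = [[0.0] * rank for _ in range(d_out)]  # zero init: adapter starts as a no-op

    def stored_params(self):
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)

    def forward(self, x):
        base = seeded_matrix(self.seed, self.d_out, self.d_in)  # rebuilt, never stored
        low_rank = matvec(self.B, matvec(self.A, x))
        return [u + v for u, v in zip(matvec(base, x), low_rank)]


layer = SeedLoRALinear(seed=1234, d_in=8, d_out=4, rank=2)
x = [1.0] * 8
y = layer.forward(x)
```

In this toy layer the dense weight has 32 entries while only 24 adapter values are stored; the saving grows quickly as the matrix dimensions increase relative to the rank.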
Mixture of Depths
Routes only a subset of tokens through full attention and MLP while others take identity residuals.
parameters: {"capacity":0.5}
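The routing step can be sketched as top-k selection by router score, with capacity 0.5 meaning half of the tokens are processed. Tokens are scalars here for brevity, and the function and argument names are hypothetical; weighting the block output by the router score, as some Mixture-of-Depths variants do, is omitted.

```python
def mixture_of_depths(tokens, scores, block, capacity=0.5):
    """Apply `block` residually to the top-scoring tokens; pass the rest through."""
    k = max(1, int(len(tokens) * capacity))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [t + block(t) if i in chosen else t for i, t in enumerate(tokens)]


routed = mixture_of_depths(
    tokens=[1.0, 2.0, 3.0, 4.0],
    scores=[0.1, 0.9, 0.8, 0.2],
    block=lambda t: 10.0,  # stand-in for attention + MLP
)
print(routed)
```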
Regularization
logit softcap
parameters: {"value":30}
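The entry gives only the cap value. A common form of logit softcapping, assumed here, squashes logits smoothly into (-30, 30) with a scaled tanh:

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly bound a logit to the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)


print(softcap(1.0), softcap(1000.0))
```

Small logits pass through almost unchanged, while very large ones saturate near the cap.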
Evaluation
sliding window eval
parameters: null
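No window or stride is listed, so the values below are placeholders. The sketch shows the usual bookkeeping for sliding-window evaluation: overlapping windows supply context, but each token is scored exactly once.

```python
def sliding_windows(n_tokens, window, stride):
    """Return (start, end, score_from) spans; tokens in [score_from, end) are scored."""
    spans = []
    scored_upto = 0
    start = 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_upto))  # earlier tokens serve as context only
        scored_upto = end
        start += stride
    return spans


spans = sliding_windows(n_tokens=10, window=4, stride=2)
print(spans)
```

The helper assumes `0 < stride <= window`; a smaller stride gives each scored token more left context at the cost of more forward passes.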
Test-Time Training
full TTT
parameters: {"chunk_size":24576,"epochs_per_chunk":4,"restricted_to_recurrent_layers":true}
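The listed parameters suggest a loop like the one below: walk the evaluation stream in chunks of 24576 tokens, run 4 update epochs per chunk, and touch only parameters in the recurrent layers. Scoring each chunk before adapting on it, the `recurrent.` name prefix, and the callback signatures are assumptions made for this sketch.

```python
def run_ttt(tokens, params, score, update, chunk_size=24576, epochs=4,
            trainable=lambda name: name.startswith("recurrent.")):
    """Score each chunk, then adapt the trainable subset of `params` on it."""
    total_loss = 0.0
    names = [n for n in params if trainable(n)]  # restrict updates to recurrent layers
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        total_loss += score(params, chunk)  # assumed order: score before adapting
        for _ in range(epochs):
            update(params, chunk, names)
    return total_loss


# Toy stand-ins: the "loss" is the chunk length and each update adds 1 to a weight.
params = {"embed.w": 0.0, "recurrent.w": 0.0}


def toy_score(p, chunk):
    return float(len(chunk))


def toy_update(p, chunk, names):
    for n in names:
        p[n] += 1.0


total = run_ttt(list(range(10)), params, toy_score, toy_update, chunk_size=4)
```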
Quantization
GPTQ
bits: 6
scope: block weights
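For a sense of the 6-bit grid, here is plain round-to-nearest symmetric quantization of one weight row. This is deliberately not GPTQ, which additionally uses second-order information to compensate rounding error; the per-row scale and the function names are assumptions.

```python
def quantize_int6(row):
    """Round-to-nearest symmetric quantization onto the signed 6-bit range [-32, 31]."""
    peak = max(abs(v) for v in row)
    scale = peak / 31.0 if peak > 0.0 else 1.0
    codes = [max(-32, min(31, round(v / scale))) for v in row]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


row = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int6(row)
restored = dequantize(codes, scale)
```

The reconstruction error of each weight is bounded by half a quantization step, `scale / 2`.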
Compression
Brotli
level: 11
Novel Contributions
- DEQ Universal Transformer with fixed-point iteration and Anderson acceleration
- Seed-LoRA using runtime-generated random linear maps with stored adapters only
- Mixture of Depths token routing for compute-efficient training
- 4-loop depth recurrence with early parallel residuals and selective TTT
- GPTQ int6/int8 quantization with Brotli-11 artifact compression