PR #2090

open

[Non-record] Depth Recur + Randomized Linear Maps :: Parameter-efficient Repartitioning for Shared Modules

by SPTholeView on GitHub
val_bpb
1.2310
Architecture
Transformer
Optimizer
Muon
Artifact Size
5.07 MB

Training Techniques

Architecture
depth recurrence
Reuses 2 physical transformer blocks across 11 virtual layers in a U-Net-style encoder-decoder architecture with shared blocks.
parameters: {"virtual_layers":11,"physical_blocks":2}
U-Net skip connections
Uses encoder-decoder skip connections with learned skip gates in the shared-block transformer.
parameters: {"encoder_blocks":1,"decoder_blocks":1}
weight tying
Shares physical transformer blocks across multiple virtual depths, including ALBERT-style single-block reuse in the ablation.
parameters: {"shared_blocks":2}
Gated Attention
PRISM-Adapt blends original and permuted Q/K activations with a learned gate.
parameters: null
GQA
Uses grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
RoPE
Applies rotary positional embeddings; repartition permutations are designed to be RoPE-safe.
parameters: null
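The depth-recurrence and skip-gate entries above can be sketched together. This is a minimal illustration, not the PR's implementation: the card only states 2 physical blocks, 11 virtual layers, and learned skip gates, so the exact schedule (5 encoder reuses, 1 middle, 5 decoder reuses) and the fixed gate value here are assumptions.

```python
import numpy as np

# Hypothetical sketch: 2 physical blocks reused over 11 virtual layers in a
# U-Net pattern (one block for the encoder half, one for the decoder half,
# with gated skip connections). Schedule and gate value are assumptions.

rng = np.random.default_rng(0)
d = 16  # illustrative model width

def make_block():
    w = rng.standard_normal((d, d)) * 0.05
    return lambda x: x + np.tanh(x @ w)   # residual-block stand-in

encoder_block, decoder_block = make_block(), make_block()  # the 2 physical blocks

def forward(x, n_virtual=11):
    half = n_virtual // 2                 # 5 encoder + 1 middle + 5 decoder = 11
    skips = []
    for _ in range(half):                 # encoder: reuse one block
        x = encoder_block(x)
        skips.append(x)
    x = encoder_block(x)                  # middle virtual layer
    for i in range(half):                 # decoder: reuse the other block
        gate = 0.5                        # learned skip gate (placeholder value)
        x = decoder_block(x + gate * skips[-1 - i])
    return x

x = rng.standard_normal((4, d))
y = forward(x)
print(y.shape)  # (4, 16)
```

Only two weight matrices exist regardless of virtual depth, which is where the parameter saving comes from.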
Quantization
GPTQ
bits: 6
scope: weights
int8
bits: 8
scope: embeddings
Compression
Brotli
level: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adamw_used_for":"scalars/embeds"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
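The EMA entry uses a decay of 0.9965; a quick sketch of what that decay implies (variable names are illustrative):

```python
# Minimal sketch of weight EMA with the reported decay of 0.9965: each
# parameter's running average moves a fraction (1 - decay) toward the
# current value per step.
decay = 0.9965

def ema_update(ema_param, param, decay=decay):
    return decay * ema_param + (1.0 - decay) * param

# Effective averaging horizon is roughly 1 / (1 - decay) ~ 286 steps.
horizon = 1.0 / (1.0 - decay)

ema = 1.0
for _ in range(100):
    ema = ema_update(ema, 0.0)  # parameter jumps to 0; EMA decays toward it
print(ema, horizon)
```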
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Other
other
Per-virtual-layer Q/K activation repartitioning via deterministic channel permutations after projection, with a learned identity/permutation blend in PRISM-Adapt.
parameters: {"virtual_layers":11}
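The repartitioning entry above can be sketched as follows. "RoPE-safe" is taken to mean that channels are permuted in rotary (even, odd) pairs so each 2-channel rotation unit moves as a whole, with the same permutation applied to Q and K; that reading, plus all names and the gate value, are assumptions (the PR reports the learned gate collapsing to a static 50/50 mix).

```python
import numpy as np

# Hypothetical sketch of per-virtual-layer, RoPE-safe Q/K repartitioning
# with a learned identity/permutation blend. Deterministic: the permutation
# is seeded by the virtual-layer index.

rng = np.random.default_rng(42)
head_dim = 8                          # must be even for rotary pairs
n_pairs = head_dim // 2

def rope_safe_perm(virtual_layer):
    # Permute rotary pairs, keeping each (even, odd) channel pair together.
    pair_perm = np.random.default_rng(virtual_layer).permutation(n_pairs)
    chan = np.empty(head_dim, dtype=int)
    chan[0::2] = 2 * pair_perm        # "cos" channel of each pair
    chan[1::2] = 2 * pair_perm + 1    # its "sin" channel stays with it
    return chan

def repartition(q, k, virtual_layer, gate=0.5):
    # Blend original and permuted activations; same permutation for Q and K.
    perm = rope_safe_perm(virtual_layer)
    q = (1.0 - gate) * q + gate * q[..., perm]
    k = (1.0 - gate) * k + gate * k[..., perm]
    return q, k

q = rng.standard_normal((2, head_dim))
k = rng.standard_normal((2, head_dim))
q2, k2 = repartition(q, k, virtual_layer=3)
print(q2.shape, k2.shape)
```

With gate = 0 this reduces to the identity; with gate = 1 it is a pure channel permutation.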

Novel Contributions

  • Depth recurrence with a 2-block shared U-Net transformer replacing 11 independent layers
  • Per-layer virtual diagonal adapters to recover depth specialization in shared modules
  • RoPE-safe Q/K activation repartitioning using deterministic per-virtual-layer channel permutations
  • Learned identity/permutation blending for Q/K activations that collapses to a static 50/50 mix
  • Ablation study showing PRISM-WO is the best shared-weight control and that raw random permutations hurt performance
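The saving claimed in the first bullet can be sanity-checked with quick arithmetic, ignoring the small unshared pieces (per-virtual-layer diagonal adapters and skip gates):

```python
# Back-of-envelope block-parameter saving from sharing 2 physical blocks
# across 11 virtual layers. The 5.5x ratio is independent of model width
# and excludes the small per-layer adapters and gates, which are not shared.
virtual_layers = 11
physical_blocks = 2
saving = virtual_layers / physical_blocks
print(saving)  # 5.5
```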