PR #2090

open

[Non-record] Depth Recur + Randomized Linear Maps :: Parameter-efficient Repartitioning for Shared Modules

by SPTholeView on GitHub
val_bpb
1.2310
Architecture
Transformer
Optimizer
Muon
Artifact Size
5.07 MB

Training Techniques

Architecture
depth recurrence
Reuses 2 physical transformer blocks across 11 virtual layers in a U-Net-style encoder-decoder architecture with shared blocks.
parameters: {"virtual_layers":11,"physical_blocks":2}
U-Net skip connections
Uses encoder-decoder skip connections with learned skip gates in the shared-block transformer.
parameters: {"encoder_blocks":1,"decoder_blocks":1}
weight tying
Shares physical transformer blocks across multiple virtual depths, including ALBERT-style single-block reuse in the ablation.
parameters: {"shared_blocks":2}
Gated Attention
PRISM-Adapt blends original and permuted Q/K activations with a learned gate.
parameters: null
GQA
Uses grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
RoPE
Applies rotary positional embeddings; repartition permutations are designed to be RoPE-safe.
parameters: null
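The depth-recurrence and skip-gate entries above can be sketched together. This is a minimal illustration, not the PR's implementation: the card only states 2 physical blocks, 11 virtual layers, and learned skip gates, so the exact schedule (5 encoder reuses, 1 middle, 5 decoder reuses) and the fixed gate value here are assumptions.

```python
import numpy as np

# Hypothetical sketch: 2 physical blocks reused over 11 virtual layers in a
# U-Net pattern (one block for the encoder half, one for the decoder half,
# with gated skip connections). Schedule and gate value are assumptions.

rng = np.random.default_rng(0)
d = 16  # illustrative model width

def make_block():
    w = rng.standard_normal((d, d)) * 0.05
    return lambda x: x + np.tanh(x @ w)   # residual-block stand-in

encoder_block, decoder_block = make_block(), make_block()  # the 2 physical blocks

def forward(x, n_virtual=11):
    half = n_virtual // 2                 # 5 encoder + 1 middle + 5 decoder = 11
    skips = []
    for _ in range(half):                 # encoder: reuse one block
        x = encoder_block(x)
        skips.append(x)
    x = encoder_block(x)                  # middle virtual layer
    for i in range(half):                 # decoder: reuse the other block
        gate = 0.5                        # learned skip gate (placeholder value)
        x = decoder_block(x + gate * skips[-1 - i])
    return x

x = rng.standard_normal((4, d))
y = forward(x)
print(y.shape)  # (4, 16)
```

Only two weight matrices exist regardless of virtual depth, which is where the parameter saving comes from.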
Quantization
GPTQ
bits: 6
scope: weights
int8
bits: 8
scope: embeddings
Compression
Brotli
level: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adamw_used_for":"scalars/embeds"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
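The EMA entry uses a decay of 0.9965; a quick sketch of what that decay implies (variable names are illustrative):

```python
# Minimal sketch of weight EMA with the reported decay of 0.9965: each
# parameter's running average moves a fraction (1 - decay) toward the
# current value per step.
decay = 0.9965

def ema_update(ema_param, param, decay=decay):
    return decay * ema_param + (1.0 - decay) * param

# Effective averaging horizon is roughly 1 / (1 - decay) ~ 286 steps.
horizon = 1.0 / (1.0 - decay)

ema = 1.0
for _ in range(100):
    ema = ema_update(ema, 0.0)  # parameter jumps to 0; EMA decays toward it
print(ema, horizon)
```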
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Other
other
Per-virtual-layer Q/K activation repartitioning via deterministic channel permutations after projection, with a learned identity/permutation blend in PRISM-Adapt.
parameters: {"virtual_layers":11}
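The repartitioning entry above can be sketched as follows. "RoPE-safe" is taken to mean that channels are permuted in rotary (even, odd) pairs so each 2-channel rotation unit moves as a whole, with the same permutation applied to Q and K; that reading, plus all names and the gate value, are assumptions (the PR reports the learned gate collapsing to a static 50/50 mix).

```python
import numpy as np

# Hypothetical sketch of per-virtual-layer, RoPE-safe Q/K repartitioning
# with a learned identity/permutation blend. Deterministic: the permutation
# is seeded by the virtual-layer index.

rng = np.random.default_rng(42)
head_dim = 8                          # must be even for rotary pairs
n_pairs = head_dim // 2

def rope_safe_perm(virtual_layer):
    # Permute rotary pairs, keeping each (even, odd) channel pair together.
    pair_perm = np.random.default_rng(virtual_layer).permutation(n_pairs)
    chan = np.empty(head_dim, dtype=int)
    chan[0::2] = 2 * pair_perm        # "cos" channel of each pair
    chan[1::2] = 2 * pair_perm + 1    # its "sin" channel stays with it
    return chan

def repartition(q, k, virtual_layer, gate=0.5):
    # Blend original and permuted activations; same permutation for Q and K.
    perm = rope_safe_perm(virtual_layer)
    q = (1.0 - gate) * q + gate * q[..., perm]
    k = (1.0 - gate) * k + gate * k[..., perm]
    return q, k

q = rng.standard_normal((2, head_dim))
k = rng.standard_normal((2, head_dim))
q2, k2 = repartition(q, k, virtual_layer=3)
print(q2.shape, k2.shape)
```

With gate = 0 this reduces to the identity; with gate = 1 it is a pure channel permutation.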

Novel Contributions

  • Depth recurrence with a 2-block shared U-Net transformer replacing 11 independent layers
  • Per-layer virtual diagonal adapters to recover depth specialization in shared modules
  • RoPE-safe Q/K activation repartitioning using deterministic per-virtual-layer channel permutations
  • Learned identity/permutation blending for Q/K activations that collapses to a static 50/50 mix
  • Ablation study showing PRISM-WO is the best shared-weight control and that raw random permutations hurt performance
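The saving claimed in the first bullet can be sanity-checked with quick arithmetic, ignoring the small unshared pieces (per-virtual-layer diagonal adapters and skip gates):

```python
# Back-of-envelope block-parameter saving from sharing 2 physical blocks
# across 11 virtual layers. The 5.5x ratio is independent of model width
# and excludes the small per-layer adapters and gates, which are not shared.
virtual_layers = 11
physical_blocks = 2
saving = virtual_layers / physical_blocks
print(saving)  # 5.5
```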