PR #1546

open

[non record] Investigating the Tied-Embedding Bottleneck: Why Boundary Blocks Underperform and What It Means for 16MB Models

by SPTholeView on GitHub
val_bpb: 1.0850
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16.44 MB

Training Techniques

Architecture
weight tying
Input embedding and output classifier weights are tied; the submission investigates decoupling this bottleneck.
parameters: {"embed_dim":416,"model_dim":512}
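A minimal numpy sketch of the tied-embedding setup and the decoupling idea described in this entry, using the listed embed_dim=416 and model_dim=512; the projection names `W_in`/`W_out` and all shapes beyond those two dimensions are illustrative assumptions, not the submission's code:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, embed_dim, model_dim = 1000, 416, 512

# Tied table: one [vocab, embed_dim] matrix serves both the input
# lookup and the output classifier logits.
E = rng.normal(0, 0.02, (vocab, embed_dim))

# Hypothetical decoupling: small learned projections bridge
# embed_dim <-> model_dim, so the boundary blocks work in a basis
# the tied table does not directly constrain.
W_in = rng.normal(0, 0.02, (embed_dim, model_dim))   # input-side projection
W_out = rng.normal(0, 0.02, (model_dim, embed_dim))  # output-side projection

tokens = rng.integers(0, vocab, size=8)
h = E[tokens] @ W_in            # [8, model_dim] enters the first block
# ... transformer blocks would run here ...
logits = (h @ W_out) @ E.T      # tied classifier reuses E for output logits

print(h.shape, logits.shape)    # (8, 512) (8, 1000)
```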
depth recurrence
Recurrent reuse of transformer blocks to create virtual layers.
parameters: {"layers":3,"num_loops":2}
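The depth-recurrence entry (3 physical layers looped twice, giving 6 virtual layers) can be sketched as below; the residual-MLP `block` is a toy stand-in for a real transformer block:

```python
import numpy as np

rng = np.random.default_rng(0)
d, layers, num_loops = 64, 3, 2

# Three physical blocks; plain residual MLPs stand in for transformer blocks.
weights = [rng.normal(0, 0.02, (d, d)) for _ in range(layers)]

def block(h, W):
    return h + np.tanh(h @ W)  # toy residual block

h = rng.normal(size=(4, d))
applied = 0
# Depth recurrence: reuse the same 3 blocks num_loops times, i.e.
# layers * num_loops = 6 virtual layers at the parameter cost of 3.
for _ in range(num_loops):
    for W in weights:
        h = block(h, W)
        applied += 1

print(h.shape, applied)  # (4, 64) 6
```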
U-Net skip connections
Sigmoid-gated skip connections between encoder and decoder halves.
parameters: null
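A rough sketch of sigmoid-gated U-Net skips between encoder and decoder halves; since parameters are null above, the gate granularity (one learned scalar per skip) and the toy residual blocks are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_blocks = 64, 6  # encoder half = blocks 0-2, decoder half = 3-5

def block(h, W):
    return h + np.tanh(h @ W)  # toy residual block

Ws = [rng.normal(0, 0.02, (d, d)) for _ in range(n_blocks)]
gates = rng.normal(size=n_blocks // 2)  # one learned scalar gate per skip (assumed)

h = rng.normal(size=(4, d))
saved = []
for i in range(n_blocks // 2):            # encoder half: save activations
    h = block(h, Ws[i])
    saved.append(h)
for i in range(n_blocks // 2, n_blocks):  # decoder half: gated skip from the
    skip = saved.pop()                    # mirror block (i <-> n_blocks-1-i)
    g = 1.0 / (1.0 + np.exp(-gates[i - n_blocks // 2]))  # sigmoid gate
    h = block(h + g * skip, Ws[i])

print(h.shape)  # (4, 64)
```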
LeakyReLU
Uses squared LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
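One plausible reading of the squared-LeakyReLU activation with slope 0.5 is an elementwise square applied after LeakyReLU (cf. the squared ReLU used in some speedrun baselines); the exact formulation is not given above, so this is an assumption:

```python
import numpy as np

def squared_leaky_relu(x, slope=0.5):
    # LeakyReLU, then square elementwise. Note the square makes the
    # output nonnegative while keeping the slope-scaled magnitude
    # on the negative side.
    y = np.where(x >= 0, x, slope * x)
    return y * y

print(squared_leaky_relu(np.array([-2.0, 0.0, 2.0])))  # [1. 0. 4.]
```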
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":"16/64"}
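Partial RoPE with the listed 16/64 split might look like the sketch below: only the first 16 head dimensions are rotated and the remaining 48 pass through unchanged. The half-split pairing convention and the base frequency are assumptions:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    # x: [seq, head_dim]; rotate only the first rot_dims dimensions
    # and leave the remaining head_dim - rot_dims untouched ("16/64").
    seq, head_dim = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    pos = np.arange(seq)[:, None]            # [seq, 1]
    angles = pos * inv_freq[None, :]         # [seq, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]  # paired dims (i, i+half)
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

x = np.random.default_rng(0).normal(size=(5, 64))
y = partial_rope(x)
print(y.shape)                            # (5, 64)
print(np.allclose(y[:, 16:], x[:, 16:]))  # True: tail dims pass through
```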
GQA
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
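A minimal GQA sketch for the listed 8 query heads and 4 KV heads, where each KV head is shared by 2 query heads; causal masking is omitted for brevity:

```python
import numpy as np

def gqa_attention(q, k, v, kv_heads=4):
    # q: [heads, seq, hd]; k, v: [kv_heads, seq, hd].
    # Each KV head serves heads // kv_heads query heads.
    heads, seq, hd = q.shape
    group = heads // kv_heads
    k = np.repeat(k, group, axis=0)  # broadcast each KV head across its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)
    w = np.exp(scores - scores.max(-1, keepdims=True))  # stable softmax
    w /= w.sum(-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 64))
k = rng.normal(size=(4, 5, 64))
v = rng.normal(size=(4, 5, 64))
out = gqa_attention(q, k, v)
print(out.shape)  # (8, 5, 64)
```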
parallel residual routing
Two-lane parallel residual routing extended to later blocks.
parameters: {"parallel_start":7}
Regularization
logit softcap
parameters: {"value":30}
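Logit softcapping with value 30 is typically a tanh squash (as in, e.g., Gemma-style soft-capping); a sketch under that assumption:

```python
import numpy as np

def softcap(logits, cap=30.0):
    # Smoothly bounds logits into (-cap, cap); cap=30 matches the
    # "value: 30" parameter above. Small logits pass through almost
    # unchanged, large ones saturate.
    return cap * np.tanh(logits / cap)

x = np.array([1.0, 100.0, -500.0])
print(softcap(x))  # ~[1.00, 29.92, -30.00]
```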
layerwise LN scale
parameters: null
Quantization
GPTQ
bits: 6
scope: all
Compression
brotli
level: null
Evaluation
sliding window eval
parameters: null
Test-Time Training
score-first TTT
parameters: {"freeze_blocks":2}
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: null

Novel Contributions

  • Weight analysis showing a symmetric V-shaped capacity profile with weak boundary blocks
  • Identification of tied embeddings as a bottleneck for first and last transformer blocks
  • Embedding decoupling experiments that activate boundary blocks via learned projection layers
  • Observation that the learned projections are near-orthogonal rotations rather than true dimensional compression
  • Analysis of rate-distortion tradeoffs showing tighter clipping can worsen artifact size
  • Proposal of residual low-rank projection as a cheaper basis-rotation alternative
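The near-orthogonality observation above can be checked with a simple scale-invariant diagnostic: for a wide map, orthonormal rows mean W Wᵀ ≈ c·I (a basis rotation/embedding), while a rank-deficient map collapses directions. The function name and test matrices below are illustrative, not from the submission:

```python
import numpy as np

def orthogonality_error(W):
    # W: [d_in, d_out] with d_in <= d_out. Returns a scale-invariant
    # deviation of W @ W.T from the identity; near zero means the map
    # is close to an orthogonal rotation rather than a compression.
    d_in = W.shape[0]
    G = W @ W.T
    G = G / np.mean(np.diag(G))  # ignore a global scale factor
    return np.linalg.norm(G - np.eye(d_in)) / np.sqrt(d_in)

rng = np.random.default_rng(0)
# An exactly orthogonal 416x512 map (orthonormal rows via QR) ...
Q, _ = np.linalg.qr(rng.normal(size=(512, 416)))
W_rotation = Q.T
# ... versus a rank-8 map that genuinely compresses dimensions.
W_compress = rng.normal(size=(416, 8)) @ rng.normal(size=(8, 512))

print(orthogonality_error(W_rotation))        # ~0
print(orthogonality_error(W_compress) > 1.0)  # True
```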