PR #1546

open

[non record] Investigating the Tied-Embedding Bottleneck: Why Boundary Blocks Underperform and What It Means for 16MB Models

by SPTholeView on GitHub
val_bpb: 1.0850
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16.44 MB

Training Techniques

Architecture
weight tying
Input embedding and output classifier weights are tied; the submission investigates decoupling this bottleneck.
parameters: {"embed_dim":416,"model_dim":512}
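A minimal numpy sketch of the tied-embedding setup and the decoupling idea described in this entry, using the listed embed_dim=416 and model_dim=512; the projection names `W_in`/`W_out` and all shapes beyond those two dimensions are illustrative assumptions, not the submission's code:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, embed_dim, model_dim = 1000, 416, 512

# Tied table: one [vocab, embed_dim] matrix serves both the input
# lookup and the output classifier logits.
E = rng.normal(0, 0.02, (vocab, embed_dim))

# Hypothetical decoupling: small learned projections bridge
# embed_dim <-> model_dim, so the boundary blocks work in a basis
# the tied table does not directly constrain.
W_in = rng.normal(0, 0.02, (embed_dim, model_dim))   # input-side projection
W_out = rng.normal(0, 0.02, (model_dim, embed_dim))  # output-side projection

tokens = rng.integers(0, vocab, size=8)
h = E[tokens] @ W_in            # [8, model_dim] enters the first block
# ... transformer blocks would run here ...
logits = (h @ W_out) @ E.T      # tied classifier reuses E for output logits

print(h.shape, logits.shape)    # (8, 512) (8, 1000)
```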
depth recurrence
Recurrent reuse of transformer blocks to create virtual layers.
parameters: {"layers":3,"num_loops":2}
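The depth-recurrence entry (3 physical layers looped twice, giving 6 virtual layers) can be sketched as below; the residual-MLP `block` is a toy stand-in for a real transformer block:

```python
import numpy as np

rng = np.random.default_rng(0)
d, layers, num_loops = 64, 3, 2

# Three physical blocks; plain residual MLPs stand in for transformer blocks.
weights = [rng.normal(0, 0.02, (d, d)) for _ in range(layers)]

def block(h, W):
    return h + np.tanh(h @ W)  # toy residual block

h = rng.normal(size=(4, d))
applied = 0
# Depth recurrence: reuse the same 3 blocks num_loops times, i.e.
# layers * num_loops = 6 virtual layers at the parameter cost of 3.
for _ in range(num_loops):
    for W in weights:
        h = block(h, W)
        applied += 1

print(h.shape, applied)  # (4, 64) 6
```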
U-Net skip connections
Sigmoid-gated skip connections between encoder and decoder halves.
parameters: null
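A rough sketch of sigmoid-gated U-Net skips between encoder and decoder halves; since parameters are null above, the gate granularity (one learned scalar per skip) and the toy residual blocks are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_blocks = 64, 6  # encoder half = blocks 0-2, decoder half = 3-5

def block(h, W):
    return h + np.tanh(h @ W)  # toy residual block

Ws = [rng.normal(0, 0.02, (d, d)) for _ in range(n_blocks)]
gates = rng.normal(size=n_blocks // 2)  # one learned scalar gate per skip (assumed)

h = rng.normal(size=(4, d))
saved = []
for i in range(n_blocks // 2):            # encoder half: save activations
    h = block(h, Ws[i])
    saved.append(h)
for i in range(n_blocks // 2, n_blocks):  # decoder half: gated skip from the
    skip = saved.pop()                    # mirror block (i <-> n_blocks-1-i)
    g = 1.0 / (1.0 + np.exp(-gates[i - n_blocks // 2]))  # sigmoid gate
    h = block(h + g * skip, Ws[i])

print(h.shape)  # (4, 64)
```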
LeakyReLU
Uses squared LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
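One plausible reading of the squared-LeakyReLU activation with slope 0.5 is an elementwise square applied after LeakyReLU (cf. the squared ReLU used in some speedrun baselines); the exact formulation is not given above, so this is an assumption:

```python
import numpy as np

def squared_leaky_relu(x, slope=0.5):
    # LeakyReLU, then square elementwise. Note the square makes the
    # output nonnegative while keeping the slope-scaled magnitude
    # on the negative side.
    y = np.where(x >= 0, x, slope * x)
    return y * y

print(squared_leaky_relu(np.array([-2.0, 0.0, 2.0])))  # [1. 0. 4.]
```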
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":"16/64"}
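Partial RoPE with the listed 16/64 split might look like the sketch below: only the first 16 head dimensions are rotated and the remaining 48 pass through unchanged. The half-split pairing convention and the base frequency are assumptions:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    # x: [seq, head_dim]; rotate only the first rot_dims dimensions
    # and leave the remaining head_dim - rot_dims untouched ("16/64").
    seq, head_dim = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    pos = np.arange(seq)[:, None]            # [seq, 1]
    angles = pos * inv_freq[None, :]         # [seq, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]  # paired dims (i, i+half)
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

x = np.random.default_rng(0).normal(size=(5, 64))
y = partial_rope(x)
print(y.shape)                            # (5, 64)
print(np.allclose(y[:, 16:], x[:, 16:]))  # True: tail dims pass through
```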
GQA
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
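A minimal GQA sketch for the listed 8 query heads and 4 KV heads, where each KV head is shared by 2 query heads; causal masking is omitted for brevity:

```python
import numpy as np

def gqa_attention(q, k, v, kv_heads=4):
    # q: [heads, seq, hd]; k, v: [kv_heads, seq, hd].
    # Each KV head serves heads // kv_heads query heads.
    heads, seq, hd = q.shape
    group = heads // kv_heads
    k = np.repeat(k, group, axis=0)  # broadcast each KV head across its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)
    w = np.exp(scores - scores.max(-1, keepdims=True))  # stable softmax
    w /= w.sum(-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 64))
k = rng.normal(size=(4, 5, 64))
v = rng.normal(size=(4, 5, 64))
out = gqa_attention(q, k, v)
print(out.shape)  # (8, 5, 64)
```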
parallel residual routing
Two-lane parallel residual routing extended to later blocks.
parameters: {"parallel_start":7}
Regularization
logit softcap
parameters: {"value":30}
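Logit softcapping with value 30 is typically a tanh squash (as in, e.g., Gemma-style soft-capping); a sketch under that assumption:

```python
import numpy as np

def softcap(logits, cap=30.0):
    # Smoothly bounds logits into (-cap, cap); cap=30 matches the
    # "value: 30" parameter above. Small logits pass through almost
    # unchanged, large ones saturate.
    return cap * np.tanh(logits / cap)

x = np.array([1.0, 100.0, -500.0])
print(softcap(x))  # ~[1.00, 29.92, -30.00]
```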
layerwise LN scale
parameters: null
Quantization
GPTQ
bits: 6
scope: all
Compression
brotli
level: null
Evaluation
sliding window eval
parameters: null
Test-Time Training
score-first TTT
parameters: {"freeze_blocks":2}
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: null

Novel Contributions

  • Weight analysis showing a symmetric V-shaped capacity profile with weak boundary blocks
  • Identification of tied embeddings as a bottleneck for first and last transformer blocks
  • Embedding decoupling experiments that activate boundary blocks via learned projection layers
  • Observation that the learned projections are near-orthogonal rotations rather than true dimensional compression
  • Analysis of rate-distortion tradeoffs showing tighter clipping can worsen artifact size
  • Proposal of residual low-rank projection as a cheaper basis-rotation alternative
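The near-orthogonality observation above can be checked with a simple scale-invariant diagnostic: for a wide map, orthonormal rows mean W Wᵀ ≈ c·I (a basis rotation/embedding), while a rank-deficient map collapses directions. The function name and test matrices below are illustrative, not from the submission:

```python
import numpy as np

def orthogonality_error(W):
    # W: [d_in, d_out] with d_in <= d_out. Returns a scale-invariant
    # deviation of W @ W.T from the identity; near zero means the map
    # is close to an orthogonal rotation rather than a compression.
    d_in = W.shape[0]
    G = W @ W.T
    G = G / np.mean(np.diag(G))  # ignore a global scale factor
    return np.linalg.norm(G - np.eye(d_in)) / np.sqrt(d_in)

rng = np.random.default_rng(0)
# An exactly orthogonal 416x512 map (orthonormal rows via QR) ...
Q, _ = np.linalg.qr(rng.normal(size=(512, 416)))
W_rotation = Q.T
# ... versus a rank-8 map that genuinely compresses dimensions.
W_compress = rng.normal(size=(416, 8)) @ rng.normal(size=(8, 512))

print(orthogonality_error(W_rotation))        # ~0
print(orthogonality_error(W_compress) > 1.0)  # True
```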