PR #1589

open

Non-record: 30 experiments across 13 architectures (MLA, Pause Tokens, Eigenweight, 9 exotic ideas)

by nnm2602
val_bpb
1.3223
Architecture
Transformer
Architecture
MLA
Compresses K and V projections through a shared latent bottleneck while keeping Q and output projections full-rank.
parameters: {"latent_dim":128,"kv_compression_ratio":2}
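The latent-bottleneck idea can be sketched in a few lines of NumPy (single head, no RoPE; shapes, seed, and init scale are assumptions, not the submission's code):

```python
import numpy as np

# MLA-style KV compression sketch: K and V are produced from a shared
# 128-dim latent, while Q keeps its full-rank projection.
d_model, latent_dim, seq = 512, 128, 16
rng = np.random.default_rng(0)

W_q   = rng.standard_normal((d_model, d_model)) * 0.02     # full-rank Q
W_dkv = rng.standard_normal((d_model, latent_dim)) * 0.02  # shared down-projection
W_uk  = rng.standard_normal((latent_dim, d_model)) * 0.02  # latent -> K
W_uv  = rng.standard_normal((latent_dim, d_model)) * 0.02  # latent -> V

x = rng.standard_normal((seq, d_model))
q = x @ W_q
c = x @ W_dkv                  # KV cache stores only these 128-dim latents
k, v = c @ W_uk, c @ W_uv      # reconstructed on the fly

scores = (q @ k.T) / np.sqrt(d_model)
scores = np.where(np.tri(seq, dtype=bool), scores, -np.inf)  # causal mask
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ v                 # (seq, d_model)
```

The cache holds `latent_dim` floats per token instead of `2 * d_model`, which is where the compression ratio comes from.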
depth recurrence
Reuses layers multiple times to increase effective depth.
parameters: {"layers":[3,4],"repeats":2}
depth recurrence
Reuses layers multiple times to increase effective depth.
parameters: {"layers":[3,4,5],"repeats":2}
depth recurrence
Reuses layers multiple times to increase effective depth.
parameters: {"layers":[3,4],"repeats":3}
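One way to read the `{layers, repeats}` parameters (the exact replay semantics are an assumption) is as an unrolled execution order over layer indices:

```python
# Depth-recurrence sketch: after finishing the recurrent block,
# replay it (repeats - 1) extra times before continuing upward.
def unroll(n_layers, recur_layers, repeats):
    order = []
    for i in range(n_layers):
        order.append(i)
        if i == recur_layers[-1]:
            for _ in range(repeats - 1):
                order.extend(recur_layers)
    return order

print(unroll(6, [3, 4], 2))  # → [0, 1, 2, 3, 4, 3, 4, 5]
```

Effective depth grows with `repeats` at zero parameter cost, but every replay also costs a forward pass, which matters under a fixed wallclock budget.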
pause tokens
Inserts learned dummy tokens periodically to provide scratch space for computation.
parameters: {"num_pause":4,"interval":64}
pause tokens
Inserts learned dummy tokens periodically to provide scratch space for computation.
parameters: {"num_pause":8,"interval":32}
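A minimal sketch of the insertion schedule, assuming pause tokens are appended after every `interval` real tokens (`pause_id` is a hypothetical reserved id):

```python
# Pause-token insertion sketch: num_pause learned dummy tokens are
# inserted after every `interval` real tokens as scratch space.
def insert_pause_tokens(token_ids, pause_id, num_pause, interval):
    out = []
    for i, t in enumerate(token_ids, start=1):
        out.append(t)
        if i % interval == 0:
            out.extend([pause_id] * num_pause)
    return out

print(insert_pause_tokens(list(range(8)), pause_id=-1, num_pause=2, interval=4))
# → [0, 1, 2, 3, -1, -1, 4, 5, 6, 7, -1, -1]
```

The parameter cost is just `num_pause` extra embedding rows, but each inserted token lengthens the sequence the model must process.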
Eigenweight
Uses low-rank SVD-style factorization for model weights.
parameters: {"rank":64}
Eigenweight
Uses low-rank SVD-style factorization for model weights.
parameters: {"rank":128}
Eigenweight
Uses low-rank SVD-style factorization for model weights.
parameters: {"rank":256}
Eigenweight
Combines low-rank factorization with depth recurrence.
parameters: {"rank":128}
Eigenweight
Asymmetric rank allocation between attention and MLP layers.
parameters: {"attn_rank":32,"mlp_rank":96}
Eigenweight
Asymmetric rank allocation between attention and MLP layers.
parameters: {"attn_rank":96,"mlp_rank":32}
Eigenweight
Asymmetric rank allocation between attention and MLP layers.
parameters: {"attn_rank":128,"mlp_rank":32}
Eigenweight
Uses wider ambient dimension with low-rank factorization.
parameters: {"dimension":1024,"rank":64}
Eigenweight
Uses wider ambient dimension with low-rank factorization.
parameters: {"dimension":768,"rank":48}
Eigenweight
Uses wider ambient dimension with low-rank factorization.
parameters: {"dimension":1024,"rank":32}
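The rank sweep trades parameters for expressivity: each d×d weight is replaced by a d×r factor and an r×d factor. A minimal NumPy sketch of the forward pass and the parameter arithmetic (d = 512 is an assumed model width; the asymmetric variants simply pick different r for attention and MLP weights):

```python
import numpy as np

# Low-rank "Eigenweight" sketch: W ≈ U @ V with U (d x r) and V (r x d).
d, rank = 512, 64
rng = np.random.default_rng(0)
U = rng.standard_normal((d, rank)) * 0.02
V = rng.standard_normal((rank, d)) * 0.02

x = rng.standard_normal((4, d))
y = (x @ U) @ V                       # forward pass through the factored weight

full_params = d * d                   # 262144 for the dense weight
factored_params = 2 * d * rank        # 65536 at rank 64
print(factored_params / full_params)  # 0.25
```

The "wider ambient dimension" variants raise d while lowering r, spending the saved parameters on width instead of rank.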
Basis Sharing
Shares a cross-layer SVD basis across layers for each weight type.
parameters: {"rank":128}
Basis Sharing
Shares a cross-layer SVD basis across layers for each weight type.
parameters: {"rank":64}
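A sketch of the parameter accounting, assuming one shared d×r basis per weight type and per-layer r×d coefficient matrices (widths and depth are assumptions):

```python
import numpy as np

# Basis Sharing sketch: every layer's weight of one type is B @ C_l,
# where the d x r basis B is shared and only C_l is per-layer.
d, rank, n_layers = 512, 128, 8
rng = np.random.default_rng(0)
B = rng.standard_normal((d, rank)) * 0.02
coeffs = [rng.standard_normal((rank, d)) * 0.02 for _ in range(n_layers)]

per_layer_lowrank = n_layers * 2 * d * rank   # independent U, V per layer
shared = d * rank + n_layers * rank * d       # one B plus per-layer C_l
print(shared / per_layer_lowrank)             # 0.5625
```

Relative to plain per-layer low-rank factorization, sharing the basis amortizes the U-side cost across all layers.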
Universal Transformer + ACT
Uses a shared layer with adaptive computation time halting.
parameters: {"max_steps":12}
Universal Transformer + ACT
Uses a shared layer with adaptive computation time halting.
parameters: {"max_steps":20}
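The ACT halting rule can be sketched as a cumulative-probability loop: the shared layer is reapplied until the accumulated halting probability crosses 1 − ε or `max_steps` is reached (the ε value and the per-step probabilities here are assumptions):

```python
# ACT halting sketch for a Universal Transformer's shared layer.
def act_steps(halt_probs, max_steps=12, eps=0.01):
    cum = 0.0
    for step, p in enumerate(halt_probs[:max_steps], start=1):
        cum += p
        if cum >= 1.0 - eps:
            return step
    return max_steps

print(act_steps([0.1, 0.2, 0.4, 0.4]))  # → 4
```

In training, the halting probabilities are produced by a small learned head, and a ponder-cost term discourages always running to `max_steps`.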
Other
other
HyperNetwork-style weight generation for neurogenesis.
parameters: {"rank":32}
other
Hopfield energy-based language model.
parameters: {"embedding_dim":128}
other
Communicating agents with a learned message bottleneck.
parameters: {"message_dim":64}
other
Neural cellular automaton for weight or state evolution.
parameters: {"steps":50}
other
SIREN-based coordinate network for weight generation.
parameters: {"hidden_dim":256}
other
Seed model using random projection / generated weights.
parameters: {"dimension":2048}
other
Tensor network / matrix product state parameterization.
parameters: {"bond_dim":64}
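As one example from this group, the hypernetwork idea generates a target layer's weights from a small per-layer embedding rather than storing them directly. A toy sketch (all names, shapes, and the linear generator are assumptions):

```python
import numpy as np

# Hypernetwork sketch: a linear generator maps a per-layer embedding z
# to the flattened weights of a target layer.
d_embed, r, d_out = 32, 32, 256
rng = np.random.default_rng(0)
G = rng.standard_normal((d_embed, r * d_out)) * 0.02  # the generator
z = rng.standard_normal(d_embed)                       # layer embedding
W = (z @ G).reshape(r, d_out)                          # generated weight
h = rng.standard_normal((1, r)) @ W                    # used like a normal layer
```

Only `G` and the embeddings are trained; the generated `W` is recomputed each forward pass, which is cheap in parameters but costly in throughput.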
Regularization
logit softcap
parameters: {"value":30}
weight tying
parameters: null
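Both regularizers are simple to sketch: the softcap squashes logits into (−30, 30) with a scaled tanh while staying near-identity for small values, and weight tying reuses the token-embedding matrix (transposed) as the output head (toy shapes; assumptions throughout):

```python
import numpy as np

# Logit softcap: cap * tanh(logits / cap) bounds logits in (-cap, cap).
def softcap(logits, cap=30.0):
    return cap * np.tanh(logits / cap)

# Weight tying: the output head is the embedding matrix transposed,
# so the vocab projection adds no extra parameters (toy shapes, assumed).
vocab, d = 1000, 64
emb = np.random.default_rng(0).standard_normal((vocab, d)) * 0.02
hidden = np.zeros((1, d))          # stand-in for a final hidden state
logits = softcap(hidden @ emb.T)   # tied head + softcap
```

Softcapping keeps extreme logits from dominating the loss; tying removes a vocab-sized weight matrix, which is significant at small model scales.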
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • Systematic comparison of 30 experiments across 13 architectural ideas under a strict 1x H100, 10-minute budget.
  • MLA-style KV compression with latent dimension 128 reaching 1.3223 BPB, close to the baseline.
  • Pause tokens as learned dummy tokens that improve matched-step learning quality with minimal parameter cost.
  • Low-rank Eigenweight rank sweep showing a clear rank-vs-BPB Pareto curve.
  • Evidence that MLP rank matters more than attention rank in asymmetric low-rank factorization.
  • Depth recurrence results showing that less recurrence can be better on a single GPU due to throughput constraints.
  • Negative results for several exotic architectures including SIREN weight generation, NCA, Hopfield energy models, hypernetworks, tensor MPS, and communicating agents.
  • Meta-insight that step throughput dominates final BPB under tight wallclock constraints.