PR #1589

open

Non-record: 30 experiments across 13 architectures (MLA, Pause Tokens, Eigenweight, 9 exotic ideas)

by nnm2602
val_bpb
1.3223
Architecture
Transformer
Architecture
MLA
Compresses K and V projections through a shared latent bottleneck while keeping Q and output projections full-rank.
parameters: {"latent_dim":128,"kv_compression_ratio":2}
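The latent-bottleneck idea can be sketched in a few lines of NumPy (single head, no RoPE; shapes, seed, and init scale are assumptions, not the submission's code):

```python
import numpy as np

# MLA-style KV compression sketch: K and V are produced from a shared
# 128-dim latent, while Q keeps its full-rank projection.
d_model, latent_dim, seq = 512, 128, 16
rng = np.random.default_rng(0)

W_q   = rng.standard_normal((d_model, d_model)) * 0.02     # full-rank Q
W_dkv = rng.standard_normal((d_model, latent_dim)) * 0.02  # shared down-projection
W_uk  = rng.standard_normal((latent_dim, d_model)) * 0.02  # latent -> K
W_uv  = rng.standard_normal((latent_dim, d_model)) * 0.02  # latent -> V

x = rng.standard_normal((seq, d_model))
q = x @ W_q
c = x @ W_dkv                  # KV cache stores only these 128-dim latents
k, v = c @ W_uk, c @ W_uv      # reconstructed on the fly

scores = (q @ k.T) / np.sqrt(d_model)
scores = np.where(np.tri(seq, dtype=bool), scores, -np.inf)  # causal mask
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ v                 # (seq, d_model)
```

The cache holds `latent_dim` floats per token instead of `2 * d_model`, which is where the compression ratio comes from.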
depth recurrence
Reuses layers multiple times to increase effective depth.
parameters: {"layers":[3,4],"repeats":2}
depth recurrence
Reuses layers multiple times to increase effective depth.
parameters: {"layers":[3,4,5],"repeats":2}
depth recurrence
Reuses layers multiple times to increase effective depth.
parameters: {"layers":[3,4],"repeats":3}
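One way to read the `{layers, repeats}` parameters (the exact replay semantics are an assumption) is as an unrolled execution order over layer indices:

```python
# Depth-recurrence sketch: after finishing the recurrent block,
# replay it (repeats - 1) extra times before continuing upward.
def unroll(n_layers, recur_layers, repeats):
    order = []
    for i in range(n_layers):
        order.append(i)
        if i == recur_layers[-1]:
            for _ in range(repeats - 1):
                order.extend(recur_layers)
    return order

print(unroll(6, [3, 4], 2))  # → [0, 1, 2, 3, 4, 3, 4, 5]
```

Effective depth grows with `repeats` at zero parameter cost, but every replay also costs a forward pass, which matters under a fixed wallclock budget.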
pause tokens
Inserts learned dummy tokens periodically to provide scratch space for computation.
parameters: {"num_pause":4,"interval":64}
pause tokens
Inserts learned dummy tokens periodically to provide scratch space for computation.
parameters: {"num_pause":8,"interval":32}
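A minimal sketch of the insertion schedule, assuming pause tokens are appended after every `interval` real tokens (`pause_id` is a hypothetical reserved id):

```python
# Pause-token insertion sketch: num_pause learned dummy tokens are
# inserted after every `interval` real tokens as scratch space.
def insert_pause_tokens(token_ids, pause_id, num_pause, interval):
    out = []
    for i, t in enumerate(token_ids, start=1):
        out.append(t)
        if i % interval == 0:
            out.extend([pause_id] * num_pause)
    return out

print(insert_pause_tokens(list(range(8)), pause_id=-1, num_pause=2, interval=4))
# → [0, 1, 2, 3, -1, -1, 4, 5, 6, 7, -1, -1]
```

The parameter cost is just `num_pause` extra embedding rows, but each inserted token lengthens the sequence the model must process.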
Eigenweight
Uses low-rank SVD-style factorization for model weights.
parameters: {"rank":64}
Eigenweight
Uses low-rank SVD-style factorization for model weights.
parameters: {"rank":128}
Eigenweight
Uses low-rank SVD-style factorization for model weights.
parameters: {"rank":256}
Eigenweight
Combines low-rank factorization with depth recurrence.
parameters: {"rank":128}
Eigenweight
Asymmetric rank allocation between attention and MLP layers.
parameters: {"attn_rank":32,"mlp_rank":96}
Eigenweight
Asymmetric rank allocation between attention and MLP layers.
parameters: {"attn_rank":96,"mlp_rank":32}
Eigenweight
Asymmetric rank allocation between attention and MLP layers.
parameters: {"attn_rank":128,"mlp_rank":32}
Eigenweight
Uses wider ambient dimension with low-rank factorization.
parameters: {"dimension":1024,"rank":64}
Eigenweight
Uses wider ambient dimension with low-rank factorization.
parameters: {"dimension":768,"rank":48}
Eigenweight
Uses wider ambient dimension with low-rank factorization.
parameters: {"dimension":1024,"rank":32}
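The rank sweep trades parameters for expressivity: each d×d weight is replaced by a d×r factor and an r×d factor. A minimal NumPy sketch of the forward pass and the parameter arithmetic (d = 512 is an assumed model width; the asymmetric variants simply pick different r for attention and MLP weights):

```python
import numpy as np

# Low-rank "Eigenweight" sketch: W ≈ U @ V with U (d x r) and V (r x d).
d, rank = 512, 64
rng = np.random.default_rng(0)
U = rng.standard_normal((d, rank)) * 0.02
V = rng.standard_normal((rank, d)) * 0.02

x = rng.standard_normal((4, d))
y = (x @ U) @ V                       # forward pass through the factored weight

full_params = d * d                   # 262144 for the dense weight
factored_params = 2 * d * rank        # 65536 at rank 64
print(factored_params / full_params)  # 0.25
```

The "wider ambient dimension" variants raise d while lowering r, spending the saved parameters on width instead of rank.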
Basis Sharing
Shares a cross-layer SVD basis across layers for each weight type.
parameters: {"rank":128}
Basis Sharing
Shares a cross-layer SVD basis across layers for each weight type.
parameters: {"rank":64}
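A sketch of the parameter accounting, assuming one shared d×r basis per weight type and per-layer r×d coefficient matrices (widths and depth are assumptions):

```python
import numpy as np

# Basis Sharing sketch: every layer's weight of one type is B @ C_l,
# where the d x r basis B is shared and only C_l is per-layer.
d, rank, n_layers = 512, 128, 8
rng = np.random.default_rng(0)
B = rng.standard_normal((d, rank)) * 0.02
coeffs = [rng.standard_normal((rank, d)) * 0.02 for _ in range(n_layers)]

per_layer_lowrank = n_layers * 2 * d * rank   # independent U, V per layer
shared = d * rank + n_layers * rank * d       # one B plus per-layer C_l
print(shared / per_layer_lowrank)             # 0.5625
```

Relative to plain per-layer low-rank factorization, sharing the basis amortizes the U-side cost across all layers.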
Universal Transformer + ACT
Uses a shared layer with adaptive computation time halting.
parameters: {"max_steps":12}
Universal Transformer + ACT
Uses a shared layer with adaptive computation time halting.
parameters: {"max_steps":20}
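The ACT halting rule can be sketched as a cumulative-probability loop: the shared layer is reapplied until the accumulated halting probability crosses 1 − ε or `max_steps` is reached (the ε value and the per-step probabilities here are assumptions):

```python
# ACT halting sketch for a Universal Transformer's shared layer.
def act_steps(halt_probs, max_steps=12, eps=0.01):
    cum = 0.0
    for step, p in enumerate(halt_probs[:max_steps], start=1):
        cum += p
        if cum >= 1.0 - eps:
            return step
    return max_steps

print(act_steps([0.1, 0.2, 0.4, 0.4]))  # → 4
```

In training, the halting probabilities are produced by a small learned head, and a ponder-cost term discourages always running to `max_steps`.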
Other
other
HyperNetwork-style weight generation for neurogenesis.
parameters: {"rank":32}
other
Hopfield energy-based language model.
parameters: {"embedding_dim":128}
other
Communicating agents with a learned message bottleneck.
parameters: {"message_dim":64}
other
Neural cellular automaton for weight or state evolution.
parameters: {"steps":50}
other
SIREN-based coordinate network for weight generation.
parameters: {"hidden_dim":256}
other
Seed model using random projection / generated weights.
parameters: {"dimension":2048}
other
Tensor network / matrix product state parameterization.
parameters: {"bond_dim":64}
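As one example from this group, the hypernetwork idea generates a target layer's weights from a small per-layer embedding rather than storing them directly. A toy sketch (all names, shapes, and the linear generator are assumptions):

```python
import numpy as np

# Hypernetwork sketch: a linear generator maps a per-layer embedding z
# to the flattened weights of a target layer.
d_embed, r, d_out = 32, 32, 256
rng = np.random.default_rng(0)
G = rng.standard_normal((d_embed, r * d_out)) * 0.02  # the generator
z = rng.standard_normal(d_embed)                       # layer embedding
W = (z @ G).reshape(r, d_out)                          # generated weight
h = rng.standard_normal((1, r)) @ W                    # used like a normal layer
```

Only `G` and the embeddings are trained; the generated `W` is recomputed each forward pass, which is cheap in parameters but costly in throughput.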
Regularization
logit softcap
parameters: {"value":30}
weight tying
parameters: null
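Both regularizers are simple to sketch: the softcap squashes logits into (−30, 30) with a scaled tanh while staying near-identity for small values, and weight tying reuses the token-embedding matrix (transposed) as the output head (toy shapes; assumptions throughout):

```python
import numpy as np

# Logit softcap: cap * tanh(logits / cap) bounds logits in (-cap, cap).
def softcap(logits, cap=30.0):
    return cap * np.tanh(logits / cap)

# Weight tying: the output head is the embedding matrix transposed,
# so the vocab projection adds no extra parameters (toy shapes, assumed).
vocab, d = 1000, 64
emb = np.random.default_rng(0).standard_normal((vocab, d)) * 0.02
hidden = np.zeros((1, d))          # stand-in for a final hidden state
logits = softcap(hidden @ emb.T)   # tied head + softcap
```

Softcapping keeps extreme logits from dominating the loss; tying removes a vocab-sized weight matrix, which is significant at small model scales.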
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • Systematic comparison of 30 experiments across 13 architectural ideas under a strict 1x H100, 10-minute budget.
  • MLA-style KV compression with latent dimension 128 reaching 1.3223 BPB, close to the baseline.
  • Pause tokens as learned dummy tokens that improve matched-step learning quality with minimal parameter cost.
  • Low-rank Eigenweight rank sweep showing a clear rank-vs-BPB Pareto curve.
  • Evidence that MLP rank matters more than attention rank in asymmetric low-rank factorization.
  • Depth recurrence results showing that less recurrence can be better on a single GPU due to throughput constraints.
  • Negative results for several exotic architectures including SIREN weight generation, NCA, Hopfield energy models, hypernetworks, tensor MPS, and communicating agents.
  • Meta-insight that step throughput dominates final BPB under tight wallclock constraints.