PR #874

open

Non-record: Random Linear Maps + Learned Adapters (val_bpb=1.607, 1.92MB artifact)

by fielding
val_bpb: 1.6070
Architecture: Transformer
Optimizer: Muon
Artifact Size: 1.92MB

Training Techniques

Architecture
RandomLinearWithAdapter
Uses fixed-seed random base weights for linear layers, with learned low-rank adapters added on top; base weights are regenerated at load time and not stored in the artifact.
parameters: {"rank":16}
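A minimal NumPy sketch of what a RandomLinearWithAdapter layer could look like, based on the description above. The class name, the fixed-seed regeneration, and rank=16 come from the PR; the dimensions, init scale, and zero-initialized adapter side are assumptions for illustration:

```python
import numpy as np

class RandomLinearWithAdapter:
    """Linear layer whose dense base weight is regenerated from a fixed
    seed at load time; only the low-rank adapter (A, B) is trained and
    shipped in the artifact."""

    def __init__(self, d_in, d_out, rank=16, seed=0):
        # Base weight: deterministic given the seed, never stored on disk.
        rng = np.random.default_rng(seed)
        self.w_base = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
        # Learned low-rank adapter; A starts at zero so the initial
        # low-rank delta is zero (an assumed init choice, LoRA-style).
        self.A = np.zeros((d_in, rank))
        self.B = rng.standard_normal((rank, d_out)) / np.sqrt(rank)

    def forward(self, x):
        # Effective weight = frozen random base + learned low-rank delta.
        return x @ self.w_base + (x @ self.A) @ self.B

    def state_dict(self):
        # Only the adapter is saved; the base is recreated from the seed
        # when the checkpoint is loaded, so it adds nothing to the artifact.
        return {"A": self.A, "B": self.B}
```

Because the base is a pure function of the seed, two loads of the same checkpoint reconstruct bit-identical base weights.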
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
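A small NumPy sketch of grouped-query attention with the listed head counts (8 query heads sharing 4 KV heads, so each KV head serves 2 query heads). The causal masking and head dimension here are illustrative assumptions, not details from the PR:

```python
import numpy as np

def gqa_attention(q, k, v, heads=8, kv_heads=4):
    """Grouped-query attention: `heads` query heads share `kv_heads`
    key/value heads.

    Shapes: q is (T, heads, d); k and v are (T, kv_heads, d)."""
    group = heads // kv_heads
    # Expand each KV head to cover its group of query heads.
    k = np.repeat(k, group, axis=1)  # (T, heads, d)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(q.shape[-1])
    # Causal mask: position q may only attend to positions <= q.
    T = q.shape[0]
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    # Softmax over key positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,khd->qhd", weights, v)
```

The KV cache only needs kv_heads entries per position, which is the usual motivation for GQA.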
MLP3x
Uses a 3x MLP expansion with relu-squared activation.
parameters: {"mlp_multiplier":3}
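The MLP block above can be sketched in a few lines of NumPy; the 3x expansion and relu-squared activation are from the PR, while the model width and weight init are placeholder assumptions:

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    """Position-wise MLP: expand, apply relu-squared, project back."""
    h = np.maximum(x @ w_in, 0.0) ** 2  # relu(x)^2 activation
    return h @ w_out

# Hypothetical width; the PR does not state d_model here.
d = 16
rng = np.random.default_rng(0)
w_in = rng.standard_normal((d, 3 * d)) / np.sqrt(d)       # 3x expansion
w_out = rng.standard_normal((3 * d, d)) / np.sqrt(3 * d)  # back to d
```

relu-squared keeps the cheap sparsity of ReLU while giving a smoother, faster-growing positive branch.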
Weight Averaging
EMA
parameters: {"decay":0.997}
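EMA weight averaging with decay 0.997 amounts to one update rule; a minimal sketch (dict-of-arrays parameter layout is an assumption):

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step per parameter: shadow <- decay * shadow
    + (1 - decay) * current. Evaluation then uses the shadow weights."""
    return {name: decay * ema_params[name] + (1.0 - decay) * params[name]
            for name in params}
```

With decay 0.997 the shadow weights average over roughly the last 1/(1-0.997) ≈ 333 optimizer steps.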
Evaluation
sliding window eval
parameters: {"window_size":64}
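A sketch of sliding-window evaluation with the listed window_size of 64: each token is scored given at most the previous 64 tokens. The `nll` scoring callback and the one-token-per-byte conversion are simplifying assumptions for illustration:

```python
import math

def sliding_window_bpb(tokens, nll, window_size=64):
    """Score each token given at most `window_size` tokens of context,
    then convert mean negative log-likelihood (nats) to bits per byte,
    assuming one token per byte in this sketch."""
    total_nll = 0.0
    for i, tok in enumerate(tokens):
        context = tokens[max(0, i - window_size):i]  # truncated context
        total_nll += nll(context, tok)
    return total_nll / (len(tokens) * math.log(2))
```

A uniform model over 256 byte values scores exactly 8 bits per byte under this definition, which is a handy sanity check.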
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adamw":true}
AdamW
weight_decay: null
momentum: null
other_params: null
Sequence Length
sequence_length
train_length: null
eval_length: 64

Novel Contributions

  • Random base linear projections regenerated from a fixed seed so they do not count toward artifact size
  • Learned low-rank adapters on top of random linear maps
  • Demonstration that a mostly-random-weight Transformer can still achieve competitive language modeling performance
  • Depth sweep showing a 4-5 layer sweet spot under a fixed training-time budget
  • Rank sweep showing smaller adapters can outperform larger ones under a fixed compute budget
  • Sliding-window evaluation improves reported BPB over standard evaluation in the long run
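The artifact-size accounting behind the first two bullets is simple arithmetic: per linear layer, only the rank-r adapter matrices ship in the artifact, while the dense base costs nothing because it is regenerated from its seed. A back-of-envelope helper (the d_model=512 example is hypothetical; only rank=16 comes from the PR):

```python
def adapter_bytes(d_in, d_out, rank, bytes_per_param=4):
    """Bytes stored vs. avoided per linear layer when only the adapter
    (A: d_in x rank, B: rank x d_out) is saved and the dense base
    (d_in x d_out) is regenerated from a fixed seed at load time."""
    stored = rank * (d_in + d_out) * bytes_per_param   # adapter only
    avoided = d_in * d_out * bytes_per_param           # dense base
    return stored, avoided

# Hypothetical 512x512 linear layer with the PR's rank-16 adapter:
stored, avoided = adapter_bytes(512, 512, 16)
```

For that hypothetical layer the adapter is 64 KiB against a 1 MiB dense base, a 16x reduction per layer, which is how a full Transformer can fit in a ~2MB artifact.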