PR #874

open

Non-record: Random Linear Maps + Learned Adapters (val_bpb=1.607, 1.92MB artifact)

by fielding
val_bpb: 1.6070
Architecture: Transformer
Optimizer: Muon
Artifact Size: 1.92MB

Training Techniques

Architecture
RandomLinearWithAdapter
Uses fixed-seed random base weights for linear layers, with learned low-rank adapters added on top; base weights are regenerated at load time and not stored in the artifact.
parameters: {"rank":16}
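A minimal NumPy sketch of what a RandomLinearWithAdapter layer could look like, based on the description above. The class name, the fixed-seed regeneration, and rank=16 come from the PR; the dimensions, init scale, and zero-initialized adapter side are assumptions for illustration:

```python
import numpy as np

class RandomLinearWithAdapter:
    """Linear layer whose dense base weight is regenerated from a fixed
    seed at load time; only the low-rank adapter (A, B) is trained and
    shipped in the artifact."""

    def __init__(self, d_in, d_out, rank=16, seed=0):
        # Base weight: deterministic given the seed, never stored on disk.
        rng = np.random.default_rng(seed)
        self.w_base = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
        # Learned low-rank adapter; A starts at zero so the initial
        # low-rank delta is zero (an assumed init choice, LoRA-style).
        self.A = np.zeros((d_in, rank))
        self.B = rng.standard_normal((rank, d_out)) / np.sqrt(rank)

    def forward(self, x):
        # Effective weight = frozen random base + learned low-rank delta.
        return x @ self.w_base + (x @ self.A) @ self.B

    def state_dict(self):
        # Only the adapter is saved; the base is recreated from the seed
        # when the checkpoint is loaded, so it adds nothing to the artifact.
        return {"A": self.A, "B": self.B}
```

Because the base is a pure function of the seed, two loads of the same checkpoint reconstruct bit-identical base weights.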
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
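A small NumPy sketch of grouped-query attention with the listed head counts (8 query heads sharing 4 KV heads, so each KV head serves 2 query heads). The causal masking and head dimension here are illustrative assumptions, not details from the PR:

```python
import numpy as np

def gqa_attention(q, k, v, heads=8, kv_heads=4):
    """Grouped-query attention: `heads` query heads share `kv_heads`
    key/value heads.

    Shapes: q is (T, heads, d); k and v are (T, kv_heads, d)."""
    group = heads // kv_heads
    # Expand each KV head to cover its group of query heads.
    k = np.repeat(k, group, axis=1)  # (T, heads, d)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(q.shape[-1])
    # Causal mask: position q may only attend to positions <= q.
    T = q.shape[0]
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    # Softmax over key positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,khd->qhd", weights, v)
```

The KV cache only needs kv_heads entries per position, which is the usual motivation for GQA.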
MLP3x
Uses a 3x MLP expansion with relu-squared activation.
parameters: {"mlp_multiplier":3}
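The MLP block above can be sketched in a few lines of NumPy; the 3x expansion and relu-squared activation are from the PR, while the model width and weight init are placeholder assumptions:

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    """Position-wise MLP: expand, apply relu-squared, project back."""
    h = np.maximum(x @ w_in, 0.0) ** 2  # relu(x)^2 activation
    return h @ w_out

# Hypothetical width; the PR does not state d_model here.
d = 16
rng = np.random.default_rng(0)
w_in = rng.standard_normal((d, 3 * d)) / np.sqrt(d)       # 3x expansion
w_out = rng.standard_normal((3 * d, d)) / np.sqrt(3 * d)  # back to d
```

relu-squared keeps the cheap sparsity of ReLU while giving a smoother, faster-growing positive branch.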
Weight Averaging
EMA
parameters: {"decay":0.997}
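EMA weight averaging with decay 0.997 amounts to one update rule; a minimal sketch (dict-of-arrays parameter layout is an assumption):

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step per parameter: shadow <- decay * shadow
    + (1 - decay) * current. Evaluation then uses the shadow weights."""
    return {name: decay * ema_params[name] + (1.0 - decay) * params[name]
            for name in params}
```

With decay 0.997 the shadow weights average over roughly the last 1/(1-0.997) ≈ 333 optimizer steps.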
Evaluation
sliding window eval
parameters: {"window_size":64}
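A sketch of sliding-window evaluation with the listed window_size of 64: each token is scored given at most the previous 64 tokens. The `nll` scoring callback and the one-token-per-byte conversion are simplifying assumptions for illustration:

```python
import math

def sliding_window_bpb(tokens, nll, window_size=64):
    """Score each token given at most `window_size` tokens of context,
    then convert mean negative log-likelihood (nats) to bits per byte,
    assuming one token per byte in this sketch."""
    total_nll = 0.0
    for i, tok in enumerate(tokens):
        context = tokens[max(0, i - window_size):i]  # truncated context
        total_nll += nll(context, tok)
    return total_nll / (len(tokens) * math.log(2))
```

A uniform model over 256 byte values scores exactly 8 bits per byte under this definition, which is a handy sanity check.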
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adamw":true}
AdamW
weight_decay: null
momentum: null
other_params: null
Sequence Length
sequence_length
train_length: null
eval_length: 64

Novel Contributions

  • Random base linear projections regenerated from a fixed seed so they do not count toward artifact size
  • Learned low-rank adapters on top of random linear maps
  • Demonstration that a mostly-random-weight Transformer can still achieve competitive language modeling performance
  • Depth sweep showing a 4-5 layer sweet spot under a fixed training-time budget
  • Rank sweep showing smaller adapters can outperform larger ones under a fixed compute budget
  • Sliding-window evaluation improves reported BPB over standard evaluation in the long run
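The artifact-size accounting behind the first two bullets is simple arithmetic: per linear layer, only the rank-r adapter matrices ship in the artifact, while the dense base costs nothing because it is regenerated from its seed. A back-of-envelope helper (the d_model=512 example is hypothetical; only rank=16 comes from the PR):

```python
def adapter_bytes(d_in, d_out, rank, bytes_per_param=4):
    """Bytes stored vs. avoided per linear layer when only the adapter
    (A: d_in x rank, B: rank x d_out) is saved and the dense base
    (d_in x d_out) is regenerated from a fixed seed at load time."""
    stored = rank * (d_in + d_out) * bytes_per_param   # adapter only
    avoided = d_in * d_out * bytes_per_param           # dense base
    return stored, avoided

# Hypothetical 512x512 linear layer with the PR's rank-16 adapter:
stored, avoided = adapter_bytes(512, 512, 16)
```

For that hypothetical layer the adapter is 64 KiB against a 1 MiB dense base, a 16x reduction per layer, which is how a full Transformer can fit in a ~2MB artifact.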