PR #2058 (open)

Non-record submission: Learned Adapters on Random Linear Maps

by pranavxiyer
val_bpb: 1.1971
Architecture: Transformer
Optimizer:
Artifact Size: 15418110 bytes

Training Techniques

Architecture: MLP3x
Expanded MLP width to 3x and used learned adapters on random linear maps instead of storing full MLP matrices.
parameters: {"mlp_mult":3,"rank":160}
Weight tying
Tied the input and output embedding weights.
parameters: null
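
In PyTorch this is typically a one-line aliasing of the two parameters; a minimal sketch with illustrative sizes:

```python
import torch.nn as nn

vocab_size, d_model = 50257, 768  # illustrative sizes, not from the PR
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = embed.weight  # one matrix, stored and trained once
```
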
Quantization: mixed int6/int8
bits: 6
scope: block weights
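
One common way to realize this is symmetric per-tensor quantization parameterized by bit width; the per-block int6 vs. int8 assignment isn't specified in the PR, so this sketch is an assumption about the scheme rather than the submission's code:

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Symmetric per-tensor quantization to `bits` (6 or 8 here)."""
    qmax = 2 ** (bits - 1) - 1                        # 31 for int6, 127 for int8
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.round(w / scale).clamp(-qmax, qmax).to(torch.int8)
    return q, scale                                   # ship q plus one fp scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale
```

Note that int6 values still occupy int8 containers here; realizing the full size win requires bit packing, or letting the zstd stage below squeeze out the unused bits.
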
Compression: zstd
level: 22
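
Level 22 is zstd's maximum compression level. A sketch of the round trip using the `zstandard` package (the package choice is an assumption; the PR only names zstd):

```python
import io
import torch
import zstandard

def save_compressed(state_dict, path):
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    with open(path, "wb") as f:
        f.write(zstandard.ZstdCompressor(level=22).compress(buf.getvalue()))

def load_compressed(path):
    with open(path, "rb") as f:
        raw = zstandard.ZstdDecompressor().decompress(f.read())
    return torch.load(io.BytesIO(raw))
```
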
Evaluation: sliding window eval
parameters: {"stride":512}
Sequence Length
train_length: 2048
eval_length: null
Other
Used learned adapters with random linear maps generated from a seed, storing only low-rank adapter matrices to reduce artifact size.
parameters: {"layers":12}

Novel Contributions

  • Learned adapters on random linear maps to avoid storing full MLP matrices
  • Low-rank adapter matrices acting like LoRA for both MLP projections
  • Expanded to 12 transformer layers while staying within the artifact budget
  • Mixed int6/int8 compression on selected blocks
  • Sliding window evaluation with stride 512
  • Wider MLP (mlp_mult 3) paired with rank-160 adapters