PR #2058 (open)
Non-record submission: Learned Adapters on Random Linear Maps
by pranavxiyer

val_bpb: 1.1971
Architecture: Transformer
Optimizer: —
Artifact Size: 15,418,110
Training Techniques
Architecture: MLP3x
Expanded MLP width to 3x and used learned adapters on random linear maps instead of storing full MLP matrices.
parameters: {"mlp_mult":3,"rank":160}
weight tying
Used tied embeddings for the input/output embedding weights.
parameters: null
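
Weight tying is typically a single parameter assignment in PyTorch; a sketch with hypothetical module names:

```python
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Tie input and output embeddings: one matrix is stored instead of two.
        self.lm_head.weight = self.tok_emb.weight
```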
Quantization: mixed int6/int8
parameters: {"bits":6,"scope":"block weights"}
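
A sketch of symmetric round-to-nearest quantization with a per-tensor scale, assuming int6 applies to transformer block weights and int8 elsewhere (the selection policy here is an assumption):

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    # Symmetric per-tensor quantization: w ~= q * scale, q in [-qmax, qmax].
    qmax = 2 ** (bits - 1) - 1  # 31 for int6, 127 for int8
    scale = w.abs().max().clamp(min=1e-12) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    # 6-bit values sit in an int8 container here; a real artifact would bit-pack them.
    return q, scale.item()

def dequantize(q: torch.Tensor, scale: float) -> torch.Tensor:
    return q.float() * scale

def bits_for(name: str) -> int:
    # Hypothetical policy matching scope "block weights":
    # int6 inside transformer blocks, int8 for everything else.
    return 6 if name.startswith("blocks.") else 8
```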
Compression: zstd
parameters: {"level":22}
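
A sketch of the artifact compression step using the `zstandard` package (the save/load helpers are hypothetical):

```python
import io
import torch
import zstandard as zstd  # pip install zstandard

def save_compressed(state_dict: dict, path: str) -> None:
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    # Level 22 is zstd's maximum: slow to compress, still fast to decompress.
    with open(path, "wb") as f:
        f.write(zstd.ZstdCompressor(level=22).compress(buf.getvalue()))

def load_compressed(path: str) -> dict:
    with open(path, "rb") as f:
        raw = zstd.ZstdDecompressor().decompress(f.read())
    return torch.load(io.BytesIO(raw))
```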
Evaluation: sliding window eval
parameters: {"stride":512}
Sequence Length: sequence_length
parameters: {"train_length":2048,"eval_length":null}
Other
Used learned adapters with random linear maps generated from a seed, storing only low-rank adapter matrices to reduce artifact size.
parameters: {"layers":12}
Novel Contributions
- Learned adapters on random linear maps to avoid storing full MLP matrices
- Low-rank adapter matrices acting like LoRA for both MLP projections
- Expanded to 12 transformer layers while staying within the artifact budget
- Mixed int6/int8 compression on selected blocks
- Sliding window evaluation with stride 512
- Wider MLP (mlp_mult 3) paired with rank-160 adapters